Open getziadz opened 3 years ago
Hi, can you paste the output of this command in output folder:
ls -l
Seems like the cons.fa.sam
is empty.
Hi, the file is not empty:
-rw-rw-r-- 1 user 16G Nov 6 10:10 cons.fa.sam
the first line of the file also looks correct:
@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -ax splice -ub --MD --eqx -t 8 Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./cons.fa
There should be SQ
lines in the sam header. Can you paste all the header of cons.fa.sam
?
Also, not sure if the reference sequence is correct. Can you try fxtools lp Homo_sapiens.GRCh38.dna.toplevel.fa.gz
and paste the output?
Hi, Thank you for your responses.
Yes, you are right, it does not have SQ line. I am not sure what could be the reason though:
@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -ax splice -ub --MD --eqx -t 8 Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./cons.fa
de442b94-5914-48e7-be85-347fcab3600f_cons0 0 13 42168439 60 646S15=2D75=1D1=1D15=4D126N29=2D67=1D4=1I23=2I14=5241N67=1X17=3990N69=1I17=8828N3=1D1=1D55=1D38=1887N44=1D102=1D11=1X37=1I77=1093N22=1D12=1X28=2D39=1D17=259S * 0 0
CATATGAACTCAAATTGCCACCAAAAGCTTCCCTACT (and the sequence continues)
I also attached the output of the fxtools here. fxtools_output.txt
Thanks!
The problem is the reference genome you used. See https://github.com/lh3/minimap2/issues/58 and https://github.com/lh3/minimap2/issues/37 for more information. Changing to a non-toplevel/primary_assembly reference should work.
Yan
Hi Yan, Thank you for your help. The above suggestion worked and solved the problem of the SQ lines. The sam file now starts with multiple SQ lines (I am pasting an excerpt below), is this what we expected?:
@SQ SN:1 LN:248956422
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:2 LN:242193529
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:MT LN:16569
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415
@SQ SN:KI270728.1 LN:1872759
@SQ SN:KI270727.1 LN:448248
However, now there is a new error:
== 21:11:45-Nov-09-2021 == [classify_bam_core] Processing primary_ref/cons.fa.sam done.
== 21:11:45-Nov-09-2021 == [Classifying] Classifying consensus alignment done!
Traceback (most recent call last):
File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 370, in __init__
self.file = self._fasta_opener(filename, 'r+b'
File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 273, in open
return BgzfReader(filename, mode)
File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 584, in __init__
self._load_block(handle.tell())
File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 611, in _load_block
block_size, self._buffer = _load_bgzf_block(handle, self._text)
File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\x1f\x8b\x08\x00'; handle.tell() now says 4
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
main()
File "/home/user/anaconda3/bin/isocirc", line 216, in main
isocirc_core(args)
File "/home/user/anaconda3/bin/isocirc", line 132, in isocirc_core
hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa,
File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/hcBSJ_fullIso.py", line 761, in hcBSJ_fullIso
ref_fa, cons_fa = Fasta(ref), Fasta(cons)
File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 998, in __init__
self.faidx = Faidx(
File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 374, in __init__
raise UnsupportedCompressionFormat(
pyfaidx.UnsupportedCompressionFormat: Compressed FASTA is only supported in BGZF format. Use the samtools bgzip utility (instead of gzip) to compress your FASTA.
FYI, my fastq file is not zipped, it is just fastq, and I did not touch the reference genomes, they are zipped as they come from NCBI. Any suggestions for this error?
Thank you for your time on this.
pyfaidx.UnsupportedCompressionFormat: Compressed FASTA is only supported in BGZF format
The pyfaidx
only accepts BGZF compressed .gz fasta file.
I guess the reference genome you downloaded from NCBI is gziped'.
You can simply unzip it and re-run isocirc.
Hi, Thank you for the suggestion, it worked all the way through when I unzipped the genome files.
One last question I have is, in the stats.out file, although I used human RNA, it finds a lot of reads with BSJs, and total candidate BSJ, it reports that total known high confidence BSJs are 0. I used the CircBase’s human circRNA bed file. Why could that be? Shouldn’t some of the circRNAs detected be the known ones?
The circBase coordinates are hg19, did you convert them to hg38?
Hi Yan, Thank you for your suggestion, that was a great catch. I indeed had not convert it to the hg38. I was sure your suggestion was going to solve the problem but I converted it using LiftOver and reran isocirc, but I got the same results. Known BSJs are zero, and in the out all isoform IDs are “isocirc0”, “isocirc1”… etc. Do you have any other ideas of how I can fix it?
In addition to the liftover, the chromosome names should also be the same, like, "chr1" and "1" should be consistent.
I am getting the following error when running data with human genome and the gtf file with circbase bed file.
I am using the latest version of isoCirc. Any idea on how I can resolve the issue?