Xinglab / isoCirc

isoCirc
GNU General Public License v3.0
10 stars 4 forks source link

ValueError: file has no sequences defined (mode='r') #7

Open getziadz opened 2 years ago

getziadz commented 2 years ago

I am getting the following error when running data with human genome and the gtf file with circbase bed file.

== 10:33:39-Nov-06-2021 == [Mapping] Mapping consensus sequence to genome done!
== 10:33:39-Nov-06-2021 == [Classifying] Classifying consensus alignment ...
== 10:33:39-Nov-06-2021 == [classify_bam_core] Processing ./cons.fa.sam ... 
Traceback (most recent call last):
  File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
    main()
  File "/home/user/anaconda3/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/home/user/anaconda3/bin/isocirc", line 117, in isocirc_core
    bc.bam_classify(cons_all_sam, high_bam, low_bam, args.high_max_ratio, args.high_min_ratio, args.high_iden_ratio, args.high_repeat_ratio, args.low_repeat_ratio)
  File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/bam_classify.py", line 168, in bam_classify
    with ps.AlignmentFile(in_bam_fn) as in_bam, ps.AlignmentFile(high_bam_fn, 'wb', template=in_bam) as high_bam, \
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False

I am using the latest version of isoCirc. Any idea on how I can resolve the issue?

yangao07 commented 2 years ago

Hi, can you paste the output of this command in output folder:

ls -l

Seems like the cons.fa.sam is empty.

getziadz commented 2 years ago

Hi, the file is not empty:

-rw-rw-r--  1 user  16G Nov  6 10:10 cons.fa.sam

the first line of the file also looks correct:

@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -ax splice -ub --MD --eqx -t 8 Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./cons.fa
yangao07 commented 2 years ago

There should be SQ lines in the sam header. Can you paste all the header of cons.fa.sam? Also, not sure if the reference sequence is correct. Can you try fxtools lp Homo_sapiens.GRCh38.dna.toplevel.fa.gz and paste the output?

getziadz commented 2 years ago

Hi, Thank you for your responses.

Yes, you are right, it does not have SQ line. I am not sure what could be the reason though:

@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -ax splice -ub --MD --eqx -t 8 Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./cons.fa

de442b94-5914-48e7-be85-347fcab3600f_cons0      0       13      42168439        60      646S15=2D75=1D1=1D15=4D126N29=2D67=1D4=1I23=2I14=5241N67=1X17=3990N69=1I17=8828N3=1D1=1D55=1D38=1887N44=1D102=1D11=1X37=1I77=1093N22=1D12=1X28=2D39=1D17=259S   *       0       0

       CATATGAACTCAAATTGCCACCAAAAGCTTCCCTACT (and the sequence continues)

I also attached the output of the fxtools here. fxtools_output.txt

Thanks!

yangao07 commented 2 years ago

The problem is the reference genome you used. See https://github.com/lh3/minimap2/issues/58 and https://github.com/lh3/minimap2/issues/37 for more information. Changing to a non-toplevel/primary_assembly reference should work.

Yan

getziadz commented 2 years ago

Hi Yan, Thank you for your help. The above suggestion worked and solved the problem of the SQ lines. The sam file now starts with multiple SQ lines (I am pasting an excerpt below), is this what we expected?:

@SQ     SN:1    LN:248956422
@SQ     SN:10   LN:133797422
@SQ     SN:11   LN:135086622
@SQ     SN:12   LN:133275309
@SQ     SN:13   LN:114364328
@SQ     SN:14   LN:107043718
@SQ     SN:15   LN:101991189
@SQ     SN:16   LN:90338345
@SQ     SN:17   LN:83257441
@SQ     SN:18   LN:80373285
@SQ     SN:19   LN:58617616
@SQ     SN:2    LN:242193529
@SQ     SN:20   LN:64444167
@SQ     SN:21   LN:46709983
@SQ     SN:22   LN:50818468
@SQ     SN:3    LN:198295559
@SQ     SN:4    LN:190214555
@SQ     SN:5    LN:181538259
@SQ     SN:6    LN:170805979
@SQ     SN:7    LN:159345973
@SQ     SN:8    LN:145138636
@SQ     SN:9    LN:138394717
@SQ     SN:MT   LN:16569
@SQ     SN:X    LN:156040895
@SQ     SN:Y    LN:57227415
@SQ     SN:KI270728.1   LN:1872759
@SQ     SN:KI270727.1   LN:448248

However, now there is a new error:

== 21:11:45-Nov-09-2021 == [classify_bam_core] Processing primary_ref/cons.fa.sam done.
== 21:11:45-Nov-09-2021 == [Classifying] Classifying consensus alignment done!
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 370, in __init__
    self.file = self._fasta_opener(filename, 'r+b'
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 273, in open
    return BgzfReader(filename, mode)
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 584, in __init__
    self._load_block(handle.tell())
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 611, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
    raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\x1f\x8b\x08\x00'; handle.tell() now says 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
    main()
  File "/home/user/anaconda3/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/home/user/anaconda3/bin/isocirc", line 132, in isocirc_core
    hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa,
  File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/hcBSJ_fullIso.py", line 761, in hcBSJ_fullIso
    ref_fa, cons_fa = Fasta(ref), Fasta(cons)
  File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 998, in __init__
    self.faidx = Faidx(
  File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 374, in __init__
    raise UnsupportedCompressionFormat(
pyfaidx.UnsupportedCompressionFormat: Compressed FASTA is only supported in BGZF format. Use the samtools bgzip utility (instead of gzip) to compress your FASTA.

FYI, my fastq file is not zipped, it is just fastq, and I did not touch the reference genomes, they are zipped as they come from NCBI. Any suggestions for this error?

Thank you for your time on this.

yangao07 commented 2 years ago

pyfaidx.UnsupportedCompressionFormat: Compressed FASTA is only supported in BGZF format

The pyfaidx only accepts BGZF compressed .gz fasta file. I guess the reference genome you downloaded from NCBI is gziped'. You can simply unzip it and re-run isocirc.

getziadz commented 2 years ago

Hi, Thank you for the suggestion, it worked all the way through when I unzipped the genome files.

One last question I have is, in the stats.out file, although I used human RNA, it finds a lot of reads with BSJs, and total candidate BSJ, it reports that total known high confidence BSJs are 0. I used the CircBase’s human circRNA bed file. Why could that be? Shouldn’t some of the circRNAs detected be the known ones?

yangao07 commented 2 years ago

The circBase coordinates are hg19, did you convert them to hg38?

getziadz commented 2 years ago

Hi Yan, Thank you for your suggestion, that was a great catch. I indeed had not convert it to the hg38. I was sure your suggestion was going to solve the problem but I converted it using LiftOver and reran isocirc, but I got the same results. Known BSJs are zero, and in the out all isoform IDs are “isocirc0”, “isocirc1”… etc. Do you have any other ideas of how I can fix it?

yangao07 commented 2 years ago

In addition to the liftover, the chromosome names should also be the same, like, "chr1" and "1" should be consistent.