DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0
184 stars 44 forks source link

Can't parse BAM files #101

Closed Mpersicae closed 4 years ago

Mpersicae commented 4 years ago

Hello,

I am trying to create a blobplot using a PacBio assembly of an insect genome. Here is the code I am using:

~/blobtools/blobtools create -i ~/Desktop/M-persicae-canu-consensus-2.fasta \
-b /media/yonathan/LaCie/M-persicae-consensus-aligned.bam \
-t ~/blobtools/M-persicae-blastn-blob.taxified.out \
-o M-p-blob

When I run this I get this output:

[+] Parsing FASTA - /home/yonathan/Desktop/M-persicae-consensus-2.fasta
[+] Parsing bam0 - /media/yonathan/LaCie/M-persicae-canu-consensus-aligned.bam
[X] Please (sort and) index your BAM file

The BAM file I am using has already been indexed and sorted against the input FASTA using the pbmm2 tool from PacBio SMRTLink Tools 8.0. The FASTA file was generated using Canu 1.9 and polished twice using Arrow.

I have also tried generating a COV file from these files using map2cov and ran into the exact same issue. I am unsure how to proceed from here, any help you could give would be much appreciated.

-Y.

DRL commented 4 years ago

Hi,

I am not familiar with pbmm2... but it is likely that it is not writing a 'proper' BAM file that pysam (the library blobtools uses) can read ...

Can you map using minimap2 ?

An example command would be:

minimap2 -t 8 -ax map-pb -r2k assembly.fasta reads.fasta.gz | samtools sort -@4 -o reads.vs.assembly.bam && samtools index pb.reads.vs.spialia_orbifer.HU_SO_1031.racon_pb_3.racon_sr_2.bam

If for whatever reason you can't use minimap2, then please make a minimal BAM file of ~1000 reads and sent me a dropbox/googledrive link so i can take a look.

cheers,

dom

Mpersicae commented 4 years ago

Hi Dom,

Thanks for the quick reply.

pbmm2 is a wrapper for minimap2, I think they should be equivalent. In any case, I solved this problem - I didn't realize that the .bai file had to be in the same directory as the indexed BAM file for samtools to work.

I have run into another issue though. When I run the same code I provided in my original post, I get this Error:

[+] Parsing FASTA - /home/yonathan/Desktop/M-persicae-consensus-2.fasta
[+] names.dmp/nodes.dmp not specified. Retrieving nodesDB from /home/yonathan/blobtools/lib/../data/nodesDB.txt 2.25M/2.25M [00:04<00:00, 559kit/s]
[+] Parsing tax0 - /home/yonathan/blobtools/M-persicae-blastn-blob-1.taxified.out
[+] Computing taxonomy using taxrule(s) bestsum | 3.06k/3.06k [00:00<00:00, 31.0kit/s]
[+] Parsing bam0 - /media/yonathan/LaCie/M-persicae-canu-consensus-aligned.bam
[+] -> 100.00 (3055/3055) of sequences have reads aligned to them.
[+] -> 100.00 (3137/3137) of reads are mapped.
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
Traceback (most recent call last):
  File "/home/yonathan/blobtools/blobtools", line 7, in <module>
    main()
  File "/home/yonathan/blobtools/lib/interface.py", line 60, in main
    create.main()
  File "/home/yonathan/blobtools/lib/create.py", line 119, in main
    blobDb.parseCoverage(covLibObjs=cov_libs, estimate_cov=estimate_cov_flag, prefix=prefix)
  File "/home/yonathan/blobtools/lib/BtCore.py", line 369, in parseCoverage
    base_cov_dict, covLib.reads_total, covLib.reads_mapped, read_cov_dict = BtIO.parseBam(covLib.f, set(self.dict_of_blobs), estimate_cov)
  File "/home/yonathan/blobtools/lib/BtIO.py", line 224, in parseBam
    base_cov_dict, read_cov_dict = estimate_coverage(aln, set_of_blobs)
  File "/home/yonathan/blobtools/lib/BtIO.py", line 232, in estimate_coverage
    est_read_length = estimate_read_lengths(aln, set_of_blobs)
  File "/home/yonathan/blobtools/lib/BtIO.py", line 201, in estimate_read_lengths
    for read in aln.fetch(header):
  File "pysam/libcalignmentfile.pyx", line 2092, in pysam.libcalignmentfile.IteratorRowRegion.__next__
OSError: truncated file

I am not familiar with how pysam works, do you have any ideas as to why it would be pulling this error. Would you still like me to provide a subset of my BAM file?

Thanks, Y.

DRL commented 4 years ago

Hi Dom,

Thanks for the quick reply.

pbmm2 is a wrapper for minimap2, I think they should be equivalent. In any case, I solved this problem - I didn't realize that the .bai file had to be in the same directory as the indexed BAM file for samtools to work.

I have run into another issue though. When I run the same code I provided in my original post, I get this Error:

[+] Parsing FASTA - /home/yonathan/Desktop/M-persicae-consensus-2.fasta
[+] names.dmp/nodes.dmp not specified. Retrieving nodesDB from /home/yonathan/blobtools/lib/../data/nodesDB.txt 2.25M/2.25M [00:04<00:00, 559kit/s]
[+] Parsing tax0 - /home/yonathan/blobtools/M-persicae-blastn-blob-1.taxified.out
[+] Computing taxonomy using taxrule(s) bestsum | 3.06k/3.06k [00:00<00:00, 31.0kit/s]
[+] Parsing bam0 - /media/yonathan/LaCie/M-persicae-canu-consensus-aligned.bam
[+] -> 100.00 (3055/3055) of sequences have reads aligned to them.
[+] -> 100.00 (3137/3137) of reads are mapped.
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
Traceback (most recent call last):
  File "/home/yonathan/blobtools/blobtools", line 7, in <module>
    main()
  File "/home/yonathan/blobtools/lib/interface.py", line 60, in main
    create.main()
  File "/home/yonathan/blobtools/lib/create.py", line 119, in main
    blobDb.parseCoverage(covLibObjs=cov_libs, estimate_cov=estimate_cov_flag, prefix=prefix)
  File "/home/yonathan/blobtools/lib/BtCore.py", line 369, in parseCoverage
    base_cov_dict, covLib.reads_total, covLib.reads_mapped, read_cov_dict = BtIO.parseBam(covLib.f, set(self.dict_of_blobs), estimate_cov)
  File "/home/yonathan/blobtools/lib/BtIO.py", line 224, in parseBam
    base_cov_dict, read_cov_dict = estimate_coverage(aln, set_of_blobs)
  File "/home/yonathan/blobtools/lib/BtIO.py", line 232, in estimate_coverage
    est_read_length = estimate_read_lengths(aln, set_of_blobs)
  File "/home/yonathan/blobtools/lib/BtIO.py", line 201, in estimate_read_lengths
    for read in aln.fetch(header):
  File "pysam/libcalignmentfile.pyx", line 2092, in pysam.libcalignmentfile.IteratorRowRegion.__next__
OSError: truncated file

I am not familiar with how pysam works, do you have any ideas as to why it would be pulling this error. Would you still like me to provide a subset of my BAM file?

Thanks, Y.

Excellent, you solved it. No need for the BAM then.

The new error means that your file is truncated... not likely to be an issue related to blobtools.

If you copied it, check md5-checksums... if not, then map again.

I am going to close this. If you run into more problems, just open another issue.

cheers,

dom