DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0
187 stars 44 forks source link

blobtools map2cov fails using bam file from blasr aligner #41

Closed JPascualAnaya closed 7 years ago

JPascualAnaya commented 7 years ago

Hi there,

I am using blobtools with a PacBio-only assembly and I'm getting problems to get the coverage file from an alignment done with blasr (PB reads against assembly) in bam. map2cov makes a WARNING for each of the contigs in the assembly, complaining that they do not belong to such assembly, and finally no coverage info is written to the cov file. Any idea what the problem might be and how to solve this?

Software: bloobtools v0.9.19.3 blasr v5.3.8c16f52 samtools 1.3.1

Running on a Centos 6.8

First, I assembled a preliminary assembly with Canu, using PB subreads in fastq, then aligned the PB reads to this preliminary assembly using blasr, output in bam format. Commands:

blasr input.fofn $ASSEMBLY/canu20160810/Ttra_canu1.contigs.fasta --bam --nproc 24 --out fastq_vs_TtraCanu1.default.blasr.bam > fastq_vs_TtraCanu1.default.blasr.log 2>&1
samtools sort -o fastq_vs_TtraCanu1.default.blasr.sorted fastq_vs_TtraCanu1.default.blasr.bam
samtools index fastq_vs_TtraCanu1.default.blasr.sorted.bam

virtualenv ~/virtualenvs/blobtools
source ~/virtualenvs/blobtools/bin/activate
~/Software/blobtools/blobtools map2cov -i $ASSEMBLY/canu20160810/Ttra_canu1.contigs.fasta -b fastq_vs_TtraCanu1.default.blasr.sorted.bam -o fastq_vs_TtraCanu1.default.blasr > fastq_vs_TtraCanu1.default.blasr.map2cov.log 2>&1

This is the STDOUT I get:

head fastq_vs_TtraCanu1.default.blasr.map2cov.log 
[STATUS]        : Parsing FASTA - /home/champi/Documents/TerebrataliaGenome/assemblies/canu/canu20160810/Ttra_canu1.contigs.fasta
[STATUS]        : Parsing bam0 - /home/champi/Documents/TerebrataliaGenome/assemblies/canu/blobologyFiltering/fastq_vs_TtraCanu1.default.blasr.sorted.bam
[STATUS]        :       Checking with 'samtools flagstat'
[STATUS]        :       Mapping reads = 2,596,752, total reads = 16,556,960 (mapping rate = 15.7%)
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
[WARN]          : tig00000000 is not part of the assembly
$ tail fastq_vs_TtraCanu1.default.blasr.map2cov.log
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[WARN]          : tig00011812 is not part of the assembly
[PROGRESS]      :       100%
[STATUS]        :       Writing fastq_vs_TtraCanu1.default.blasr.sorted.bam.cov
[WARN]          : Sum of coverage in cov lib bam0 is 0.0. Please ignore this warning if "--no_base_cov" was specified.

And this is the output file:

$ head fastq_vs_TtraCanu1.default.blasr.sorted.bam.cov
## blobtools v0.9.19.3
## Total Reads = 16556960
## Mapped Reads = 2596752
## Unmapped Reads = 13960208
## Source(s) : /home/champi/Documents/TerebrataliaGenome/assemblies/canu/blobologyFiltering/fastq_vs_TtraCanu1.default.blasr.sorted.bam
# contig_id     read_cov        base_cov
tig00000000     0       0.0
tig00000001     0       0.0
tig00000002     0       0.0
tig00000003     0       0.0
JPascualAnaya commented 7 years ago

Hi there,

I repeated all using the sam file, and changing all whitespaces in the headers of the assembly fasta to underscores, and I still get the same issue (even though I got all the reads mapped...)

$ head baxh5_vs_TtraCanu1.default.blasr.sam.cov
## blobtools v0.9.19.3
## Total Reads = 17401374
## Mapped Reads = 17401374
## Unmapped Reads = 0
## Source(s) : /home/champi/Documents/TerebrataliaGenome/assemblies/canu/blobologyFiltering/baxh5_vs_TtraCanu1.default.blasr.sam
# contig_id     read_cov        base_cov
tig00000000_len=13620_reads=12_covStat=17.50_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00000001_len=19185_reads=11_covStat=33.92_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00000002_len=21833_reads=37_covStat=35.63_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00000003_len=34409_reads=65_covStat=62.81_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
$ tail baxh5_vs_TtraCanu1.default.blasr.sam.cov
tig00011780_len=15717_reads=44_covStat=21.52_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00011783_len=7536_reads=6_covStat=14.11_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no      0       0.0
tig00011786_len=2405_reads=14_covStat=-3.87_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no     0       0.0
tig00011789_len=4318_reads=5_covStat=7.03_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no       0       0.0
tig00011792_len=9055_reads=7_covStat=20.15_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no      0       0.0
tig00011802_len=20122_reads=41_covStat=30.93_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00011805_len=4087_reads=8_covStat=4.58_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no       0       0.0
tig00011808_len=37484_reads=88_covStat=51.07_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00011811_len=23810_reads=43_covStat=40.01_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
tig00011812_len=17599_reads=21_covStat=43.39_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no    0       0.0
DRL commented 7 years ago

Hi JPascualAnaya ,

Blobtools needs the IDs of sequences (headers in FASTA, qseqids in BLAST report, contigIDs in BAM) to be the same. That's why it complains.

cheers,

dom

JPascualAnaya commented 7 years ago

Hi Dom,

As fas as I can tell, after changing the whitespaces for underscores in the assembly and rerunning all, the names are exactly the same.

$ grep ">" Ttra_canu1.contigs.1line_nospaceID.fasta | head
>tig00000000_len=13620_reads=12_covStat=17.50_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000001_len=19185_reads=11_covStat=33.92_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000002_len=21833_reads=37_covStat=35.63_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000003_len=34409_reads=65_covStat=62.81_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000004_len=22174_reads=41_covStat=25.69_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000006_len=25564_reads=29_covStat=52.66_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000008_len=25598_reads=42_covStat=40.10_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000010_len=63521_reads=127_covStat=122.90_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000011_len=24718_reads=25_covStat=45.70_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000012_len=31414_reads=39_covStat=70.95_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
DRL commented 7 years ago

Hi JPascualAnaya,

are you still having that issue?

cheers,

dom

JPascualAnaya commented 7 years ago

Hi Dom,

I ended up to extracting the reads and covStat info myself from the headers, but never got map2cov to work.

Juan