Coverage file incorrect

Estel-Kitsune commented 7 years ago

Hi,

I seem to have a problem with the coverage file generated by blobtools. This is the head of my .cov file (the alignment was done with bwa 0.7.8):

blobtools v0.9.19

Total Reads = 42239014

Mapped Reads = 50620400

Unmapped Reads = -8381386

The total reads number is correct. I guess I really might have 8381386 unmapped reads , and then 42239014 - 8381386 = 33,857,628 mapped reads.

And the stat file looks like this:

blobtools v0.9.19

bam0=mybam.bam

name colour count_visible count_visible_perc span_visible span_visible_perc n50 gc_mean gc_std bam0_mean bam0_std bam0_read_map bam0_read_map_p

all None 63,908 100.0% 59,761,021 100.0% 1,345 0.4 0.047 90.3 872.7 50,620,400 119.8%

Any idea about what causes this error?

Thanks, Estelle

DRL commented 7 years ago

Hi,

what version of samtools are you using?

cheers,

dom

Estel-Kitsune commented 7 years ago

Hi DRL,

I used samtools 1.3 when running blobtools.

Cheers, E

DRL commented 7 years ago

interesting, ...

the thing is that you have more mapped reads as total reads, which does not make any sense.

There must be some trouble with your bam file.

Can you run samtools flagstat mybam.bam and send me the output?

Estel-Kitsune commented 7 years ago

Hi,

This is the output of samtools flagstat. I guess there's something wrong with my BAM file.

51876606 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
9637592 + 0 supplementary
0 + 0 duplicates
50620400 + 0 mapped (97.58% : N/A)
42239014 + 0 paired in sequencing
21119507 + 0 read1
21119507 + 0 read2
38371134 + 0 properly paired (90.84% : N/A)
40701846 + 0 with itself and mate mapped
280962 + 0 singletons (0.67% : N/A)
2302704 + 0 with mate mapped to a different chr
2158868 + 0 with mate mapped to a different chr (mapQ>=5)

DRL commented 7 years ago

The BAM file seems to look good. Otherwise samtools would have complained. The output looks like there should be no problem for blobtools to parse this ...

Can you check whether you might have other samtools versions in your path? echo $PATH samtools --version

And could you also send me the command you used for generating the COV file?

Estel-Kitsune commented 7 years ago

Hi,

With our module system we can't have different versions of a tool at the same time.

But I tried it a second time, not using modules but a conda environment with samtools 1.3.1. My command is:

python /home/estelle/bin/blobtools/blobtools create -i Trinity_bestiso_nozero_l300.fa -b sort.Trinity_bestiso_nozero_l300.bam -t all.blastx.swissprot.blastout -t all.megablastn.nt.blastout -t Trinity_bestiso_nozero_l300.vs.silva.1e65.megablast.out -o my_second_blob --nodes /proj/b2013006/nobackup/sw/data/blobology/nodes.dmp --names /proj/b2013006/nobackup/sw/data/blobology/names.dmp

and the output is still problematic (mapping rate of 119.8%):

`[STATUS] : Parsing FASTA - Trinity_bestiso_nozero_l300.fa

[STATUS] : Creating nodesDB from /proj/b2013006/nobackup/sw/data/blobology/nodes.dmp and /proj/b2013006/nobackup/sw/data/blobology/names.dmp

[STATUS] : Parsing tax0 - /pica/v12/b2011127_nobackup/private/blobtools/all.blastx.swissprot.blastout

[STATUS] : Parsing tax1 - /pica/v12/b2011127_nobackup/private/blobtools/all.megablastn.nt.blastout

[STATUS] : Parsing tax2 - /pica/v12/b2011127_nobackup/private/blobtools/Trinity_bestiso_nozero_l300.vs.silva.1e65.megablast.out

[STATUS] : Computing taxonomy using taxrule(s) bestsum

[PROGRESS] : 100%

[STATUS] : Parsing bam0 - /pica/v12/b2011127_nobackup/private/blobtools/sort.Trinity_bestiso_nozero_l300.bam

[STATUS] : Checking with 'samtools flagstat'

[STATUS] : Mapping reads = 50,620,400, total reads = 42,239,014 (mapping rate = 119.8%)

[PROGRESS] : 100%

[STATUS] : Writing sort.Trinity_bestiso_nozero_l300.bam.cov

[STATUS] : Generating BlobDB and writing to file my_second_blob.blobDB.json `

DRL commented 7 years ago

Hi,

first two comments:

you can just write /home/estelle/bin/blobtools/blobtools create, you should not need to write python /home/estelle/bin/blobtools/blobtools create as blobtools uses the python version in your current environment.
could it be that you are blobtool'ing a transcriptome? If so, be aware that coverage does not mean the same thing for transcriptome assemblies as for genome assemblies ...

Blobtools has successfully finished.

So actually everything went ok-ish. The only problem, as you have spotted, is that it incorrectly inferred the mapping rate. I fixed that and if you do git pull in the blobtools folder you should get the newest version of the master-branch. The error was that blobtools did: reads_mapped = reads_mapped - reads_secondary - reads_secondary instead of: reads_mapped = reads_mapped - reads_secondary - reads_supplementary (although this has minimal influence in the final result)

can you pull the master branch and confirm that the mapping rate is sensible now?

Estel-Kitsune commented 7 years ago

Yes I'm blobtool'ing a transcriptome! :)

And now the mapping info is correct:

[STATUS] : Mapping reads = 40,982,808, total reads = 42,239,014 (mapping rate = 97.0%)

Thanks a lot for your help!

DRL commented 7 years ago

Thank YOU for spotting the bug!

I actually have never done a blobplot of a transcriptome, since I work mostly on genome assemblies (but I am aware that people do this).

If you have any ideas/suggestions re best-practices for transcriptome filtering/blobtool'ing practices, please let me know and I can add them to the readme.io

cheers,

dom

Estel-Kitsune commented 7 years ago

Sorry to re-open this thread, but 2 things:

Now blobtools create gives me a warning about my BAM, so maybe there's something wrong with my file in the end...

[STATUS] : Checking with 'samtools flagstat' [STATUS] : Mapping reads = 40,982,808, total reads = 42,239,014 (mapping rate = 97.0%) [PROGRESS] : 100% [PROGRESS] : 123% [WARN] : Based on samtools flagstat: expected 40982808 reads, 50620400 reads were parsed [STATUS] : Writing sort.Trinity_bestiso_nozero_l300.bam.cov [STATUS] : Generating BlobDB and writing to file my_second_blob.blobDB.json

The COV header:

blobtools v0.9.19

Total Reads = 42239014

Mapped Reads = 40982808

Unmapped Reads = 1256206

The stats file and the PNG created by blobtools blobplot are still incorrect (identical to before the bug), with the "all" map percentage equal to 119.8%.

Cheers, Estelle

DRL commented 7 years ago

Hmm, ...

The warning is because blobtools expected different numbers based on the samtools flagstat output.

Changed the code, could you pull and check?

Estel-Kitsune commented 7 years ago

Hi,

I got the same output after the pull.

/E

DRL / blobtools

Coverage file incorrect #39

blobtools v0.9.19

Total Reads = 42239014

Mapped Reads = 50620400

Unmapped Reads = -8381386

blobtools v0.9.19

bam0=mybam.bam

name colour count_visible count_visible_perc span_visible span_visible_perc n50 gc_mean gc_std bam0_mean bam0_std bam0_read_map bam0_read_map_p

blobtools v0.9.19

Total Reads = 42239014

Mapped Reads = 40982808

Unmapped Reads = 1256206