DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0
187 stars 44 forks source link

map2cov parses more reads than expected by samtools flagstat #46

Closed filip-husnik closed 7 years ago

filip-husnik commented 7 years ago

Hi Dominik,

My issue is similar to #39 where blobtools also reports 'more' reads than expected by flagstat. My bam file with corrected PacBio fasta reads mapped to the reference genome by bwa mem raises a warning with map2cov. It does not seem that the differences in numbers represent multimappings somehow missed by samflags used by map2cov. samtools view -f 1 -F 1024 -F 256 -F 2048 according to #40

I have quite a lot of supplementary reads. Is there maybe a problem in the way how BtIO.py gets the numbers of mapping and total reads? The warning says Based on samtools flagstat: expected 5031818 reads, 5529014 reads were parsed but when I run flagstat separately it correctly prints 5529014 + 0 mapped (99.71% : N/A)

Thanks!

Filip

My blobtools version is blobtools v0.9.19. I have only samtools 1.3.1 (+htslib 1.3.1) in my $PATH and this is its flagstat output:

5545330 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
497196 + 0 supplementary
0 + 0 duplicates
5529014 + 0 mapped (99.71% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

This is the map2cov warning:

[STATUS]    :   Checking with 'samtools flagstat'
[STATUS]    :   Mapping reads = 5,031,818, total reads = 5,048,134 (mapping rate = 99.7%)
[PROGRESS]  :   100%
[PROGRESS]  :   109% 
[WARN]      : Based on samtools flagstat: expected 5031818 reads, 5529014 reads were parsed
DRL commented 7 years ago

Hi Filip,

The command that is being run in the background in order to actually extract the reads is:

samtools view -f 1 -F 1024 -F 256 -F 2048

where the issue concerning your data is the -f 1 which mean the reads have to be paired, which I think yours are not. Hence the output will be wrong.

Bamfilter was meant for paired-end reads since SE reads can easily be grep'ed (which is also faster):

  1. grep the contig-names you want to keep/delete
  2. cut the readname column and put into file
  3. Use something like Heng Li's seqtk subseq to extract the reads based on the readname.

Normally this warning is not a problem, but in your case it is misleading since you have SE reads... will add a check for this.

Hope that helps.

Cheers,

dom

filip-husnik commented 7 years ago

Thanks, Dominik. I see, I'm too used to work with PE reads that I did not realize the flag will be wrong for the PacBio SE reads. I don't really need to filter the bamfile, I was just checking the bamfilter functionality (unfortunately with a wrong file).

Does it mean that read_cov and base_cov values in the .cov file generated by map2cov are correct for a bam file with SE reads and I should just ignore the warning with similar bam files such as from PEARed/FLASHed PE reads?

Cheers, Filip

DRL commented 7 years ago

No worries.

Yes, map2cov (as well as create) will output a correct read_cov/base_cov for both PE and SE, as it uses the following flags:

samtools view -F 1024 -F 4 -F 256

Remember, that the message is just a "warning" which means that counts are different than expected, based on flagstats. But it is not an "error".

The only reason flagstats is actually invoked is to be able to produce a progress-bar while parsing. As long as you a happy with only getting coverage from reads that pass the flags, everything will be fine.

cheers,

dom

filip-husnik commented 7 years ago

Awesome, thanks.

Filip