map2cov parses more reads than expected by samtools flagstat

filip-husnik commented 7 years ago

Hi Dominik,

My issue is similar to #39 where blobtools also reports 'more' reads than expected by flagstat. My bam file with corrected PacBio fasta reads mapped to the reference genome by bwa mem raises a warning with map2cov. It does not seem that the differences in numbers represent multimappings somehow missed by samflags used by map2cov. samtools view -f 1 -F 1024 -F 256 -F 2048 according to #40

I have quite a lot of supplementary reads. Is there maybe a problem in the way how BtIO.py gets the numbers of mapping and total reads? The warning says Based on samtools flagstat: expected 5031818 reads, 5529014 reads were parsed but when I run flagstat separately it correctly prints 5529014 + 0 mapped (99.71% : N/A)

Thanks!

Filip

My blobtools version is blobtools v0.9.19. I have only samtools 1.3.1 (+htslib 1.3.1) in my $PATH and this is its flagstat output:

5545330 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
497196 + 0 supplementary
0 + 0 duplicates
5529014 + 0 mapped (99.71% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

This is the map2cov warning:

[STATUS]    :   Checking with 'samtools flagstat'
[STATUS]    :   Mapping reads = 5,031,818, total reads = 5,048,134 (mapping rate = 99.7%)
[PROGRESS]  :   100%
[PROGRESS]  :   109% 
[WARN]      : Based on samtools flagstat: expected 5031818 reads, 5529014 reads were parsed

DRL commented 7 years ago

Hi Filip,

The command that is being run in the background in order to actually extract the reads is:

samtools view -f 1 -F 1024 -F 256 -F 2048

where the issue concerning your data is the -f 1 which mean the reads have to be paired, which I think yours are not. Hence the output will be wrong.

Bamfilter was meant for paired-end reads since SE reads can easily be grep'ed (which is also faster):

grep the contig-names you want to keep/delete
cut the readname column and put into file
Use something like Heng Li's seqtk subseq to extract the reads based on the readname.

Normally this warning is not a problem, but in your case it is misleading since you have SE reads... will add a check for this.

Hope that helps.

Cheers,

dom

filip-husnik commented 7 years ago

Thanks, Dominik. I see, I'm too used to work with PE reads that I did not realize the flag will be wrong for the PacBio SE reads. I don't really need to filter the bamfile, I was just checking the bamfilter functionality (unfortunately with a wrong file).

Does it mean that read_cov and base_cov values in the .cov file generated by map2cov are correct for a bam file with SE reads and I should just ignore the warning with similar bam files such as from PEARed/FLASHed PE reads?

Cheers, Filip

DRL commented 7 years ago

No worries.

Yes, map2cov (as well as create) will output a correct read_cov/base_cov for both PE and SE, as it uses the following flags:

samtools view -F 1024 -F 4 -F 256

Remember, that the message is just a "warning" which means that counts are different than expected, based on flagstats. But it is not an "error".

The only reason flagstats is actually invoked is to be able to produce a progress-bar while parsing. As long as you a happy with only getting coverage from reads that pass the flags, everything will be fine.

cheers,

dom

filip-husnik commented 7 years ago

Awesome, thanks.

Filip

DRL / blobtools

map2cov parses more reads than expected by samtools flagstat #46