genome / bam-readcount

Count bases in BAM/CRAM files
MIT License
299 stars 95 forks source link

Many single-end mapping quality warnings #2

Closed sjackman closed 11 years ago

sjackman commented 11 years ago

BWA-MEM 0.7.4 does not output single-end mapping qualities. Running bam-readcount results in millions (one per read) of warnings. Please emit one warning and suppress all following warnings.

Couldn't grab single-end mapping quality for read M01171:3:000000000-A3E2L:1:1101:10683:2958. Check to see if SM tag is in BAM
tabbott commented 11 years ago

Yeah that is definitely annoying. I'll fix that tonight. I'm thinking of adding a flag -w n where n is the max # of times to emit a warning about a particular missing tag.

tabbott commented 11 years ago

I pushed a change that enables the -w n flag to cap warning messages at n. I'll ask the author of the tool what he thinks about your other suggestion (using normal mapping quality in place of SE in the absence of the SM tag).

sjackman commented 11 years ago

Thanks, Travis. The -w option works like a charm.

sjackman commented 11 years ago

Any plans for a stable release that includes the new -w and -d options?

balthasar0810 commented 9 years ago

hi,

i am using bam-readcount to filter out false positives from snp calling and encountered the same issue.

i am getting (sometimes almost 5million) warnings like: WARNING: In read XXX_XXX_XXX: Couldn't find single-end mapping quality. Check to see if the SM tag is in BAM.

actually, i dont fully understand what it exactly means? (please find an example read pair below)

as you can see from the example, both reads have a proper mapq field, so, 1.) where is the problem? 2.) why does it need the sm tag? 3.) why is there no problem with most of the other reads?

thanks so much for any advice! chris

here is one of the read pairs, XXX_XXX_XXX 99 1 565899 21 50M = 565967 68 CTACGCCTAATCTACTCCACCTCAATCACACTACTCCCCATATCTAACAA `````````````````````````````````````````````````A XA:i:3 MD:Z:2G35T11 XE:Z:-------------------------------------------------- PG:Z:bfast IH:i:1 NH:i:2 HI:i:1 CM:i:0 NM:i:2 CQ:Z:A<A=;>BBB@BB@ABBABAAAB?BBABB@BBB?BBA<BBBB@ABBBBA MQ:i:28 AS:i:2100 CS:Z:T22313302303223122011022103211112312200013332230110 RG:Z:4

XXX_XXX_XXX 147 1 565967 28 35M = 565899 -68 GTTTGAACACACAACACCCACCCCATTCCTCCCCA 2HNN""1\SUX^M9O```````\ XA:i:3 MD:Z:14A20 XE:Z:----11----------------------------- PG:Z:bfast IH:i:1 NH:i:2 HI:i:1 CM:i:2 NM:i:1 CQ:Z:@ABBB<7B?;@=BB=:BBA%+9@5:28;B21-8-2 MQ:i:21 AS:i:1300 CS:Z:T01000220203100011001110111111111001 RG:Z:4

ernfrid commented 9 years ago

Hi Chris,

You should use the -w option (for example -w 1) to suppress repeated warnings. I will make this the default in a future release. The reason those exist is that bam-readcount attempts to calculate the average single ended mapping quality for each allele. If the aligner doesn't place that value in the BAM record then it cannot do so.

balthasar0810 commented 9 years ago

hey thanks so much for the quick reply!

1.) but why does it not complain about ALL reads? 2.) does it somehow mess up my fpfilter.pl results?

best, chris

ernfrid commented 9 years ago
  1. It should complain about all reads that it observes overlapping your variants of interest. My guess is that is the 5 million warnings that you're seeing: 1 for every read analyzed.
  2. It should have no effect on your fpfilter.pl results at all.
balthasar0810 commented 9 years ago

Thank you very much!

fpbarthel commented 9 years ago

Hi, I have the impression that it does affect the fpfilter.pl results. Whenever this error appears, I consistently have fpfilter returning nothing, eg:

15988 variants in input file 15988 had a bam-readcount result 15964 had reads1>=2 0 passed filters 15988 failed filters 0 failed because no readcounts were returned 0 failed minimim variant count < 2 143 failed minimum variant freq < 0.1 3494 failed minimum strandedness < 0.1 48 failed minimum variant readpos < 0.1 96 failed minimum variant dist3 < 0.1 6610 failed maximum variant MMQS > 150 5029 failed maximum MMQS diff (var - ref) > 150 143 failed maximum mapqual diff (ref - var) > 50 2658 failed minimim ref mapqual < 30 9454 failed minimim var mapqual < 30 15976 failed minimim ref basequal < 30 15919 failed minimim var basequal < 30 1342 failed maximum RL diff (ref - var) > 0.25

Does anyone have this too? Any fix for this? Should I do a base recalibration? This is CGHub TCGA data which should be properly recalibrated.

ernfrid commented 9 years ago

The number failing the minimum basequality filter is very surprising to me. It seems to me it could be one of two things:

  1. A parsing error in the fpfilter.pl script
  2. Much lower basequalities than I'm used to seeing on Illumina data.

Where did your fpfilter.pl come from? Has it been modified from any of those commonly utilized (VarScan, SomaticSniper or the ckandoth version)?

Would you be able to share an example line from your data? I don't need the position information, just the count section.

fpbarthel commented 9 years ago

I wonder if it's the first one. This is Broad aligned Illumina WGS seq data which I guess passed rigorous QC.

The fpfilter I'm using is the VarScan version which has been integrated in the latest release 2.3.9 here https://sourceforge.net/projects/varscan/files/. It's not been modified in any way.

See also my thread there https://sourceforge.net/p/varscan/discussion/1073559/thread/b2c0c9cb/.

Here's the first few lines of an example input file https://dl.dropboxusercontent.com/u/2003036/test2.snp.Somatic.hc (~9kb) And readcount file https://dl.dropboxusercontent.com/u/2003036/test2.snp.Somatic.hc.rc (~34kb)

ernfrid commented 9 years ago

I think there might be an off by one error in the bam-readcount parsing parts of the integrated fpfilter. I would suggest moving back to using the perl script instead. Dan is out of the office until next week, but may be able to address the issue when he returns.

ernfrid commented 9 years ago

On further review, I take that back. I misread the code. I think the issue here is that for your data the avg basequality is actually a little less than 30. Base quality recalibration usually lower base qualities so the default of 30 here is likely to be too high on these data. I would consider reducing the cutoff there and seeing if that fixes the issue.

fpbarthel commented 9 years ago

Somehow suprising because the data that has this is very new data, sequenced late last year/early this year whereas the data that does not is much older (several years). All of it is Broad data. There is one thing different, that is that the read length is 150bp versus 101 bp on the older data.

ernfrid commented 9 years ago

It depends on the platform. I believe the HiSeq X does have lower base qualities than older Illumina platforms (although I do not know if these data were generated on that platform). At MGI, we do not filter on average base quality (this appears to be a recent addition to VarScan) and I do not have a good sense for appropriate cutoffs. If you are still observing issues with the perl script, I am happy to troubleshoot, but unfortunately, further investigation into the newly integrated VarScan filter would need to wait until Dan is back.

fpbarthel commented 9 years ago

I think you're right that these might be too stringent. I work with glioma data and these filters seem to have removed all IDH mutations and all TERT promoter mutations.It's quite surprising that there are still 400k mutations left!

ernfrid commented 9 years ago

Well that seems like a good positive control and those results are not reassuring. I would be curious to know if using the Perl script helps.

fpbarthel commented 9 years ago

I followed your initial suggestion and changed the minimum base quality (variant and reference) to 10. This was able to recover all TERT and IDH mutatations (over 70 collectively) which suggests that this filter was the major culprit. Results look good so far.

ernfrid commented 9 years ago

Ah, excellent. I will relay to Dan.