ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Mv flag is too much (?) #96

Closed by underbais 2 months ago

underbais commented 1 year ago

Hi Christoffer,

Great package! My question is about the rationale behind the Mv flag (minor variant), which is assigned when a different variant is present at the same base pair at higher VAF (issue #30). The thing is, that type of scenario is exactly what we are studying: patients under treatment develop drug resistance by positively selecting multiple variants at the same base. It's considered one of the mechanisms of drug resistance, and it is confirmed by orthogonal methods (the variants are very likely real!). Flagging and filtering them out makes superFreq miss the important variants/clones, and they are flagged only by Mv, not by any other flag, which makes them good calls otherwise. Would it make sense to introduce, say, a --keepMv parameter to relax the filtering? Overall, superFreq is very strict, which is a good thing. However, adding some flexibility wouldn't hurt, would it? I understand the major flags (Bq, Mq, Sb, St etc.), but Mv might be overkill. Thoughts?

Thanks Chingiz

ChristofferFlensburg commented 1 year ago

Hi!

The exact filtering strategy, and cuts and choices like these, evolved over years, especially at the start of development. Every choice is a compromise between sensitivity and accuracy, and as you say, I've tended towards being a bit more conservative than many other variant callers.

This specific flag, I honestly don't remember the details behind. I assume we had a fair amount of noise, I think in particular around repeats, where SNVs and indels of different lengths would be present due to stuttering. Recurring sequencing noise like that should be removed pretty well by the reference normals (which will have the same stutter), but repeats with many possible noise variants may not all be present in the reference normals, and those would be caught by the Mv flag.

I get that it risks filtering real and interesting variants, especially around hotspots as you say, and it sounds like an interesting research project. I'd consider looking into removing the flag and seeing how much noise that introduces, and/or adding an option as you say; I think that would make sense. However, it's not realistic to have this float to the top of my todo list anytime soon.

What I can do is point you to where you can modify the code, and you should be able to compile your own version of superFreq. The easiest and cleanest way would be to remove the flag entirely, and for that I think the simplest approach is to just remove (or comment out) lines 724 and 725 in getVariantsByIndividual.R:

      minorVariants = ret$var > 0 & (ret$var < max(ret$var) | sum(ret$var == max(ret$var)) > 1)
      ret$flag[minorVariants] = paste0(ret$flag[minorVariants], 'Mv')

Then it should just not assign the flag at all, and it shouldn't affect somatic variant calling.

You can implement that as an option as well if you want, you'd have to add it to superFreq() in analyse.R, including the documentation if you want to be proper, and then pass it through all the layers until it reaches those lines.
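If you go the option route, a minimal sketch of what it could look like at the flagging site (the `keepMv` parameter name is hypothetical, and it would need to be threaded down from superFreq() in analyse.R through the intermediate calls):

```r
# In getVariantsByIndividual.R, wrapping the two lines quoted above.
# keepMv is a hypothetical argument passed down from superFreq();
# with keepMv=FALSE this reproduces the current behaviour.
if ( !keepMv ) {
  minorVariants = ret$var > 0 &
    (ret$var < max(ret$var) | sum(ret$var == max(ret$var)) > 1)
  ret$flag[minorVariants] = paste0(ret$flag[minorVariants], 'Mv')
}
```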

Good luck

underbais commented 1 year ago

Hi!

Thanks for your detailed feedback and suggestions. Indeed, I checked what variants get flagged as Mv and it's a lot, since all multiallelic sites qualify as Mv. Keeping them all would introduce a lot of garbage. Ideally, a whitelist of a few important variants (flagged with Mv only, no other flags) that could be added on top of the filtered set for downstream clonal analysis would make it perfect in my case. I would really appreciate it if you could point me in the right direction on how to implement that in code. It's basically one or two variants in a few patients that would require this. The issue is that those variants become big enough (in VAF, over time) that they can't be ignored for clonal tracking: leaving them out distorts the phylogeny.

ChristofferFlensburg commented 1 year ago

If you just want to check specific variants in a limited number of samples, then I'd just look them up in the allVariants.Rdata output after the superFreq run. That has data on all the variants at all the sites in the input VCFs, flagged, somatic or otherwise. Use chrToX() to convert from chr-pos notation to the genome-spanning x coordinate superFreq uses.
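A rough sketch of that lookup (the path, chrToX() signature, and the layout of the loaded object are assumptions on my part; inspect the object first):

```r
# Load the saved variant data from the superFreq run
# (example path; it sits in the output directory of the run).
load('R/allVariants.Rdata')

# Convert chr-pos to superFreq's genome-spanning x coordinate.
# Position, chromosome notation and genome= are examples; match your run.
x = chrToX('7', 140453136, genome='hg19')

# The internal layout is a guess: inspect the structure first,
# then subset the per-sample variant table on its x column.
str(allVariants, max.level=2)
```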

In general, if a variant is called as somatic in one sample, then it's tracked across all samples and used in the phylogeny, even if flagged in other samples.

To force a variant into the clonal tracking, you'd need to set its somaticP to 1 (or larger than 0.9) after the somatic variant calling but before the clonal tracking. I'd probably do it just after the somatic variant calling. somaticP is set in markSomatic() in matchFlagVariants.R, lines 478-479:

    somaticP = (1-pPolymorphic)*(1-pNormalFreq)*normalOK*pSampleOK*(1-pZero)*
               notInNormal*lowFrequencyPenalty*lowCoveragePenalty*fewVariantReadsPenalty*(1-censor)

You can put a forced whitelist just after those two lines. Match your hotspots against the position q$x (use chrToX() as above, and make sure you set genome= right), and against the variant q$variant if you want (or just q$x if you want to capture all variants at that base pair). If you don't want to mark 0-VAF variants, you can add your own cuts on q$var and q$cov, which are the alternative and total read depths.
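A hypothetical sketch of such a whitelist, placed right after the somaticP assignment quoted above (position, genome and cut-offs are illustrative, not recommendations):

```r
# Hypothetical forced whitelist in markSomatic() (matchFlagVariants.R),
# inserted just after somaticP is computed. Set genome= to match your run.
whitelistX = chrToX('7', 140453136, genome='hg19')  # example hotspot

forced = q$x %in% whitelistX &   # hotspot position matches
         q$var >= 3 &            # at least 3 supporting reads (own cut)
         q$var/q$cov >= 0.01     # VAF at least 1% (own cut)
somaticP[forced] = 1
```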

You can hard-code the hotspot variants (easier) or have them as input to superFreq() and pass them through all the nested functions (harder).

The clonal tracking uses somatic variants with somaticP > 0.95 (getStories.R, line 45), together with high quality CNAs, as anchor mutations to define the clones; after that it looks wider at variants with somaticP > 0.5 and assigns those to the clones defined by the anchor mutations. So if you set your hotspot mutations to somaticP > 0.95, they will be anchor mutations and are essentially guaranteed to make up a clone. The exception is if they break the crossing rules (i.e., disjoint clones making up significantly more than 100% of a sample), in which case some clones will be removed. I think that should work, but I haven't tried this myself, so unexpected issues might come up.

I get the importance of tracking hotspot variants at sites that are relevant to your system, but in general I am not a fan of forcing variants into an otherwise unbiased analysis like this. It's a very supervised analysis and not what superFreq is made for. If superFreq had any kind of warranties, they'd be void. The phylogeny you get out will no longer be an unbiased superFreq clonal tracking, but will be forced to contain certain variants, which you need to be very clear about if you publish these results. I personally wouldn't go that way, and would instead look up the VAFs manually and report them for what they are (just line plots of VAF across samples, for example), which is what we've done in similar studies.

Good luck.

underbais commented 1 year ago

Hi

Thanks a lot. I was able to bypass the Mv flag for my variants of interest by setting somaticP=0.99 for them. But then I realized they are also being "shaved" by the cov, var and ref filtering you mention in the manual (section 5.1.5, Somatic variants). Although this is beneficial for the overall pipeline, for some important hotspots with low VAF it leads to their disappearance. Those variants are confirmed true hits by an orthogonal method (ddPCR), and their VAFs (2-15% in this case) correlate well between ddPCR and the variant caller (I am using Octopus). However, what I see in the superFreq results is that var reads are reduced to 1-4 reads, which lowers the VAF to as low as 0.015, or the variant disappears entirely. I wonder where in the code this is controlled. I'm very curious to see whether my hotspots make it into the clonal tracking with unaltered read counts and how that would affect the phylogeny. As of now, the phylogeny is skewed, possibly due to the absence/reduction of hotspot subclones; we have single cell data for some samples as a reference. Thanks again for your time.

ChristofferFlensburg commented 1 year ago

Oh, I had forgotten that there is a manual, thanks for that! :D I'm not quite sure what you are referring to, though. Shaved? Setting somaticP to 0.99 should bypass the filtering on cov, var and ref counts.

The read counts are downweighted by quality, using the phred scores of the base call quality and the alignment quality. For high quality reads and bases, this should have no or only a tiny effect, and the superFreq ref/cov/var counts should match the ones you'd get by counting manually in IGV. If this is having a significant effect, it means the variant is based on low quality base calls and/or alignments, which is a worry. This happens much earlier in the pipeline, just after superFreq reads in the mpileups from the bam files, on line 796+ in getVariantsByIndividual.R:

  #probabilities that the base call and mapping is correct.
  bqW = 1-10^(-pileup$qual/10)
  mqW = 1-10^(-pileup$mapq/10)
  qW = bqW*mqW
  RIB = sum(1-qW)/length(qW)
  if ( is.na(RIB) ) RIB = 0

  #count weighted reads.
  refN = round(sum(qW[ref]))
  varN = round(sum(qW[var]))
  cov = round(sum(qW))
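As a standalone worked example of that weighting (not superFreq code): a read with base quality phred 30 and mapping quality phred 20 contributes close to a full read, while a poorly mapped read contributes only about half.

```r
# Illustration of the phred-based weighting above, for a single read.
qual = 30; mapq = 20
bqW = 1 - 10^(-qual/10)   # P(base call correct)   = 0.999
mqW = 1 - 10^(-mapq/10)   # P(alignment correct)   = 0.99
qW  = bqW * mqW           # combined weight        = 0.98901

# A read with mapping quality 3 is weighted down to about half a read:
1 - 10^(-3/10)            # ~0.499
```

So if your var counts drop from, say, 10 supporting reads to 1-4, most of those reads must carry low base or mapping qualities at that position.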

Not sure if that is what you asked.