AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Evaluate concordance between Mutect2 and Strelka2 and decide on next steps #30

Closed jharenza closed 5 years ago

jharenza commented 5 years ago

Evaluate concordance and determine overall set of high-confidence somatic variants for PBTA.

cansavvy commented 5 years ago

I've gotten the data fairly well wrangled and loaded. What kinds of stats should I focus on evaluating the similarities/differences between these two? What kind of comparison plots would you like to see?

cansavvy commented 5 years ago

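Here's what I'm thinking for first line analyses:

  • Venn diagrams of individual mutations detected by Mutect2 vs Strelka2
  • Correlation of total # mutations called for each sample
  • Correlation of total # mutations called for each gene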

We can then do these analyses and subset these by the type of mutation. Does this sound reasonable as a first pass?
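
A rough sketch of what the overlap counts and the per-sample correlation could look like. Everything here is an assumption for illustration: the toy calls, the choice to key each mutation by (sample, chrom, pos, ref, alt), and the helper names. It is not the code used for the figures in this thread.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy calls, each keyed as (sample, chrom, pos, ref, alt).
mutect2 = {
    ("S1", "chr1", 100, "G", "T"),
    ("S1", "chr2", 200, "C", "A"),
    ("S2", "chr1", 150, "A", "G"),
    ("S3", "chr7", 300, "T", "C"),
    ("S3", "chr7", 310, "G", "A"),
    ("S3", "chr9", 400, "C", "T"),
}
strelka2 = {
    ("S1", "chr1", 100, "G", "T"),
    ("S2", "chr1", 150, "A", "G"),
    ("S2", "chr3", 500, "G", "C"),
    ("S3", "chr7", 300, "T", "C"),
    ("S3", "chr7", 310, "G", "A"),
    ("S3", "chr9", 400, "C", "T"),
    ("S3", "chr9", 450, "A", "T"),
}


def venn_counts(first, second):
    """Sizes of the three regions of a two-set Venn diagram."""
    return {
        "both": len(first & second),
        "first_only": len(first - second),
        "second_only": len(second - first),
    }


def per_sample_counts(calls):
    """Number of calls per sample (the sample is the first key field)."""
    return Counter(call[0] for call in calls)


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


counts = venn_counts(mutect2, strelka2)
m_counts = per_sample_counts(mutect2)
s_counts = per_sample_counts(strelka2)
samples = sorted(set(m_counts) | set(s_counts))
r = pearson([m_counts[s] for s in samples], [s_counts[s] for s in samples])
print(counts, round(r, 3))
```

The per-gene version would be the same idea with a gene symbol in place of the sample field, and subsetting by mutation type amounts to filtering both sets before calling these helpers.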

cgreene commented 5 years ago

I like those ideas!

On Wed, Jul 31, 2019, 6:44 AM Candace Savonen notifications@github.com wrote:

Here's what I'm thinking for first line analyses:

  • Venn diagrams of individual mutations detected by Mutect2 vs Strelka2
  • Correlation of total # mutations called for each sample
  • Correlation of total # mutations called for each gene

We can then do these analyses and subset these by the type of mutation. Does this sound reasonable as a first pass?

cansavvy commented 5 years ago

Here are the equivalents of the figure examples noted here. I need to neaten up the code some more and will get it into a better spot tomorrow.

Figure A:

Screen Shot 2019-08-07 at 3 51 25 PM

Figure B:

Screen Shot 2019-08-07 at 3 01 10 PM

Figure C:

Screen Shot 2019-08-07 at 3 50 34 PM

jharenza commented 5 years ago

This looks really great! Interesting - most of the mutations called by only one algorithm are subclonal (VAF <0.2). I think the next question is, are any of these artifacts/low confidence, and do we keep all/some/none?
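
As a sketch of how the VAF < 0.2 observation could be quantified per call set: the 0.2 cutoff comes from the comment above, while the toy records, the `call_set` labels, and the use of MAF-style `t_ref_count` / `t_alt_count` fields are assumptions for illustration only.

```python
def vaf(ref_count, alt_count):
    """Variant allele fraction: alt reads over total reads at the site."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else 0.0


def fraction_below(calls, call_set, threshold=0.2):
    """Fraction of calls in one call set whose VAF is below the threshold."""
    vafs = [
        vaf(c["t_ref_count"], c["t_alt_count"])
        for c in calls
        if c["call_set"] == call_set
    ]
    if not vafs:
        return 0.0
    return sum(v < threshold for v in vafs) / len(vafs)


# Hypothetical toy records, not real PBTA data.
calls = [
    {"call_set": "both", "t_ref_count": 50, "t_alt_count": 50},
    {"call_set": "both", "t_ref_count": 70, "t_alt_count": 30},
    {"call_set": "both", "t_ref_count": 95, "t_alt_count": 5},
    {"call_set": "mutect2_only", "t_ref_count": 90, "t_alt_count": 10},
    {"call_set": "mutect2_only", "t_ref_count": 85, "t_alt_count": 15},
    {"call_set": "mutect2_only", "t_ref_count": 60, "t_alt_count": 40},
    {"call_set": "strelka2_only", "t_ref_count": 92, "t_alt_count": 8},
]

for name in ("both", "mutect2_only", "strelka2_only"):
    print(name, round(fraction_below(calls, name), 2))
```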

cansavvy commented 5 years ago

I think the next question is, are any of these artifacts/low confidence, and do we keep all/some/none?

Do you have suggestions for analyses to approach this question?

jharenza commented 5 years ago

This is where another algorithm could help :) (in the works). Can you explore whether these are mostly coding/noncoding/synonymous/nonsynonymous/indels? Perhaps we focus here on coding variants only, which have some predicted functional consequence and are eventually what we will show in oncoprints (or split coding/noncoding into separate plots). That may help reduce some of these numbers.

Something I was just reminded of, which someone brought to our attention: for some dinucleotide changes, the c. and p. changes can differ between algorithms. One algorithm may call a single GG>TT while the other makes two consecutive calls of G>T and G>T, so the calls would not overlap between the two algorithms. I am not sure whether investigating this can be automated somehow (I haven't looked at how much these variants differ, or whether they are as simple as that example).
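
One possible way to automate the dinucleotide check, sketched under assumptions: calls are keyed as (sample, chrom, pos, ref, alt), and two substitutions at consecutive positions in the same sample are treated as one event. That second assumption is a heuristic, since adjacent calls are not guaranteed to sit on the same reads, and the function name is made up for this example.

```python
def merge_adjacent_snvs(calls):
    """Collapse substitutions at consecutive positions into one multi-base call.

    Only equal-length substitutions are merged; indels pass through unchanged.
    """
    merged = []
    for sample, chrom, pos, ref, alt in sorted(calls):
        if merged:
            p_sample, p_chrom, p_pos, p_ref, p_alt = merged[-1]
            is_substitution = len(ref) == len(alt) and len(p_ref) == len(p_alt)
            is_adjacent = (
                (sample, chrom) == (p_sample, p_chrom)
                and pos == p_pos + len(p_ref)
            )
            if is_substitution and is_adjacent:
                # Extend the previous call instead of starting a new one.
                merged[-1] = (p_sample, p_chrom, p_pos, p_ref + ref, p_alt + alt)
                continue
        merged.append((sample, chrom, pos, ref, alt))
    return set(merged)


# Hypothetical example mirroring the GG>TT case described above.
caller_a = {("S1", "chr1", 100, "GG", "TT")}
caller_b = {("S1", "chr1", 100, "G", "T"), ("S1", "chr1", 101, "G", "T")}

print(len(caller_a & caller_b))  # raw overlap
print(len(merge_adjacent_snvs(caller_a) & merge_adjacent_snvs(caller_b)))
```

Running both callers' output through the same normalization before intersecting keeps the comparison symmetric; counting how many caller-specific calls disappear after merging would show how much of the discordance this explains.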

gonzolgarcia commented 5 years ago

I would recommend replicating panels D-E from my figure. This will give you an idea of whether there are biases in the frequency of changes.

cansavvy commented 5 years ago

@jharenza and @gonzolgarcia, here's the equivalent plot. Thoughts?

Screen Shot 2019-08-08 at 11 16 44 AM

gonzolgarcia commented 5 years ago

Probably would be better to represent as percentages so that the bars are comparable.

What you need (and my figure also lacks) is a reference/gold standard that tells you what the expected frequencies are, to compare with this observation (some literature search could help). What is clear is that there is a certain bias between the 'both' calls and the Mutect2-only and Strelka2-only calls.
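
A small sketch of the percentage normalization, assuming "changes" here means single-base substitution classes and that each record carries a call-set label. The collapse to a pyrimidine reference (so G>T is counted as C>A) is a common convention rather than something stated in this thread, and the toy records are made up.

```python
from collections import Counter, defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def base_change(ref, alt):
    """Label a substitution with a pyrimidine reference, e.g. G>T becomes C>A."""
    if ref in "AG":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def change_percentages(records):
    """Percent of each substitution class within each call set."""
    counts = defaultdict(Counter)
    for call_set, ref, alt in records:
        counts[call_set][base_change(ref, alt)] += 1
    return {
        call_set: {
            change: 100 * n / sum(changes.values())
            for change, n in changes.items()
        }
        for call_set, changes in counts.items()
    }


# Hypothetical toy records: (call_set, ref, alt).
records = [
    ("both", "C", "T"),
    ("both", "G", "A"),
    ("both", "C", "A"),
    ("both", "T", "C"),
    ("mutect2_only", "G", "T"),
    ("mutect2_only", "C", "A"),
]

percentages = change_percentages(records)
print(percentages)
```

Because each call set sums to 100, bars for 'both', Mutect2-only, and Strelka2-only become directly comparable even when the sets differ a lot in size.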

cansavvy commented 5 years ago

From your comment on PR #69, I have a follow-up question:

predicted damaging by one or more of: SIFT, PolyPhen, or ClinVar?

Would you prefer these data be presented as their numeric values or as their reported categories, e.g. 'tolerated_low_confidence'?

jharenza commented 5 years ago

I would say categories.
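
For reference, a sketch of pulling the category out of an annotation value. It assumes the values look like `category(score)`, as in `tolerated_low_confidence(0.12)`, or are a bare category such as `pathogenic`. Which categories count as "damaging" is also an assumption made for this example, not a decision from this thread.

```python
import re

# Matches "category" or "category(score)".
PREDICTION = re.compile(
    r"^(?P<category>[A-Za-z_]+)(?:\((?P<score>[0-9]*\.?[0-9]+)\))?$"
)

# Assumed set of categories treated as damaging; adjust as needed.
DAMAGING = {
    "deleterious",
    "deleterious_low_confidence",
    "probably_damaging",
    "possibly_damaging",
    "pathogenic",
    "likely_pathogenic",
}


def split_prediction(value):
    """Return (category, score); either part is None when it is absent."""
    match = PREDICTION.match(value.strip())
    if match is None:
        return None, None
    score = match.group("score")
    return match.group("category"), (float(score) if score is not None else None)


def predicted_damaging(*values):
    """True if one or more annotation values fall in a damaging category."""
    return any(split_prediction(v)[0] in DAMAGING for v in values if v)


print(split_prediction("tolerated_low_confidence(0.12)"))
print(predicted_damaging("tolerated(0.4)", "probably_damaging(0.99)", ""))
```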

cansavvy commented 5 years ago

I have the files that list the pathogenic variants detected by each algorithm or by both, as we discussed on PR #76.

If someone is willing/able to look at the set in the BAM files then I think we should ask for @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and both callers.

Before I hand them over, I can shuffle and randomly select as @cgreene suggested, but how many would you like me to select and send over?
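
A minimal sketch of the shuffle-and-select step, assuming the variants are already grouped into Mutect2 only, Strelka2 only, and both. The number per group (`n`) and the seed are placeholders, since the count is exactly what is being asked here, and the variant IDs are invented.

```python
import random


def sample_per_group(variants_by_group, n, seed=2019):
    """Randomly pick up to n variants from each group, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the same list can be regenerated
    picked = {}
    for group in sorted(variants_by_group):
        # Sort first so the result does not depend on input ordering.
        variants = sorted(variants_by_group[group])
        picked[group] = rng.sample(variants, min(n, len(variants)))
    return picked


# Hypothetical variant identifiers, not real calls.
variants_by_group = {
    "mutect2_only": [f"m{i}" for i in range(40)],
    "strelka2_only": [f"s{i}" for i in range(25)],
    "both": [f"b{i}" for i in range(3)],
}

selection = sample_per_group(variants_by_group, n=10)
for group, variants in selection.items():
    print(group, len(variants))
```

Sampling the same number from each group keeps the manual BAM review balanced across the three categories, and groups smaller than `n` are simply returned in full.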

cansavvy commented 5 years ago

Plan is as laid out by @cgreene in PR #76

We will re-approach this type of analysis when the other variant callers [Lancet and Vardict have been run](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/76#issuecomment-525265821), and we will compare all 4 algorithms in a similar analysis.