Closed by jharenza 5 years ago
I've gotten the data fairly well wrangled and loaded. What kinds of stats should I focus on evaluating the similarities/differences between these two? What kind of comparison plots would you like to see?
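Here's what I'm thinking for first-line analyses:
- Venn diagrams of individual mutations detected by Mutect2 vs Strelka2
- Correlation of total # mutations called for each sample
- Correlation of total # mutations called for each gene

We can then repeat these analyses on subsets defined by the type of mutation. Does this sound reasonable as a first pass?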
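For concreteness, here is a minimal sketch of the caller overlap (the counts behind a Venn diagram) and the per-sample correlation. It assumes each call can be reduced to a `(sample, chrom, pos, ref, alt)` tuple; the toy calls and the helper names `overlap_counts`, `per_sample_counts`, and `pearson` are made up for illustration and are not from the project code.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy calls, keyed as (sample, chrom, pos, ref, alt).
mutect2 = {
    ("S1", "chr1", 100, "G", "T"),
    ("S1", "chr2", 200, "C", "A"),
    ("S2", "chr1", 150, "A", "G"),
    ("S3", "chr3", 300, "T", "C"),
    ("S3", "chr4", 400, "G", "A"),
    ("S3", "chr5", 500, "C", "T"),
}
strelka2 = {
    ("S1", "chr1", 100, "G", "T"),
    ("S2", "chr1", 150, "A", "G"),
    ("S3", "chr3", 300, "T", "C"),
    ("S3", "chr4", 400, "G", "A"),
    ("S3", "chr9", 900, "A", "C"),
}


def overlap_counts(first, second):
    """Return the three region sizes of a two-set Venn diagram."""
    return {
        "first_only": len(first - second),
        "both": len(first & second),
        "second_only": len(second - first),
    }


def per_sample_counts(calls, samples):
    """Count calls per sample, in the order given by `samples`."""
    counts = Counter(call[0] for call in calls)
    return [counts.get(sample, 0) for sample in samples]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


samples = sorted({call[0] for call in mutect2 | strelka2})
venn = overlap_counts(mutect2, strelka2)
r = pearson(per_sample_counts(mutect2, samples),
            per_sample_counts(strelka2, samples))
print(venn, round(r, 3))
```

Swapping the sample name for a gene symbol in the key would give the per-gene version of the same correlation.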
I like those ideas!
On Wed, Jul 31, 2019, 6:44 AM Candace Savonen notifications@github.com wrote:
> Here's what I'm thinking for first line analyses:
> - Venn diagrams of individual mutations detected by Mutect2 vs Strelka2
> - Correlation of total # mutations called for each sample
> - Correlation of total # mutations called for each gene
>
> We can then do these analyses and subset these by the type of mutation. Does this sound reasonable as a first pass?
Here are the equivalents of the figure examples noted here. I still need to neaten up the code and will get it into a better spot tomorrow.
Figure A:
Figure B:
Figure C:
This looks really great! Interesting - most of the mutations called by only one algorithm are subclonal (VAF <0.2). I think the next question is, are any of these artifacts/low confidence, and do we keep all/some/none?
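One way to put a number on that observation is sketched below. It assumes VAF is computed as alternate read count over total depth (the `t_alt_count` and `t_depth` fields are assumed to be available per call), and the toy records and the `group` labels are hypothetical.

```python
from collections import Counter

SUBCLONAL_VAF = 0.2  # threshold taken from the comment above

# Hypothetical toy records; "group" says which caller(s) reported the call.
calls = [
    {"group": "both", "t_alt_count": 30, "t_depth": 60},
    {"group": "both", "t_alt_count": 20, "t_depth": 50},
    {"group": "both", "t_alt_count": 5, "t_depth": 50},
    {"group": "mutect2_only", "t_alt_count": 3, "t_depth": 40},
    {"group": "mutect2_only", "t_alt_count": 4, "t_depth": 50},
    {"group": "strelka2_only", "t_alt_count": 2, "t_depth": 40},
    {"group": "strelka2_only", "t_alt_count": 12, "t_depth": 40},
]


def vaf(call):
    """Variant allele fraction; 0.0 when there is no coverage."""
    depth = call["t_depth"]
    return call["t_alt_count"] / depth if depth else 0.0


def subclonal_fraction(records, threshold=SUBCLONAL_VAF):
    """Fraction of calls in each group with VAF below the threshold."""
    totals, below = Counter(), Counter()
    for call in records:
        totals[call["group"]] += 1
        if vaf(call) < threshold:
            below[call["group"]] += 1
    return {group: below[group] / totals[group] for group in totals}


fractions = subclonal_fraction(calls)
print(fractions)
```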
> I think the next question is, are any of these artifacts/low confidence, and do we keep all/some/none?
Do you have suggestions for analyses to approach this question?
This is where another algorithm could help :) (in the works). Can you explore whether these are mostly coding/noncoding/synonymous/nonsynonymous/indels? Perhaps we focus here on coding variants only, which have some predicted functional consequence and are eventually what we will show in oncoprints (or we split coding/noncoding into separate plots). That may help reduce some of these numbers.

Something I was just reminded of, which someone brought to our attention: for some dinucleotide changes, the c. and p. changes can differ between algorithms. One algorithm may call a single GG>TT while another makes two consecutive calls of G>T and G>T, so the calls would not overlap between the two algorithms. I am not sure whether investigating this can be automated somehow (I haven't looked at how much these variants differ or whether they are as simple as that example).
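If the cases really are as simple as that example, one possible way to automate the check is to collapse SNVs at consecutive positions into a single multi-nucleotide record before comparing callers. The sketch below assumes substitution-only calls keyed as `(sample, chrom, pos, ref, alt)`; `merge_adjacent_snvs` is a hypothetical helper, and merging adjacent calls is only a normalization for comparison, since adjacent SNVs are not guaranteed to sit on the same haplotype.

```python
def merge_adjacent_snvs(calls):
    """Collapse substitutions at consecutive positions (same sample and
    chromosome) into one multi-nucleotide record."""
    merged = []
    for sample, chrom, pos, ref, alt in sorted(calls):
        if merged:
            p_sample, p_chrom, p_pos, p_ref, p_alt = merged[-1]
            same_locus = (p_sample, p_chrom) == (sample, chrom)
            # The previous record ends exactly where this one starts.
            if same_locus and p_pos + len(p_ref) == pos:
                merged[-1] = (sample, chrom, p_pos, p_ref + ref, p_alt + alt)
                continue
        merged.append((sample, chrom, pos, ref, alt))
    return set(merged)


# Hypothetical toy calls mirroring the GG>TT example.
caller_a = {("S1", "chr1", 100, "GG", "TT")}
caller_b = {("S1", "chr1", 100, "G", "T"), ("S1", "chr1", 101, "G", "T")}

raw_overlap = caller_a & caller_b
normalized_overlap = merge_adjacent_snvs(caller_a) & merge_adjacent_snvs(caller_b)
print(len(raw_overlap), len(normalized_overlap))
```

Comparing the overlap before and after normalization would show how many of the single-caller calls are explained by this representation difference alone.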
I would recommend replicating panels D-E from my figure. This will give you an idea of whether there are biases in the frequency of changes.
@jharenza and @gonzolgarcia, here's the equivalent plot. Thoughts?
Probably would be better to represent as percentages so that the bars are comparable.
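A minimal sketch of that normalization, assuming the plot is built from raw counts of each change type per group. The counts below are invented for illustration, and `to_percentages` is a hypothetical helper.

```python
# Hypothetical counts of base changes per group; not real results.
counts_by_group = {
    "both": {"C>T": 500, "C>A": 300, "T>C": 200},
    "mutect2_only": {"C>T": 20, "C>A": 60, "T>C": 20},
    "strelka2_only": {"C>T": 10, "C>A": 10, "T>C": 30},
}


def to_percentages(counts):
    """Rescale one group's counts so they sum to 100."""
    total = sum(counts.values())
    if total == 0:
        return {change: 0.0 for change in counts}
    return {change: 100 * n / total for change, n in counts.items()}


percent_by_group = {
    group: to_percentages(counts) for group, counts in counts_by_group.items()
}
for group, percents in percent_by_group.items():
    print(group, percents)
```

With every group summing to 100, the bars are comparable even though the 'both' group has far more calls than either single-caller group.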
What you need (and my figure also lacks) is a reference/gold standard that tells you the expected frequencies to compare these observations against (a literature search could help). What is clear is that there is a certain bias between the 'both' calls and the Mutect2-only and Strelka2-only calls.
From your comment on PR #69, I have a follow-up question:
> predicted damaging by one or more of: SIFT, PolyPhen, or ClinVar?

Would you prefer these data be presented as their numeric values or as their reported categories, e.g. 'tolerated_low_confidence'?
I would say categories.
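If the annotation comes as a single string that carries both pieces, the category can be split off while keeping the score around. This sketch assumes values shaped like `category(score)`, e.g. `tolerated_low_confidence(0.12)`; that format is an assumption about the annotation output, and `split_prediction` is a hypothetical helper.

```python
import re

# Assumed format: a category label followed by a score in parentheses.
_PREDICTION = re.compile(r"^(?P<category>[A-Za-z_]+)\((?P<score>[0-9.]+)\)$")


def split_prediction(value):
    """Return (category, score), or (None, None) if the value does not parse."""
    match = _PREDICTION.match(value.strip()) if value else None
    if match is None:
        return None, None
    return match.group("category"), float(match.group("score"))


print(split_prediction("tolerated_low_confidence(0.12)"))
```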
I have the files that list the pathogenic variants detected by each algorithm or by both, as we discussed on PR #76.
> If someone is willing/able to look at the set in the BAM files then I think we should ask for @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and both callers.
Before I hand them over, I can shuffle and randomly select as @cgreene suggested, but how many would you like me to select and send over?
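For reference, a minimal sketch of the shuffle-and-select step with a fixed seed so the selection is reproducible. The number to select is still an open question here, so `n` is only a placeholder, and the variant IDs and the `sample_variants` helper are hypothetical.

```python
import random


def sample_variants(variants, n, seed=2019):
    """Shuffle with a fixed seed and return the first n variants
    (all of them if fewer than n are available)."""
    rng = random.Random(seed)
    shuffled = list(variants)
    rng.shuffle(shuffled)
    return shuffled[:n]


# Hypothetical IDs; one list per group (Mutect2 only, Strelka2 only, both).
mutect2_only_ids = [f"variant_{i}" for i in range(1, 11)]

picked = sample_variants(mutect2_only_ids, n=3)  # n=3 is a placeholder
print(picked)
```

Passing a list in a stable order (for example, sorted by position) keeps the result identical across runs.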
Plan is as laid out by @cgreene in PR #76
We will re-approach this type of analysis when the other variant callers [Lancet and Vardict have been run](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/76#issuecomment-525265821) and we will compare all 4 algorithms in a similar analysis.
Evaluate concordance and determine overall set of high-confidence somatic variants for PBTA.