PoisonAlien / maftools

Summarize, Analyze and Visualize MAF files from TCGA or in-house studies.
http://bioconductor.org/packages/release/bioc/html/maftools.html
MIT License

tcgacompare using WGS data (question) #374

Closed: ahwanpandey closed this issue 5 years ago

ahwanpandey commented 5 years ago

Hello,

I have a question regarding mutation burden that I hope you can give me some insight on. I have WGS data for two cohorts, which have been sequenced at different depths:

[image: sequencing depth summary for the two cohorts]

Now, would I be able to directly compare the mutation burden of these two cohorts against the TCGA samples using tcgaCompare? Two things seem very different here:

  1. all my samples are WGS and the tcgaCompare plot is for WXS data.
  2. my cohorts have been sequenced at different depths.

Thanks.

PoisonAlien commented 5 years ago

Hi, theoretically you can compare against any cohort. The underlying test is a simple Fisher's test, which checks for differences in ratios based on read counts.

  1. tcgaCompare only focuses on non-synonymous variants, so you should be fine. You should be worried only if you're looking at non-exonic variants.
  2. This should be fine too IMO.
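A minimal sketch of such a call (the MAF path and cohort name are placeholders; check `?tcgaCompare` in your maftools version, e.g. for whether `capture_size`/`tcga_capture_size` arguments are available to adjust the per-Mb scaling for WGS):

```r
library(maftools)

# Read the in-house WGS cohort (file name is a placeholder)
wgs_maf <- read.maf(maf = "my_wgs_cohort.maf")

# Compare mutation load against the TCGA cohorts; cohortName is just a label
tcgaCompare(maf = wgs_maf, cohortName = "My-WGS-cohort")
```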
DarioS commented 5 years ago

I think the read depth would make a difference for some variant calls. What about analysing the cohorts separately and reporting them individually? You have lots of samples in both. It would be interesting to see how many variants you find in the cohort with higher mean depth that you do not find in the one with lower mean depth.
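As a rough sketch of that kind of check (the per-cohort MAF paths are placeholders, and this assumes the variant coordinates and alleles are in the standard MAF columns held in the object's `@data` slot):

```r
library(maftools)

# Hypothetical per-cohort MAF files
high_cov <- read.maf(maf = "cohort_high_depth.maf")
low_cov  <- read.maf(maf = "cohort_low_depth.maf")

# Simple chrom:pos:ref:alt keys for the unique variants in each cohort
variant_key <- function(m) {
  d <- m@data
  unique(paste(d$Chromosome, d$Start_Position,
               d$Reference_Allele, d$Tumor_Seq_Allele2, sep = ":"))
}

hi_vars <- variant_key(high_cov)
lo_vars <- variant_key(low_cov)

# Variants seen only in the deeper cohort: a crude indication of how much
# extra depth (rather than biology) might be contributing
length(setdiff(hi_vars, lo_vars))
```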

ahwanpandey commented 5 years ago

The issue here is that we are worried the higher mutation burden we are seeing in the high-coverage group is due to the higher coverage and not the biology. Since each cohort is so homogeneous in its coverage distribution, the depth difference between the cohorts is systematic, and it doesn't seem right to compare them directly with a metric like mutations_per_mb. The higher-coverage data would have an advantage in detecting low-frequency variants, or simply more power to detect variants in general (we are using Mutect2, VarScan2, VarDict and Strelka2 with their defaults).

We have thought of things like downsampling the data, applying a higher VAF cut-off, or even defining a metric like mutation_burden = mutations_per_mb / coverage.
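As a toy illustration of that last idea (the numbers and column names are made up; this is just the arithmetic, not an established normalisation):

```r
# Hypothetical per-sample summary
samples <- data.frame(
  sample      = c("S1", "S2", "S3"),
  n_mutations = c(12000, 18000, 9000),   # somatic variant calls
  callable_mb = c(2800, 2800, 2800),     # callable genome size in Mb
  mean_cov    = c(90, 60, 90)            # mean sequencing depth
)

# Naive coverage-adjusted burden: (mutations per Mb) / mean coverage
samples$mut_per_mb      <- samples$n_mutations / samples$callable_mb
samples$adjusted_burden <- samples$mut_per_mb / samples$mean_cov
samples
```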

Do you have any suggestions or experience with this situation, where each cohort has such a homogeneous, but different, coverage distribution?

PoisonAlien commented 5 years ago

Hello, these issues are inherent to sequencing. A simple solution I could suggest is to genotype, in every sample, all of the variants detected across all samples.

  1. Assuming you have a VCF per sample, merge all the VCFs to generate a consensus set that includes every unique variant detected across all samples.
  2. Genotype this consensus set of variants in every sample. GATK has a Genotype Given Alleles (GGA) mode which takes a VCF file and only genotypes those sites. I am pretty sure Strelka2 also has this feature.
  3. Once you do this, you can be sure that variants which would otherwise have been missed in low-coverage samples have been genotyped. You can then, for each sample, filter out variants genotyped as reference/germline and keep those with alt-allele support. This way you will rescue missed somatic variants (a rough sketch of this filtering step follows the list).
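A rough sketch of the filtering in step 3, assuming steps 1 and 2 (merging the VCFs and force-genotyping) have already been run with your tools of choice and exported to a long-format table (the file name and column names are hypothetical):

```r
# One row per variant per sample, with ref/alt read counts from the
# force-genotyped VCFs (exported outside R)
calls <- read.delim("forced_genotypes.tsv", stringsAsFactors = FALSE)
# Assumed columns: sample, chrom, pos, ref, alt, ref_count, alt_count

# Drop per-sample calls genotyped as (near-)pure reference and keep those
# with minimal alternate-allele support; the threshold here is arbitrary
min_alt_reads <- 3
rescued <- calls[calls$alt_count >= min_alt_reads, ]

# Number of variants retained per sample after force-calling
table(rescued$sample)
```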

This is also what we did in one of our projects, where we performed multi-region sequencing of a tumor and had to force-call a consensus set of variants due to coverage differences.

I hope my explanation was clear.

ahwanpandey commented 5 years ago

Hi @PoisonAlien, I think our cohort is too heterogeneous to apply that strategy. I can see how it would work for germline data (like GATK joint calling) or for data from a single patient/cell line, where the mutation spectra would be similar. Each sample in our cohort is from a different patient, so they would not necessarily share mutation sites.

But thanks for sharing your experience!