Griffan / VerifyBamID

VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using an ancestry-agnostic method.
http://griffan.github.io/VerifyBamID/

VerifyBamID v2 vs v1 #25

Closed xtmgah closed 3 years ago

xtmgah commented 3 years ago

Hello, I am trying to use V2 to quantify contamination and ancestry for both tumor and normal samples in a large cohort, which includes tumor and paired normal WGS samples from different populations. I found some odd results, especially when comparing V2 with V1.

For V1, I used the recommended Omni25_genotypes_1525_samples_v2.b37.PASS.ALL.sites.vcf.gz as the --vcf input. For V2, I used the parameter "--SVDPrefix /data/zhangt8/Ref/VerifyBam2/VerifyBamID/resource/1000g.phase3.100k.b38.vcf.gz.dat".

With V1, no samples had FREEMIX >5%. With V2, no normal samples but many tumor samples (348/1200) had FREEMIX >5%. The population ancestry looks fine for both tumor and normal samples (from IntendedSample) when plotted together with 1KG.

My questions: why does V2 report so many tumor samples with higher FREEMIX? Should I use a larger number of markers (since the V2 resource has only 100k markers)? In the Ancestry file, what is the "ContaminatingSample" column? The PCs in this column look much larger than those for "IntendedSample".

Thanks.

Griffan commented 3 years ago

Could you confirm that you are using resource files with the correct build version? Your example shows b37 for V1 but b38 for V2.

ContaminatingSample refers to the sample that the contaminating DNA comes from. IntendedSample is the sample you are trying to test.

xtmgah commented 3 years ago

@Griffan Thanks. For V1, we ran the hg19 BAM files. After that, we realigned the reads to the hg38 genome (CRAM) and ran V2. I just can't figure out why the results are so different. Any idea?

xtmgah commented 3 years ago

@Griffan I recently reran V1 on the hg38 CRAM files with Omni25_genotypes_1525_samples_v2.b38.PASS.ALL.sites.vcf.gz; the results look very similar to V1 on the hg19 BAMs. So I think V2 may have an issue estimating FREEMIX?

In addition, the ContaminatingSample PCs have much larger values than the IntendedSample PCs. I guess this is normal, right?

Griffan commented 3 years ago

Since you mentioned that only tumor samples showed larger FREEMIX, I think it might be a result of aneuploidy. You can try excluding SNPs in CNV regions and see if this resolves the issue. FREEMIX and the PCs are jointly estimated, so if any one of these estimates is unstable, the others won't hold either. Also, there is usually very little DNA from the contaminating sample in the dataset, so its estimates can have higher variance, but the PC values are still expected to be within its corresponding region.
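
As a rough illustration of the "exclude SNPs in CNV regions" step, here is a minimal Python sketch that drops marker positions falling inside a set of CNV intervals. Everything in it is an assumption for illustration: the function names (`build_index`, `in_cnv`, `filter_markers`) are hypothetical, markers are taken to be `(chrom, pos)` tuples with 0-based positions, and CNV calls are taken to be half-open `(chrom, start, end)` intervals as in BED. It covers only the filtering itself; how the reduced marker set is then supplied to VerifyBamID2 is not shown here.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(cnv_regions):
    """Group half-open (chrom, start, end) intervals by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in cnv_regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent interval: extend the previous one.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        # Store parallel sorted lists so lookups can use binary search.
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_cnv(index, chrom, pos):
    """Return True if the 0-based position lies in any [start, end) interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def filter_markers(markers, cnv_regions):
    """Keep only the (chrom, pos) markers that fall outside every CNV interval."""
    index = build_index(cnv_regions)
    return [(chrom, pos) for chrom, pos in markers if not in_cnv(index, chrom, pos)]


if __name__ == "__main__":
    markers = [("chr1", 100), ("chr1", 250), ("chr2", 100)]
    cnv_regions = [("chr1", 200, 300)]
    print(filter_markers(markers, cnv_regions))
```

Chromosome names must use the same convention (for example `chr1` versus `1`) in both inputs, otherwise nothing will be filtered.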

hyunminkang commented 3 years ago

Most likely V1 underestimates the contamination rather than V2 overestimating it. V2 is more sensitive in detecting contamination, especially for non-Europeans.

Aneuploidy might be a reason, but it is hard to tell what caused the high contamination estimates without looking at the details.

Thanks, Hyun.

xtmgah commented 3 years ago

Hello @hyunminkang, thanks for the information. Do you know why aneuploidy would cause high contamination estimates, and how we can adjust for it? We found that most samples with high V2 FREEMIX values are likely to have WGD. Is there any way to account for this? And what threshold could we apply to filter out highly contaminated tumor samples? Thanks

hyunminkang commented 3 years ago

In short, FREEMIX relies on the heterozygosity in the genotype data. If there are more heterozygous variants than expected under HWE, FREEMIX treats this as evidence of contamination.

When aneuploidy happens, it can create "somatic homozygotes (or hemizygotes)" or "somatic heterozygotes (or polyploidy)", depending on how the deletion or duplication happens. If somatic homozygotes occur more frequently, FREEMIX may be underestimated. If somatic heterozygotes occur more frequently, FREEMIX may be overestimated.

If you expect a high degree of somatic aneuploidy in your sample, there is no easy way to modify verifyBamID2 to estimate FREEMIX reliably. You need a different method to estimate the contamination tailored to somatic mutation.
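
To make the intuition above a bit more concrete, here is a toy Python calculation of the expected ALT read fraction at a single site. It is only an illustration under simplifying assumptions (no sequencing error, no mapping bias, a pure tumor in the somatic case) and is not VerifyBamID2's likelihood model; the function names are hypothetical.

```python
from fractions import Fraction


def mixture_alt_fraction(g_intended, g_contam, alpha):
    """Expected ALT read fraction when two diploid samples are mixed.

    g_intended, g_contam: ALT allele counts (0, 1 or 2) of each genotype.
    alpha: fraction of reads that come from the contaminating sample.
    """
    alpha = Fraction(alpha)
    return (1 - alpha) * Fraction(g_intended, 2) + alpha * Fraction(g_contam, 2)


def somatic_alt_fraction(alt_copies, total_copies):
    """Expected ALT read fraction in an uncontaminated sample with altered copy number."""
    return Fraction(alt_copies, total_copies)


if __name__ == "__main__":
    # Homozygous-reference site, 10% of reads from a homozygous-ALT contaminant:
    print(mixture_alt_fraction(0, 2, Fraction(1, 10)))
    # Germline heterozygous site after loss of the ALT allele (hemizygous):
    print(somatic_alt_fraction(0, 1))
    # Germline heterozygous site after duplication of the ALT allele:
    print(somatic_alt_fraction(2, 3))
```

In this simplified picture, loss of one allele makes a germline heterozygous site look homozygous, while a gain leaves it heterozygous but shifts the read fraction away from one half. A model that assumes diploid genotypes has no copy-number term to explain either pattern.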

Thanks, Hyun.

xtmgah commented 3 years ago

Thanks. That makes sense. Do you have any recommendations for a different method to estimate contamination in tumor samples with aneuploidy?

Thanks.