VanLoo-lab / ascat

ASCAT R package
https://www.mdanderson.org/research/departments-labs-institutes/labs/van-loo-laboratory/resources.html#ASCAT

germline A/B counts #117

Closed imendizabalCIC closed 2 years ago

imendizabalCIC commented 2 years ago

Dear ASCAT Team,

I am interested in using ASCAT on SNP array data from tumors without matched blood samples, so I have followed your instructions with the example data as follows:

Running ASCAT without matched normal data. ASCAT can also run without matched normal data, and can infer the necessary germline genotypes from the tumour data:

library(ASCAT)
ascat.bc = ascat.loadData(Tumor_LogR_file = "Tumor_LogR.txt", Tumor_BAF_file = "Tumor_BAF.txt", gender = rep('XX',100), genomeVersion = "hg19")
ascat.plotRawData(ascat.bc, img.prefix = "Beforecorrection")
ascat.bc = ascat.correctLogR(ascat.bc, GCcontentfile = "GC_example.txt", replictimingfile = "RT_example.txt")
ascat.plotRawData(ascat.bc, img.prefix = "Aftercorrection")
gg = ascat.predictGermlineGenotypes(ascat.bc, platform = "AffySNP6")
ascat.bc = ascat.aspcf(ascat.bc, ascat.gg=gg)
ascat.plotSegmentedData(ascat.bc)
ascat.output = ascat.runAscat(ascat.bc, write_segments = TRUE)
QC = ascat.metrics(ascat.bc,ascat.output)
save(ascat.bc, ascat.output, QC, file = 'ASCAT_objects.Rdata')

I see that I can get A and B allele counts for the tumor (from ascat.output$nA and $nB) but I would like to get those from the germline. The object gg$germlinegenotypes contains a matrix of TRUE/FALSE, but I would like to also get the A and B allele counts inferred for the germline. Does ASCAT provide them? Have you estimated how accurate these germline genotype calls are compared to using blood samples?

Thank you for the fantastic tool!

Isabel

tlesluyes commented 2 years ago

Hi Isabel,

ASCAT will not call germline copy-number variants (CNVs) in the normals because we assume that germline copy-number states are fixed: 1+1 for autosomes, PAR1 and PAR2, whereas nonPAR is 1+1 for females and 1+0 for males. Instead, ASCAT calls somatic copy-number alterations. If you are interested in germline CNV calling, bespoke methods should be used, because germline CNVs tend to be very small and require an appropriate methodology.
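To make the fixed-state assumption concrete, here is a minimal sketch of it as a lookup. `germline_state` and its arguments are hypothetical names for illustration only, not part of ASCAT, and the sketch covers only the cases listed above (autosomes, PAR, and nonPAR X).

```r
# Hypothetical helper (not an ASCAT function): returns the germline
# copy-number state assumed in the comment above.
germline_state <- function(chr, sex = c("XX", "XY"), in_par = FALSE) {
  sex <- match.arg(sex)
  chr <- as.character(chr)
  # Autosomes and pseudoautosomal regions (PAR1/PAR2) are assumed to be 1+1.
  if (chr %in% as.character(1:22) || (chr == "X" && in_par)) {
    return(c(nMajor = 1, nMinor = 1))
  }
  # nonPAR X: 1+1 for females, 1+0 for males.
  if (chr == "X") {
    if (sex == "XX") return(c(nMajor = 1, nMinor = 1))
    return(c(nMajor = 1, nMinor = 0))
  }
  stop("chromosome not covered by this sketch: ", chr)
}

germline_state(7, "XY")                   # autosome
germline_state("X", "XY")                 # nonPAR X in a male
germline_state("X", "XY", in_par = TRUE)  # PAR in a male
```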

The ascat.predictGermlineGenotypes function aims to infer, from tumour samples, which SNPs are heterozygous in the germline. This can be done by benchmarking flat samples for a given platform: because each platform has a fixed probe design, we can evaluate the fractions of heterozygous, homozygous and noisy SNPs we observe and leverage that information when processing new samples. But again, this assumes there are no germline CNVs, and the vast majority of samples don't have detectable ones. Large CNVs can be detected by running ascat.aspcf on the normals and checking whether any segment has an unusual logR/BAF profile.
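As a rough illustration of that last check, the sketch below flags segments of a normal sample whose mean logR sits far from 0. Everything here is an assumption for illustration: the `segments` table, its column names and the `logr_cutoff` value are made up, and a real check would also look at the segmented BAF.

```r
# Toy segment table for one normal sample. In practice this would be built
# from the segmented logR; the column names here are hypothetical.
segments <- data.frame(
  chr       = c("1", "1", "5", "12"),
  start     = c(1, 5.0e7, 1, 1),
  end       = c(5.0e7, 2.4e8, 1.8e8, 1.3e8),
  mean_logR = c(0.01, -0.45, 0.02, 0.30)
)

# Hypothetical cutoff: a flat (1+1) segment should have a mean logR near 0.
logr_cutoff <- 0.2
segments$suspicious <- abs(segments$mean_logR) > logr_cutoff

# Segments worth inspecting by eye.
segments[segments$suspicious, ]
```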

Cheers,

Tom.

imendizabalCIC commented 2 years ago

Hi Tom,

Thank you for your response. I apologize that my question was not clear enough. I am actually interested in obtaining the germline genotypes (AA, AB or BB), even assuming there are no CNVs. Could I get this information from the ASCAT output?

Thanks!

Isabel

tlesluyes commented 2 years ago

Hi Isabel,

Thanks for clarifying. Unfortunately, the standard ASCAT output and objects will not be very accurate (although informative) for that particular purpose.

In theory, the gg object (from ascat.predictGermlineGenotypes) could be used to infer germline genotypes as you propose (AB for heterozygous SNPs and AA/BB for the others, depending on their BAF), but in practice it is defined based on some strong assumptions.

One is that we are only interested in heterozygous SNPs, so our selection method tends to be quite specific but not very sensitive. It means that SNPs selected as heterozygous must have strong evidence, and a bunch of SNPs for which we're unsure will be treated as 'homozygous' so we can get rid of them. Therefore, using the gg object, you will find that many SNPs not called heterozygous are actually heterozygous (likely noisy, but still). This can be observed in the following plot (from #73), where blue, green and red mean homozygous, noisy and heterozygous, respectively. A bunch of SNPs shown in green are likely to be heterozygous, but we would rather discard them (which cleans the input, even though we lose a bit of resolution) than keep them.

https://user-images.githubusercontent.com/59569292/107258389-a2ac2880-6a33-11eb-8369-91c52fe5e673.png

Another assumption is that the metrics we benchmarked for a given platform hold true for most cases. In practice, such parameters could be a bit off for some borderline cases. For instance, cases with stretches of homozygosity will have a lot of germline LOH, way more than other samples. But again, we're pretty stringent, so we've tried to minimise the impact on our selection of SNPs. It means that the retained SNPs are very likely to be heterozygous (AB), but a significant fraction of the others would also be heterozygous.
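For illustration only, the "in theory" approach could look like the sketch below on toy data. This is not ASCAT functionality: `naive_genotypes` is a hypothetical helper, the BAF values and mask are made up, and it assumes that TRUE in the mask marks SNPs treated as homozygous (worth checking against gg$germlinegenotypes in your ASCAT version) and that the germline is 1+1. Given the caveats above, many AA/BB labels produced this way may really be AB.

```r
# Hypothetical helper: naive genotype labels from tumour BAF plus a
# logical mask. Assumption: TRUE in `is_hom` means "treated as homozygous".
naive_genotypes <- function(baf, is_hom) {
  stopifnot(length(baf) == length(is_hom))
  # Non-homozygous SNPs are labelled AB; the rest are split by BAF.
  gt <- ifelse(is_hom, ifelse(baf < 0.5, "AA", "BB"), "AB")
  # Under a 1+1 germline, the B-allele count follows from the label.
  nB <- unname(c(AA = 0, AB = 1, BB = 2)[gt])
  data.frame(genotype = gt, nA = 2 - nB, nB = nB)
}

# Toy inputs (made up): five SNPs from one sample.
baf    <- c(0.02, 0.48, 0.97, 0.55, 0.91)
is_hom <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

calls <- naive_genotypes(baf, is_hom)
calls
table(calls$genotype)
```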

I think APT, ChAS and other software are pretty good at calling genotypes. Pairing such information with allele-specific CNAs may help retrieve germline information. This will be very hard for cases with somatic LOH and high purity, though: the germline signal is diluted and the tumour BAF will be around 0.95/0.05, so it becomes impossible to tell which SNPs were heterozygous and which were homozygous.
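A quick back-of-the-envelope calculation shows where figures like 0.95/0.05 come from. The sketch assumes a simple mixture of tumour cells (fraction rho, with nA + nB copies) and normal cells (two copies). `expected_baf` is a hypothetical helper and the purity of 0.9 is an arbitrary example value.

```r
# Expected BAF at one SNP in a tumour/normal mixture.
#   rho         tumour purity (fraction of tumour cells)
#   nA, nB      copies of the A and B alleles in tumour cells
#   germline_nB copies of the B allele in normal cells (0, 1 or 2)
expected_baf <- function(rho, nA, nB, germline_nB = 1) {
  (rho * nB + (1 - rho) * germline_nB) /
    (rho * (nA + nB) + (1 - rho) * 2)
}

# Germline-heterozygous SNP in a region of copy-neutral LOH, purity 0.9:
expected_baf(rho = 0.9, nA = 2, nB = 0)  # B allele lost, about 0.05
expected_baf(rho = 0.9, nA = 0, nB = 2)  # A allele lost, about 0.95

# Germline-homozygous BB SNP in the same region:
expected_baf(rho = 0.9, nA = 0, nB = 2, germline_nB = 2)  # 1
```

With array noise, a BAF of about 0.95 from a heterozygous SNP is hard to separate from a BAF of about 1 from a homozygous one, which is the problem described above.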

Cheers,

Tom.

imendizabalCIC commented 2 years ago

Crystal clear, thank you so much for your explanations Tom!

Cheers,

Isabel

tlesluyes commented 2 years ago

No problem, closing the thread now.

Cheers,

Tom.