VanLoo-lab / ascat

ASCAT R package
https://www.mdanderson.org/research/departments-labs-institutes/labs/van-loo-laboratory/resources.html#ASCAT
162 stars 85 forks source link

ASCAT could not find an optimal ploidy and purity value for sample BEL_20C. #148

Closed chengwenxuan1997 closed 1 year ago

chengwenxuan1997 commented 1 year ago

I came across an error when trying to compute purity and ploidy on my sample. ASCAT could not find an optimal ploidy and purity value for sample BEL_20C. I doubt it is because there is no dominating tumor subclone based on the ascat plot. But I am not sure and want to ask you for suggestions. Many thanks in advance! The code is as follows.

ascat.bc <- ascat.loadData(
  Tumor_LogR_file = "BEL_20C_tumourLogR.txt",
  Tumor_BAF_file = "BEL_20C_tumourBAF.txt",
  Germline_LogR_file = "BEL_20C_normalLogR.txt",
  Germline_BAF_file = "BEL_20C_normalBAF.txt"
)
ascat.bc <- ascat.aspcf(ascat.bc)
ascat.plotSegmentedData(ascat.bc)
ascat.output <- ascat.runAscat(ascat.bc, pdfPlot = T, write_segments = F)

The output of ascat.prepareHTS is uploaded here. BEL_20C.zip

The ascat plot is as follows. BEL_20C ASPCF

tlesluyes commented 1 year ago

Hi @chengwenxuan1997,

ASCAT is unable to find a purity/ploidy fit because of the huge noise in the logR track (BAF looks okay though). Can you please provide some details on how logR and BAF were derived (platform, method)? Also, as indicated in the doc, we recommend correcting logR for covariates although I don't think it'll resolve such a level of noise.

Cheers,

Tom.

chengwenxuan1997 commented 1 year ago

Thank you very much for responding so quickly. Our WES samples were processed with the ascat.prepareHTS function. The detailed code is as follows. The genome version is hg38

BED.file <- list.files(file.path(ref.path, genome), pattern = "bed$", full.names = T)
probloci.file <- list.files(file.path(ref.path, genome), pattern = "^probloci.+txt.gz$", full.names = T)
ref.fasta <- list.files(file.path(ref.path, genome), pattern = "fa$", full.names = T)
alleles.prefix <- glue(file.path(ref.path, genome, "G1000_alleles*{genome}", "G1000_alleles_{genome}_chr"))
loci.prefix = glue(file.path(ref.path, genome, "G1000_loci*{genome}", "G1000_loci_{genome}_chr"))

# prepare logR from BAM
ascat.prepareHTS(tumourseqfile = file.path(data.path, SampleInfo$TumorBam)[i], 
                 normalseqfile = file.path(data.path, SampleInfo$NormalBam)[i],
                 tumourname = paste0(SampleInfo$SampleName)[i],
                 normalname = paste0(SampleInfo$SampleName, "_N")[i], 
                 allelecounter_exe = "alleleCounter",
                 gender = "XX", nthreads = 8,
                 genomeVersion = genome, 
                 BED_file = BED.file, probloci_file = probloci.file, ref.fasta = ref.fasta,
                 alleles.prefix = alleles.prefix, loci.prefix = loci.prefix)
chengwenxuan1997 commented 1 year ago

The BEL_20C is the only one of 22 samples which failed to get a purity estimation

tlesluyes commented 1 year ago

Can you provide the coverage values for both the tumour and the matched normal? It seems like the coverage for the tumour is good so the BAF does look good, but I'm wondering if the matched normal is a low-coverage sample so the logR is noisy because of the T/N normalisation. Can you dig into this?

Cheers,

Tom.

tlesluyes commented 1 year ago

Also, do the logR tracks for the 21 other samples look this noisy as well? The fact that ASCAT provides CNA profiles doesn't necessarily mean that they are correct: this level of noise will generate a bunch of false-positives.

chengwenxuan1997 commented 1 year ago

Do you mean the *.alleleFrequency.txt file? Because my pipeline deletes them automatically for convenient storage, I may need more time to recover these files. By the way, why the messy BAF signal looks good? There should be only one signal under the ASCAT assumption, to my understanding.

tlesluyes commented 1 year ago

Coverage can be inferred from the *.alleleFrequency.txt files but there are other tools to compute coverage from BAMs. Are we talking about a 10-20X sequencing or a 100X sequencing?

BAF looks messy but this is just because of CNAs, it looks good because imbalances can be seen and are captured by ASCAT.

Not sure what you mean with "one signal under the ASCAT assumption". One signal per what?

chengwenxuan1997 commented 1 year ago

No, the other samples all look very clear as follows. image

chengwenxuan1997 commented 1 year ago

The other bad one is BEL_1T, which has a messy logR plot and a clear BAF plot. But the pipeline succeeds in getting an estimation of this sample. BEL_1T ASPCF

chengwenxuan1997 commented 1 year ago

Sorry for not making it clear just now. The one signal under the ASCAT assumption means all B-allele frequencies perhaps should fluctuate around a specific value in one sample.

chengwenxuan1997 commented 1 year ago

So you mean the noisy logR is caused by the low coverage of the normal sample, and thus contributes to the failed estimation?

chengwenxuan1997 commented 1 year ago

The coverage of BEL_20C and BEL_20N are 141.5x and 24.02x.

tlesluyes commented 1 year ago

I can't see the tracks in the screenshot you sent a few messages ago but is the coverage for BEL_20N and BEL_1N lower than the ones for the other normal cases? My guess is that logR is noisy because the coverage in matched normal samples is quite low. Is there any sample where the matched normal also is ~25X but logR in the tumour is cristal-clear?

chengwenxuan1997 commented 1 year ago

The samples with the lowest coverage are BEL_28N (13.4, tumor 24.68), BEL_20N (24.02, tumor 141.5), BEL_1N (35.51, tumor 76.5), BEL_19N (39.11, tumor 16.3). The BEL_28T and BEL_19T are estimated successfully, their figures were as follows.

image

BEL_19T ASPCF

tlesluyes commented 1 year ago

Hi @chengwenxuan1997,

Thanks for providing additional information. Overall, the tracks look a bit noisy (at least noisier compared to what we observed with WES data from TCGA) but this is related to coverage I suppose. In the two profiles above, I'd say that the segments are correctly identified. Still, I don't know what's happening for BEL_1T and BEL_20T but the noise in logR is crazy high. I would recommend excluding these two samples (even if ASCAT does provide a CNA profile for BEL_1T). With TCGA data, we use MAPD>0.4 (see ?ascat.metrics) to define noisy samples. It also seems like the resolution for these two samples is lower compared to other samples (n_het_SNP from ascat.metrics).

Alternatively, you could try deriving logR using other methods and feed it into ASCAT or try a different approach such as ASCAT.sc.

Cheers,

Tom.

chengwenxuan1997 commented 1 year ago

Thanks very much. I will try to use ascat.metrics and ASCAT.sc in my pipeline. By the way, what could contribute to such a noisy logR? I may need to talk with our PI about focusing on or ignoring this sample. Is there any biological cause or just technical noise?

tlesluyes commented 1 year ago

Hi @chengwenxuan1997,

It can be due to various things: the sample itself (DNA degradation), poor capture, library prep, sequencing problems, low coverage, unmatched T/N, etc.

Cheers,

Tom.

chengwenxuan1997 commented 1 year ago

Alright, I got it. There are just too many uncertain factors. I really appreciate your response, it has been quite helpful to me.