caravagnalab / CNAqc

CNAqc - Copy Number Alteration (CNA) Quality Check package
GNU General Public License v3.0
17 stars 8 forks source link

Potentially overestimating Clonal (90-100% CCF) Mutations #5

Closed ahorn720 closed 3 years ago

ahorn720 commented 3 years ago

Hi there!

In summary, Im estimating CCF values and I've noticed that some mutations which I would have expected to be sub clonal (CCF<~75%) are showing up around ~90-100%. I wonder if you could explain to me why you think this is happening. I've included my basic code I've been using to estimate which mutations are likely clonal/subclonal too (which im happy to discuss or adjust with you too).

I hope you can see that even some of the sub clonal mutations showing up around 1.00 CCF have very low VAFs even after adjusting for purity which is why im very surprised they have such high CCF values.

I'll share an example output and email you an .rds for you too examine:

library(CNAqc)
#> ✓ Loading CNAqc, 'Copy Number Alteration quality check'. Support : <https://caravagn.github.io/CNAqc/>

# save(snvs,cnvs,purity,file = "~/Desktop/cnaqc_help.rds",compress = T)
load("~/Desktop/cnaqc_help.rds",verbose = T)
#> Loading objects:
#>   snvs
#>   cnvs
#>   purity

x = init(snvs = snvs,
         cna = cnvs,
         purity = purity,
         ref = "hg38")
#>  [ CNAqc - CNA Quality Check ]
#> ℹ Using reference genome coordinates for: hg38.
#> Warning in fortify_mutation_calls(snvs): You are using indels mutation data,
#> just beware that indels count-values aree less reliable than SNVs ones ....
#> Warning in enforce_numeric(x$NV): [CNAqc] Enforcing numeric for values: 5, 5, 5,
#> 8, 5, 3, ...
#> ! Missing CCF column from CNA calls, adding CCF = 1 assuming clonal CNA calls.
#> ! Missing segments length from CNA calls, adding it to CNA calls.
#> ! CNA chromosomes should be in the format 'chr*', I will add a 'chr' prefix.
#> ℹ Input n = 11166 mutations for 887 CNA segments (887 clonal, 0 subclonal)
#> x n = 42 mutations  cannot be mapped to segments and will be removed.
#> ✓ Mapped n = 11124 mutations to clonal segments (100% of input)
snvs.samp = x$snvs
pure.samp = x$purity
cna.samp = x$cna

#Calculate CCF
ccf = compute_CCF(x)
#> ── Computing mutation multiplicity for single-copy karyotype 1:0 ───────────────
#> ── Computing mutation multiplicity for single-copy karyotype 1:1 ───────────────
#> ── Computing mutation multiplicity for karyotype 2:0 using the entropy method. ─
#> ℹ Expected Binomial peak(s) for these calls (1 and 2 copies): 0.2805 and 0.561
#> ℹ Mixing pre/ post aneuploidy: 0.31 and 0.69
#> ℹ Not assignamble area: [0.321428571428571; 0.5]
#> ── Computing mutation multiplicity for karyotype 2:1 using the entropy method. ─
#> ℹ Expected Binomial peak(s) for these calls (1 and 2 copies): 0.219055056618508 and 0.438110113237017
#> ℹ Mixing pre/ post aneuploidy: 0.39 and 0.61
#> ℹ Not assignamble area: [0.231884057971014; 0.391304347826087]
#> ── Computing mutation multiplicity for karyotype 2:2 using the entropy method. ─
#> ℹ Expected Binomial peak(s) for these calls (1 and 2 copies): 0.179692504804612 and 0.359385009609225
#> ℹ Mixing pre/ post aneuploidy: 0.24 and 0.76
#> ℹ Not assignamble area: [0.177777777777778; 0.311111111111111]
#> 
#> ── Summary CCF assignments. (>10% NAs: not assignable with confidence) ──
#> 
#> # A tibble: 5 x 6
#>   karyotype     N Unknown p_Unkown QC    method 
#>   <chr>     <int>   <dbl>    <dbl> <chr> <chr>  
#> 1 1:0         896       0   0      PASS  ENTROPY
#> 2 1:1        3610       0   0      PASS  ENTROPY
#> 3 2:0        1916      97   0.0506 PASS  ENTROPY
#> 4 2:1        2686     310   0.115  FAIL  ENTROPY
#> 5 2:2        1184      60   0.0507 PASS  ENTROPY
ccf.data = CCF(ccf)
ccf.data$CCF[ccf.data$CCF>=1] = 1

ccf.data = ccf.data%>%
  select(-len.ref,-len.alt,- abs.diff,-to, -type,-segment_id)%>%
  rename(altc = NV)%>%
  mutate(adj.DP = round(DP*pure.samp[1,1]))

# Collect present mutations
ccfs.allSamps = ccf.data%>%
  filter(CCF>0)%>%
  mutate(present = "present")

# Determine Clonality
# Per mutation, Model the binomial distribution of variant reads over purity-adjusted depth.
# We expect that "clonal" mutations for 
# karyotype 1:1 and 2:2 would have at least 50% purity-adjusted depth
# karyotype 1:0 and 2:0 woul have at least 100% purity-adjusted depth
# karyotype 2:1 would either have at least 33% or 66% of the purity-adjusted depth 
# Determine if the number of altc reads is within the standard deviant of the mean of the Binomial dist.
is.mut.clonal = apply(ccfs.allSamps,1,function(x){
  # x = ccfs.allSamps[6351,]
  altc = as.numeric(x["altc"])
  adj.depth = as.numeric(x["adj.DP"])
  ploidy = as.character(x["karyotype"])

  if(ploidy %in% c("1:1","2:2")){
    p = .5
  }else if(ploidy %in% c("1:0", "2:0")){
    p = 1
  }else if(ploidy %in% c("2:1")){
    p = ifelse(altc<(.5*adj.depth),.33,.66) 
  }
  # Is altc within the Standard Deviation of the mean of the Binomial distribution
  n = adj.depth
  mean = n*p
  sd = sqrt(n*p*(1-p))
  lowend = floor(mean-sd)
  dist.to.binom.sd = altc-lowend #negative means outside SD. 0 and positive means inside SD

  return(dist.to.binom.sd)
})

# If is.mut.clonal is negative means number of altc reads is outside SD, "subclonal".
# If is.mut.clonal is 0 or positive means number of altc reads is inside SD, "clonal".
ccfs.allSamps = data.frame(ccfs.allSamps,as.numeric(is.mut.clonal))%>%
  mutate(clonal = if_else(is.mut.clonal>=0,"clonal","subclonal"))

ggplot(ccfs.allSamps,aes(CCF,fill = clonal))+
  geom_histogram(position = "identity",alpha = .5)+
  facet_wrap(Samples~.)+
  ylim(c(0,1000))+
  scale_x_continuous(breaks = seq.int(from = 0,to = 1,by = .25))+
  scale_fill_viridis_d(alpha = .5,end = .7)+
  ggtitle("Cancer Cell Fraction",
          subtitle = "Present Mutations: CCF>0 & altc>=3 & Depth >=10\n\"Clonal\": Number of variant reads (altc) is within SD of Binomial distribution \nB(purity-adjusted Depth,p)")+
  theme_minimal(10)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 1 rows containing missing values (geom_bar).

Created on 2020-12-17 by the reprex package (v0.3.0.9001)

caravagn commented 3 years ago

Hi, sorry for the late response. I think the main problem with these calls is that the CNA segments ploidy looks very weird, and should be checked. CCF computations should be done only on validated segments.

To make it clear, try running

x = analyze_peaks(x)
plot_peaks_analysis(x)

and compare ti what is shown here.

caravagn commented 3 years ago

Hi, I am closing this unless we need to keep this open. Allright?