Allow precomputed baseline for single sample pipeline

wudustan commented 1 month ago

pipelineCNA() calculates a synthetic baseline per sample. This becomes a problem with a large dataset that has to be run one sample at a time for computational reasons, because each sample then gets a different baseline. Ideally there would be a way to generate one baseline for the whole dataset and then use it for the subtraction in each sample separately.

Relevant code:

if (length(norm_cell_names) < 1) {
  print("7) Measuring baselines (pure tumor - synthetic normal cells)")
  count_mtx_relat <- removeSyntheticBaseline(count_mtx, par_cores = par_cores)
} else {
  print("7) Measuring baselines (confident normal cells)")
  if (length(norm_cell_names) == 1) {
    basel <- count_mtx[, which(colnames(count_mtx) %in% norm_cell_names)]
  } else {
    basel <- apply(count_mtx[, which(colnames(count_mtx) %in% norm_cell_names)], 1, median)
  }
  count_mtx_relat <- count_mtx - basel
}
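
As a rough sketch of what the requested option could look like, the helper below mirrors the branch above but first checks for a user-supplied per-gene baseline. `subtract_baseline`, `precomputed_baseline`, and `synthetic_fun` are hypothetical names and not part of SCEVAN; the synthetic-baseline step is passed in as a function so the sketch runs without the package.

```r
# Hypothetical helper, not part of SCEVAN: same logic as the branch above,
# plus an optional precomputed per-gene baseline (a named numeric vector).
subtract_baseline <- function(count_mtx, norm_cell_names = character(0),
                              precomputed_baseline = NULL,
                              synthetic_fun = NULL) {
  if (!is.null(precomputed_baseline)) {
    # Align the baseline to the matrix rows by gene name.
    missing_genes <- setdiff(rownames(count_mtx), names(precomputed_baseline))
    if (length(missing_genes) > 0) {
      stop("Baseline is missing ", length(missing_genes), " gene(s)")
    }
    basel <- precomputed_baseline[rownames(count_mtx)]
    return(count_mtx - basel)
  }
  if (length(norm_cell_names) < 1) {
    # Assumption: synthetic_fun stands in for removeSyntheticBaseline().
    if (is.null(synthetic_fun)) stop("No baseline source available")
    return(synthetic_fun(count_mtx))
  }
  norm_cols <- colnames(count_mtx) %in% norm_cell_names
  # drop = FALSE keeps a one-column matrix, so the median also works for one cell.
  basel <- apply(count_mtx[, norm_cols, drop = FALSE], 1, median)
  count_mtx - basel
}

# Toy example: 3 genes x 2 cells.
mtx <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("c1", "c2")))
baseline <- c(g3 = 3, g1 = 1, g2 = 2)  # deliberately in a different gene order
relat <- subtract_baseline(mtx, precomputed_baseline = baseline)
```

Matching by gene name rather than by position is a deliberate choice here: a baseline built from the whole dataset will not necessarily have the same gene set or ordering as a single library after filtering.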

wudustan commented 1 month ago

Additionally, large datasets stall the script because of the rasterisation of the heatmap.

AntonioDeFalco commented 1 month ago

Hi @wudustan, why do you need to analyze a large dataset of multiple samples? The suggestion is to examine one sample at a time for more accurate results; only if you have several samples from the same patient could you analyze them together. The reasoning is the same as for CNV analysis from bulk data: using a matched normal versus using a Panel of Normals (PoN) created from multiple healthy tissue samples. You could use multiSampleComparisonClonalCN to compare multiple samples.

Thanks

wudustan commented 1 month ago

Thanks for replying @AntonioDeFalco

I have a large experiment (>20 libraries: 4 timepoints +/- drug) in which a cancer stem cell culture was treated with a high-dose drug over a long period to generate resistant cells. All the cells in the experiment are malignant, and I want a subclonal analysis to see whether specific subclones develop and persist over time. However, because of the way pipelineCNA() works, the algorithm finds ~5 'normal' cells in the sample and then gives me a garbled subclone analysis.

Conceptually, deriving an artificial baseline separately for each 100% tumour sample is also problematic, since individual samples will have different amounts and types of CNA events. If I could pre-calculate a baseline from the whole dataset and then use it for the clonal analysis of each library separately, I would get a more consistent result.
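
To make the idea concrete, here is a minimal sketch of a dataset-wide baseline, assuming each library is available as a genes x cells matrix taken at the same processing stage as `count_mtx` in the snippet above. `compute_global_baseline` is a hypothetical name, not a SCEVAN function, and the per-gene median across all pooled cells is just one possible choice of baseline.

```r
# Hypothetical helper: per-gene median across all cells of all libraries.
# Assumes every matrix is genes x cells and was processed the same way.
compute_global_baseline <- function(mtx_list) {
  # Only genes present in every library can get a shared baseline.
  shared_genes <- Reduce(intersect, lapply(mtx_list, rownames))
  pooled <- do.call(cbind, lapply(mtx_list, function(m) {
    m[shared_genes, , drop = FALSE]
  }))
  apply(pooled, 1, median)
}

# Toy data: two small libraries with partly overlapping genes.
lib_a <- matrix(c(1, 2, 3, 5), nrow = 2,
                dimnames = list(c("g1", "g2"), c("a1", "a2")))
lib_b <- matrix(c(3, 4, 9, 9), nrow = 2,
                dimnames = list(c("g2", "g3"), c("b1", "b2")))
libs <- list(lib_a = lib_a, lib_b = lib_b)

baseline <- compute_global_baseline(libs)

# Subtract the same baseline from each library separately.
relat_list <- lapply(libs, function(m) {
  m[names(baseline), , drop = FALSE] - baseline
})
```

Two caveats with this sketch: a pooled median weights larger libraries more heavily, and in an all-tumour culture any alteration shared by most cells would be absorbed into the baseline, so the resulting profiles would be relative to the dominant state of the population rather than to a true diploid reference.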

I previously ran the analysis on single libraries, but from the heatmap the script generates I can tell that the clustering and clonal calling aren't correct. Because the pipeline is one giant wrapper script, it is hard to modify. I can't pass it a vector of normal cells for norm_cell because there aren't any.

wudustan commented 1 month ago

@AntonioDeFalco do you have any advice for how to proceed?

AntonioDeFalco / SCEVAN

Allow precomputed baseline for single sample pipeline #124