ccagc / QDNAseq

QDNAseq package for Bioconductor
47 stars 27 forks source link

detect <3M CNVs by adjust alpha? #36

Closed jiading closed 8 years ago

jiading commented 8 years ago

I have a small (~1.5M) CNV, which is reported by QDNAseq with deeper sequencing. Once I take the subset of the bam file, reduce the input to 1M mapped reads, I still can see at the end of Chr, there are 3 dots, but they are no longer be reported as one CNV.

2

1

I can't recover it even by change alpha to 0.0001.

I don't think it's a issue from QDNAseq, but rather a shortage of CBS algorithm. Or maybe I shall try other parameters?

Or if you have any suggestions on finding small CNVs?

ilarischeinin commented 8 years ago

Yes, this is a question about DNAcopy (the package used for segmentation) or CBS (the name of the algorithm it uses) and the associated parameters, and not really about QDNAseq.

For the amount of sequence coverage you have, 500 bp bins seem large to me. If you used e.g. 15 kbp bins, those 1.5 Mbp would contain 100 data points instead of 3. With 8M reads, I think 15 kbp bins is what I would use personally. However, your data does look very noisy (the difference in the standard deviations), so it might be too small. Still, going even to 100 bp would give you 5x as many data points.

jiading commented 8 years ago

My data is from single cell, therefore it's so noisy, and I was advised to choose larger bin.

Is there any other R package instead of CGHcall, the model always fail when I lower the bin <100k. Of course, I could also just use QDNAseq without CGHcall part.

ilarischeinin commented 8 years ago

Ah! Very interesting.

There are probably quite a few other packages for calling, but they might require some manual manipulation of the data. You could export it from QDNAseq with exportBins() and then use basically any package developed for microarray/CGH data.

But especially since you have single-cell data, I'd expect the cutoff-based calling in callBins() to work well too. So, you could just specify method="cutoff" and maybe adjust the cutoffs if needed.

(The main need for calling algorithms is that typically cancer samples vary in what percentage of them is cancer cells and what percentage is normal cells. This makes it difficult to set fixed cutoffs that can be used across samples. Hence, you need a more sophisticated algorithm to decide where what is normal and what is not. But since your data is single-cell, simple cutoffs should work much better. Of course you could still have polyploid/aneuploid with levels not where you'd expect them to be, but the calling problem is hugely simplified from the scenario of larger tissue blocks with lots of cells.)