Difference between geneLevel CN from WES (cnvkit) and geneLevel CN from snp6.0

zhouxuzhouxu commented 3 years ago

Hi, I obtained gene-level CN (copy number) with CNVkit from WES data (20 tumor samples, without matched normals). I found a clear discrepancy between the gene-level CN from WES (CNVkit) and the gene-level CN from the SNP 6.0 array. Am I using the commands incorrectly? Here's my code:

python3 cnvkit.py batch \
    ${sample}.rmdup.bam \
    --reference ${dataCNVkit}/FlatReference.cnn \
    --output-dir ${dataCNVkit}/access_5k_Map/${sample}/

python3 ${cnvkit}cnvkit.py genemetrics \
    ${sample}.rmdup.cnr \
    -s ${sample}.rmdup.cns -t 0 -m 0 -y \
    -o ${sample}.geneLevel.cns

python3 ${cnvkit}cnvkit.py call \
    ${sample}.geneLevel.cns \
    -o ${sample}.geneLevel.call.cns

tetedange13 commented 3 years ago

Hi @zhouxuzhouxu,

I'm not an author of CNVkit, but could you please specify what kind of "discrepancies" you are facing? => Are the copy number values lower than expected from your array data? Higher? Or both? => Is it on one particular gene, or on several? => And what is the magnitude of the discrepancies? Are we talking about twice the expected value? More? Less?

I see in your commands (thanks for detailing them, BTW) that you used the simplest parameters for your call step. => If you haven't yet, I advise you to read CNVkit's documentation about Tumor Analysis and the details about cnvkit.py call. => To sum up, the call subcommand has several parameters for adjusting copy ratio values when you know, for example, your tumor purity.

Hope this helps. Have a nice day. Felix.
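To illustrate the purity adjustment Felix refers to, here is a minimal Python sketch of the usual mixture arithmetic. It is not CNVkit's implementation. The function name `rescale_log2_for_purity` is hypothetical, and the sketch assumes that the contaminating normal cells carry `ploidy` copies and that the log2 ratio is measured relative to that same `ploidy`.

```python
def rescale_log2_for_purity(log2_ratio, purity, ploidy=2):
    """Estimate tumor-cell copy number from an observed log2 ratio.

    Assumption (not taken from CNVkit): the sample is a mixture of tumor
    cells (fraction `purity`) and normal cells with `ploidy` copies, and
    the log2 ratio is expressed relative to `ploidy` copies.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    # Average number of copies per cell observed in the mixed sample.
    observed_copies = ploidy * 2 ** log2_ratio
    # Subtract the contribution of normal cells, then rescale to tumor cells.
    tumor_copies = (observed_copies - (1 - purity) * ploidy) / purity
    # Noise can push the estimate slightly below zero; clamp it.
    return max(tumor_copies, 0.0)
```

With `purity=1` (a pure sample) the function simply converts the log2 ratio to copies, so any adjustment only matters when normal contamination is present.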

zhouxuzhouxu commented 3 years ago

Hi Felix, Thanks for your quick response. Yes, I used the simplest parameters for the call step, as shown above, and obtained copy numbers for more than 20,000 genes in 30 tumor cell lines without matched normals. Compared with the array-based copy numbers for the same cell lines, the proportion of genes with the same copy number (number of genes with the same copy number / number of overlapping genes) was more than 80% in half of the cell lines, but less than 5% in the other half. These results suggest that the calls are not stable. I have also read the documentation and still do not know what to do. Do you have any suggestions? In addition, would having no control samples affect the final result? Thank you for your time and consideration.

Best, XuZhou
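For reference, the concordance measure described above (genes with the same copy number divided by overlapping genes) can be sketched in Python as follows. The function name `gene_cn_concordance` and the dictionary inputs mapping gene symbol to integer copy number are assumptions made for illustration; the poster's actual script is not shown in the thread.

```python
def gene_cn_concordance(wes_cn, array_cn):
    """Return (fraction of overlapping genes with identical CN, overlap size).

    Both arguments are hypothetical dicts mapping gene symbol -> integer
    copy number, one from the WES calls and one from the array calls.
    """
    # Only genes present in both call sets can be compared.
    shared = sorted(set(wes_cn) & set(array_cn))
    if not shared:
        raise ValueError("no overlapping genes between the two call sets")
    matching = sum(1 for gene in shared if wes_cn[gene] == array_cn[gene])
    return matching / len(shared), len(shared)
```

Because this metric requires exact integer agreement, a uniform shift of one copy across the genome would drive it close to zero even when the two profiles have the same shape.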

zhouxuzhouxu commented 3 years ago

Hi @zhouxuzhouxu,

First, could you please edit your last response to make it more "normal" (and more readable)? => I think you should remove the "`" characters you added. Simply write in plain text; it should read better.

Best, Felix

Hi Felix, I've removed the space at the beginning of the line. The format is normal now.

tskir commented 3 years ago

Hi @zhouxuzhouxu! In addition to what @tetedange13 already said:

CNVkit performs best on matched tumour-normal data. The normal data are used to accurately factor in baseline coverage variance. The flat reference can be thought of as a last-resort method, which will indeed perform significantly worse than a reference built from normal samples.

CN estimation using microarrays has its own caveats and problems. So what may be happening is that you're comparing two relatively noisy methods of analysis. However, even in this case I wouldn't expect concordance as poor as having <5% of genes with matching copy numbers. So I'm willing to investigate this further.

Let's try a few things:

1. I notice that your filenames contain rmdup, suggesting they went through a duplicate removal step. This can sometimes interfere with depth estimation. Could you try re-running the whole workflow without duplicate removal and see if it improves the results? While we're at it, you also shouldn't filter the reads by mapping quality, because that can sometimes degrade the results as well.
2. Since you have the results for 30 cell lines, it would be extremely helpful if you could draw scatter plots or heatmaps for each of them, showing how the copy number detected by CNVkit relates to the one detected by microarray. If you can't plot all 30, please provide at least one good example (where CNVkit and microarray more or less match) and one bad example (<5% concordance). Based on that, we can see whether there is some systematic bias at play, or whether the results are indeed completely uncorrelated.
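Alongside the plots, the question of systematic bias versus uncorrelated results can be checked numerically. The sketch below is one possible approach, not part of CNVkit: the function name `compare_gene_cn` and the dictionary inputs (gene symbol to copy number) are hypothetical. It reports the Pearson correlation between the two call sets and the median array-minus-WES difference over the shared genes.

```python
import math
from statistics import mean, median

def compare_gene_cn(wes_cn, array_cn):
    """Return (Pearson r, median of array CN minus WES CN) over shared genes.

    Inputs are hypothetical dicts mapping gene symbol -> copy number.
    """
    shared = sorted(set(wes_cn) & set(array_cn))
    if len(shared) < 2:
        raise ValueError("need at least two overlapping genes")
    x = [float(wes_cn[g]) for g in shared]
    y = [float(array_cn[g]) for g in shared]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    # Correlation is undefined when one call set is constant.
    if var_x == 0 or var_y == 0:
        r = float("nan")
    else:
        r = cov / math.sqrt(var_x * var_y)
    offset = median([b - a for a, b in zip(x, y)])
    return r, offset
```

A high correlation combined with a nonzero median offset would point to a consistent shift between the two methods, whereas a correlation near zero would suggest the profiles genuinely disagree.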

tskir commented 3 years ago

Hi @zhouxuzhouxu, just checking in to see if you were able to try some of the suggestions from my previous comment, or if you have any further questions. I'll be happy to help; this issue indeed looks like something worth investigating.

zhouxuzhouxu commented 3 years ago

Hi tskir, Thanks for your quick response. I tried your advice.

1. I repeated the CNVkit workflow using the sorted BAM files instead of the rmdup BAM files. However, there was little change in the results, so the difference between CNVkit and the array still exists in some samples.
2. I drew a scatter plot of the copy numbers of the same genes from CNVkit and from the array. In the inconsistent samples, the copy number of a given gene is larger in the array data than in the CNVkit results. I have no idea why. Could you give me some advice?

Best, Xu Zhou

etal / cnvkit

Difference between geneLevel CN from WES (cnvkit) and geneLevel CN from snp6.0 #630