etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Building a (frankenstein) reference from tumour reference #223

Closed jgrady-omico closed 7 years ago

jgrady-omico commented 7 years ago

I have a large collection of tumour samples that I'd like to build a reference from. Every sample has large-scale copy number changes in at least one chromosome, and often in many, so I can't just pick out a few well-behaved tumour BAMs and use them as the reference.

What I would like to do is a first-pass analysis using CNVkit, identify normal-ploidy chromosomes with no structural changes, and then use just those regions from each of the samples to build the reference file.

I'm not sure that CNVkit will be happy building a reference file in this way though - it would have to be built chromosome by chromosome, rather than genome wide. I haven't delved into the code to find out if this would be feasible, but I suspect it's not built to do this.

Do you have any insight on a good way to do this?

etal commented 7 years ago

Yes, CNVkit should tolerate this approach. If you construct each chromosome separately, just concatenate the per-chromosome reference .cnn files to create an all-chromosome reference .cnn at the end. You may see some warnings about the absence of an "X" or "chrX" chromosome, but if you use a complete reference .cnn for downstream processing it should handle the sex chromosomes OK.
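The concatenation step can be sketched in Python as below. This is only a minimal illustration, assuming each per-chromosome reference .cnn is a tab-separated table with one header row and that all files share the same columns. The function name `concat_cnn` and the file names in the usage note are hypothetical and not part of CNVkit.

```python
import csv


def concat_cnn(paths, out_path):
    """Concatenate tab-separated .cnn tables, keeping a single header row.

    Assumes every input has the same header; raises ValueError otherwise.
    Rows are written in the order the paths are given.
    """
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in paths:
            with open(path, newline="") as handle:
                reader = csv.reader(handle, delimiter="\t")
                file_header = next(reader)
                if header is None:
                    # First file: remember its header and write it once.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path}: columns differ from the first file")
                for row in reader:
                    if row:  # skip blank lines
                        writer.writerow(row)
```

For example, `concat_cnn(["ref.chr1.cnn", "ref.chr2.cnn"], "ref.all.cnn")` would write one combined table. Pass the paths in the chromosome order you want in the final file, since no sorting is done here.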

Note that when the reference is constructed from a pool of samples, there is some effort to detect and remove outlier datapoints, which can be non-recurrent CNVs in the input samples. You could try building a pooled reference with no extra steps, and plot it to see if the input CNVs had any disruptive effect after this automatic filtering.

jgrady-omico commented 7 years ago

Thanks for the response, that’s encouraging.

The main reason we haven't used just the raw tumour BAMs to construct the reference is that we have a number of frequently recurrent CNVs in key genes. We expect the number of 'outliers' at some of these to be high enough to cause miscalling for those genes. We'll certainly try just pooling the tumours in any case and take a look.

If we do construct the reference per chromosome, I'm not sure how it will deal with constructing the X chromosome if we supply only this and no autosomes. I presume it won't be able to determine sex? Or perhaps I've misunderstood how the sex determination works. My suspicion is that we'll have to construct the sex chromosomes from a single-sex cohort, is that right?

etal commented 7 years ago

If you're willing to do some scripting, it's not a terrible idea to first identify the likely CNVs in each tumor sample, then go back to each sample's .cnn files and set the log2 values to 0 where each predicted CNV occurs. Then those edited .cnn copies can be used to construct a reference with better behavior.
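A rough sketch of that scripting step, in Python, might look like the following. It assumes the first-pass segments are in a tab-separated .cns file and the per-sample coverage is in a tab-separated .cnn file, both with `chromosome`, `start`, `end`, and `log2` columns and half-open coordinates. The function names and the 0.2 cutoff for calling a segment a CNV are hypothetical choices for illustration, not CNVkit defaults.

```python
import csv


def load_cnv_segments(cns_path, min_abs_log2=0.2):
    """Return (chromosome, start, end) for segments whose |log2| meets the cutoff.

    The cutoff is an assumed, illustrative threshold; tune it to your data.
    """
    with open(cns_path, newline="") as handle:
        return [
            (row["chromosome"], int(row["start"]), int(row["end"]))
            for row in csv.DictReader(handle, delimiter="\t")
            if abs(float(row["log2"])) >= min_abs_log2
        ]


def mask_cnv_bins(cnn_path, out_path, cnv_segments):
    """Copy a .cnn file, setting log2 to 0 in bins that overlap a CNV segment.

    Only the log2 column is changed; all other columns are copied as-is.
    Returns the number of bins that were masked.
    """
    masked = 0
    with open(cnn_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            start, end = int(row["start"]), int(row["end"])
            # Half-open interval overlap test on the same chromosome.
            overlaps = any(
                chrom == row["chromosome"] and seg_start < end and start < seg_end
                for chrom, seg_start, seg_end in cnv_segments
            )
            if overlaps:
                row["log2"] = "0"
                masked += 1
            writer.writerow(row)
    return masked
```

The linear scan over segments is fine for a few hundred segments per sample. The edited copies are written to new files so the original .cnn files stay untouched.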

If you do it that way, then the X chromosomes will mostly be OK, though check the printed inferences versus your own records. If there's a problem you can correct it with -x.

If you construct the reference per-chromosome, CNVkit will whine about missing chrX, but it should be fine if you omit the -y option and use a female reference throughout your pipeline. Inferring chromosomal sex will be slightly less reliable since CNVkit isn't able to look at X and Y simultaneously, but it might be OK anyway, and/or you can still use -x to declare the sample's sex and skip inference.