Value Error: Duplicated genomic coordinates in sample set due to different transcripts

ktr0nimus commented 2 years ago

I went through systematically and generated all of the files required for CNVkit. When I run the batch pipeline with my already generated reference, I get all of the target/antitarget coverage files from the *Tumor.bam files. The process ends with no errors but does not produce any .cnr/.cns files. If I pick one sample and generate its .cnr individually with 'cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr', I get 'ValueError: Duplicated genomic coordinates in sample set:' followed by a list of coordinates. The duplicates come from different transcripts that share the same coordinates. I would like to keep those transcripts, but they are what triggers the error.

If I look in the targetcoverage files that were generated, there are duplicates such as:

chr1 778263 778638 ENST00000691293 2.008 1.00576
chr1 778263 778638 ENST00000685466 2.008 1.00576

You can see that the same coordinates and the same values are reported for two different transcripts.
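For anyone wanting to list every such duplicate rather than spotting them by eye, here is a minimal sketch in Python. It assumes the file is whitespace-delimited with chromosome, start, end and name in the first four columns, as in the rows above; the function name `find_duplicate_coordinates` is made up for illustration and is not part of CNVkit.

```python
from collections import defaultdict


def find_duplicate_coordinates(rows):
    """Return {(chrom, start, end): [names]} for regions listed more than once.

    Assumes whitespace-delimited rows with chrom, start, end, name in the
    first four columns (an assumption based on the rows shown above).
    """
    names_by_region = defaultdict(list)
    for row in rows:
        fields = row.split()
        # Skip header lines and anything without numeric start/end columns.
        if len(fields) < 4 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        region = (fields[0], int(fields[1]), int(fields[2]))
        names_by_region[region].append(fields[3])
    # Keep only the regions that appear on more than one row.
    return {r: n for r, n in names_by_region.items() if len(n) > 1}
```

Passing the lines of a targetcoverage file (for example `open("Sample.targetcoverage.cnn")`) should return one entry per duplicated region, with the transcript IDs that share it.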

What is the best way to address this and maintain all of the data?

Thank you.

tetedange13 commented 2 years ago

Hi @ktr0nimus,

According to the target section of the CNVkit documentation, you should create your reference using:

"baited genomic regions for your target capture kit, as provided by your vendor"

=> These are not supposed to contain duplicate coordinates (as they represent amplicons or capture probes).

If you absolutely want to use your specific BED (which is not what you are supposed to do if you want CNVkit to give you the best results possible):
=> I guess you will first have to group regions with identical/overlapping coordinates.
=> I think bedtools merge is the proper command to do that (with "-c 4 -o collapse" params to keep track of what was merged, if I am right).
=> You may also want to merge regions only if their coordinates are strictly identical, using the grouping feature of tools like mlr or csvtk.
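The "strictly identical coordinates" option can also be done without mlr or csvtk. Below is a minimal sketch in plain Python, assuming a tab-delimited BED file with the transcript ID in column 4. The function name `collapse_identical_regions` is hypothetical, and joining the names with commas simply mirrors what bedtools' collapse operation emits by default.

```python
from collections import OrderedDict


def collapse_identical_regions(bed_lines):
    """Collapse BED rows that share exactly the same chrom/start/end.

    Names from column 4 are joined with commas, so every transcript ID
    is still recorded on the single row that remains. Overlapping but
    non-identical regions are left untouched.
    Assumes tab-delimited input (an assumption, not checked here).
    """
    merged = OrderedDict()  # keeps regions in first-seen order
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        region = (fields[0], int(fields[1]), int(fields[2]))
        name = fields[3] if len(fields) > 3 else "-"
        names = merged.setdefault(region, [])
        if name not in names:
            names.append(name)
    return [
        "\t".join([chrom, str(start), str(end), ",".join(names)])
        for (chrom, start, end), names in merged.items()
    ]
```

The returned lines can be written to a new BED file. Since the bins change, the target/antitarget coverage files and the reference would presumably need to be regenerated from that deduplicated BED before running fix again.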

Hope this helps. Have a nice day. Felix.

ktr0nimus commented 2 years ago

This is really helpful! I am a novice and want to be "sure" about as much as I can be. Thank you!

Kavi

etal / cnvkit

Value Error: Duplicated genomic coordinates in sample set due to different transcripts #692