ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

river output CNV ploidy #77

Open guillaume-rs opened 3 years ago

guillaume-rs commented 3 years ago

Hi, sorry, it's me again with another question... On the river plot I see CNVs labelled like "1Mbp 27AB". What does 27AB mean? Are there 27 copies each of alleles A and B (which seems unlikely)? Thanks,

Guillaume

ChristofferFlensburg commented 3 years ago

It means 27 copies of A and one copy of B, so it's shorthand for AAAAAAAAAAAAAAAAAAAAAAAAAB (if I got that right). I realise now it's ambiguous, but I'm not sure how to make it clearer. 🤔
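To make the notation concrete, here is a minimal Python sketch that expands a label like "27AB" into allele counts. It assumes a leading number applies only to the allele letter right after it, as described above. The function name `parse_cn_label` is made up for illustration and is not part of superFreq.

```python
import re

def parse_cn_label(label: str) -> dict:
    """Expand a compact label such as '27AB' into per-allele copy counts.

    Assumption: each token is an optional repeat count followed by one
    allele letter, so '27AB' reads as 27 x A plus 1 x B.
    """
    counts = {}
    for repeat, allele in re.findall(r"(\d*)([A-Z])", label):
        counts[allele] = counts.get(allele, 0) + (int(repeat) if repeat else 1)
    return counts

print(parse_cn_label("27AB"))  # {'A': 27, 'B': 1}
print(parse_cn_label("AAB"))   # {'A': 2, 'B': 1}
```

Read this way, "27AB" is 28 copies in total, not 54.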

Read depth is compared to reference normals, so if you believe it's not a real call, might be worth opening the test sample and a couple reference normals in IGV and have a look at what is going on. We've had cases where the references had deletions (in HLA or IG regions for example) which caused small false amplification calls in the test samples.

Otherwise, go have a look in plots/myIndividual/data/CNAsegments*.tsv and it'll tell you which genes (and which COSMIC census genes, in the last column) are in the segment. Never thought I'd say this, but that file actually reads pretty well in Excel. That could help you tell if it's a real amplification of an oncogene.
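If you'd rather not open the file in a spreadsheet, a small Python sketch like the one below can pull out the last column for every segment. It assumes the files are tab-separated with a header row. The only thing taken from this thread is that the COSMIC census genes sit in the last column, so the code indexes by position and does not guess a column name. The function name is hypothetical.

```python
import csv
import glob

def census_genes_by_segment(pattern="plots/myIndividual/data/CNAsegments*.tsv"):
    """Return (file, row_as_dict, last_column_value) for every segment row.

    Assumption: each file is tab-separated and starts with a header line.
    """
    results = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            header = next(reader, None)
            if header is None:  # empty file
                continue
            for row in reader:
                if row:
                    results.append((path, dict(zip(header, row)), row[-1]))
    return results

if __name__ == "__main__":
    for path, segment, census in census_genes_by_segment():
        if census:  # only print segments where the last column is non-empty
            print(path, census)
```

Replace `myIndividual` in the pattern with the individual name used in your own run.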

guillaume-rs commented 3 years ago

Thank you for your explanation 👍

Have you ever encountered such an "extreme" amplification event? (I've got up to 56AB)

Looking at these positions in IGV, I found that my normal bams have a lot of reads mapping to intergenic and intronic regions (which is unexpected for exome seq), while my normal has very few. Could this induce false-positive amplification calls?

ChristofferFlensburg commented 3 years ago

Yeah it does happen that you get hundreds of copies of oncogenes sometimes. It probably happens more often that QC issues give you false calls though. :P

The read counts go up to 300bp away from the exons, so high levels of intronic/intergenic reads can confuse the counts and lead to issues like this. If that is present across the entire genome, might be a problem with the exome capture... Might be worth talking to the biologists that generated the data if you think that is the case.
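As a rough illustration of the point above, here is a toy Python model. It is not superFreq's counting code: it simply counts read positions inside an exon padded by 300bp, compares a clean sample with one that has an extra uniform background of off-target reads, and ignores library-size normalisation. All names and numbers are invented for the example.

```python
def padded_count(read_positions, exon_start, exon_end, pad=300):
    """Count reads whose position falls within the exon plus `pad` bp each side."""
    lo, hi = exon_start - pad, exon_end + pad
    return sum(1 for pos in read_positions if lo <= pos <= hi)

def toy_ratio(exon_length, background_spacing=10):
    """Test/normal count ratio for one exon when the test has off-target reads."""
    exon_start, exon_end = 1000, 1000 + exon_length
    # Dense on-target reads, present in both samples.
    on_target = list(range(exon_start, exon_end, 2))
    # Sparse reads spread over the whole region, present only in the test sample.
    background = list(range(0, exon_end + 1000, background_spacing))
    normal = padded_count(on_target, exon_start, exon_end)
    test = padded_count(on_target + background, exon_start, exon_end)
    return test / normal

print(toy_ratio(200))   # short exon: the padding holds many background reads
print(toy_ratio(1000))  # long exon: same padding, smaller relative effect
```

In this toy setup the inflation is larger for the short exon than for the long one, so the background does not act as one uniform scaling factor across regions.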

guillaume-rs commented 3 years ago

Ok, it's reassuring to know that such events could be normal :) In my bam files I've got high read counts even far from exons (between 10x and 20x on average). Would it make sense to filter the bam files to keep only reads mapping to exons, and give those as input to superFreq?

ChristofferFlensburg commented 3 years ago

If you have 10-20x coverage across the entire genome with extra bumps over the exons, then it seems like you've got sequencing data from both a genome and an exome. I can't QC your data, but I'd double check the meta data, and maybe check with the data generators as well. You can try filtering, but if the reference normals are clean exomes, you will likely still get noisy results from the copy number calling.
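For anyone who wants to try the filtering, here is a minimal Python sketch. It pads and merges exon intervals, then builds a `samtools view -L` command that keeps only alignments overlapping a BED file. The file names are placeholders, the padding is an open choice (0 keeps strictly exon-overlapping reads, 300 would match the counting window mentioned earlier in the thread), and nothing here is part of superFreq.

```python
import shlex

def pad_intervals(exons, pad=0):
    """Pad (chrom, start, end) intervals by `pad` bp and merge any overlaps."""
    padded = sorted((chrom, max(0, start - pad), end + pad)
                    for chrom, start, end in exons)
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval on the same chromosome: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def samtools_filter_command(in_bam, bed_path, out_bam):
    """Build a samtools call that keeps alignments overlapping the BED regions."""
    return ["samtools", "view", "-b", "-L", bed_path, "-o", out_bam, in_bam]

# Hypothetical exon coordinates and file names, for illustration only.
regions = pad_intervals([("chr1", 1000, 1200), ("chr1", 1300, 1500)], pad=100)
print(regions)
print(shlex.join(samtools_filter_command("sample.bam", "exons.bed", "sample.exons.bam")))
```

The filtered bam needs indexing again (`samtools index`) before use. As noted above, filtering alone may not remove the noise if the reference normals come from a different kind of library.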

guillaume-rs commented 3 years ago

Thank you for your advice! I will try to filter the bam to see if it affects this issue.

I've noticed that increasing the systematicVariance parameter decreases the number of unexpected CNVs detected.

I'm not sure how much I can increase it without impairing the clone prediction too much. Is there a maximum value you would recommend not exceeding?

ChristofferFlensburg commented 3 years ago

I think the exome default is 0.02 and the RNA-Seq default is 0.1. So I think up to 0.1 is fine and you should still have decent sensitivity. If there's a lot of noise, you might try up to 0.2.

But yeah, personally I would get to the bottom of what is going on with all the off-target reads, or I wouldn't trust any of the results.