Create a control dataset using normal samples: What file type and how many samples?

moldach commented 4 years ago

1). What type of files need to go in the folder for normal samples, .bam or .fastq files?

2). How many unmatched normal samples should be used?

rsteinfe1 commented 4 years ago

Hi moldach. I was able to generate the normal PoN by pointing to the output of bam2roi.r. Just make sure that you use only the *_roiSummary.txt files generated from normal bam files. Generally, when you call somatic CNVs, the more normal samples you have that match your tumor sequencing and mapping protocol, the better. GATK has good recommendations: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-#:~:text=Panel%20of%20Normals%20(PON)%20Follow&text=A%20Panel%20of%20Normal%20or,used%20in%20somatic%20variant%20analysis.&text=It's%20very%20important%20to%20use,sequencing%20technology%20and%20so%20on).
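To make the "only normal ROI summaries" step concrete, here is a minimal Python sketch of gathering those files into one folder. It is not part of CNV Radar: the function name `collect_normal_roi_summaries`, the directory layout, and the assumption that each file is named `<sample_id>_roiSummary.txt` are all hypothetical, so adjust them to whatever bam2roi.r actually wrote for your samples.

```python
from pathlib import Path
import shutil


def collect_normal_roi_summaries(roi_dir, normal_ids, dest_dir,
                                 suffix="_roiSummary.txt"):
    """Copy the ROI summaries of the listed normal samples into dest_dir.

    Assumption (hypothetical): each summary is named <sample_id> + suffix.
    Returns the sorted list of copied file names.
    """
    roi_dir, dest_dir = Path(roi_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for sample_id in sorted(normal_ids):
        src = roi_dir / f"{sample_id}{suffix}"
        if not src.is_file():
            # Fail loudly instead of silently building a smaller panel.
            raise FileNotFoundError(f"No ROI summary for normal sample: {src}")
        shutil.copy2(src, dest_dir / src.name)
        copied.append(src.name)
    return copied
```

Listing the normal sample IDs explicitly, rather than globbing the whole output folder, keeps a tumor sample's summary from slipping into the panel by accident.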

I would think that similar guidelines apply to CNV Radar. I don't know, however, how CNV Radar handles sex chromosomes. From the code I would assume that they are skipped. So potentially, you might not have to split your normals by gender.

jkstrat commented 3 years ago

Hi moldach,

I'll echo what rsteinfe1 has said. The purpose of the reference panel is to correct for bias that is introduced during the processing of samples. Targeted sequencing introduces sources of bias that affect the observed read depth for a sample. Several of these sources are known (for example, the affinity of the capture probes), but there are also several unknown ones. Normalizing the read depths against a reference set (samples believed to be copy number neutral) allows for more accurate CNV calling. The reference set provides a baseline copy-neutral read depth, which is used to judge whether the read depth at a given position in the target sample is also copy neutral or reflects a copy number variant. The general rule is to keep the reference cohort as close as possible to the target cohort in terms of sample origin and processing. We recommend this approach because any bias in the read depths, from both known and unknown sources, is then adjusted for in the model.
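As a rough illustration of the idea of normalizing against a reference set, the sketch below computes per-region log2 ratios of a target sample against the median of a normal panel. This is a generic depth-ratio approach under simplifying assumptions, not CNV Radar's actual model; the function name `log2_depth_ratios` and the input layout (one list of per-region depths per sample, all in the same region order) are hypothetical.

```python
import math
from statistics import median


def log2_depth_ratios(target_depths, normal_panel):
    """Return per-region log2(target / panel baseline) values.

    target_depths: read depth per region for the target sample.
    normal_panel:  one list of per-region depths per normal sample,
                   in the same region order as target_depths.
    Each sample is first divided by its own median depth so that
    differences in overall sequencing depth cancel out.
    """
    def scale(depths):
        m = median(depths)
        return [d / m for d in depths]

    target = scale(target_depths)
    normals = [scale(sample) for sample in normal_panel]
    # Baseline = median scaled depth across the panel, region by region.
    baseline = [median(region) for region in zip(*normals)]
    # Regions with zero depth cannot be log-transformed; mark them as NaN.
    return [math.log2(t / b) if t > 0 and b > 0 else float("nan")
            for t, b in zip(target, baseline)]
```

A region that every normal captures poorly gets a low baseline, so the same low depth in the target yields a ratio near 0 rather than a false loss. That only works if the normals share the capture kit and processing of the target, which is why the panel should match the target cohort.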

So more samples are better than fewer. However, we have found that 10-20 samples are usually enough to capture most of the consistent characteristics and biases introduced during the lab process. Each additional sample beyond that brings a small gain in accuracy, but the computational cost is quite high, so it's a balance. You will see in our new vignette that we did CNV calling with a reference panel of 7 samples. However, for any clinical pipelines we'd suggest you build a reference panel with all available samples believed to be copy neutral.

rsteinfe1 is correct: at this time CNV Radar does not make CNV calls on the sex chromosomes, so there is no need to create a gender-stratified reference.

We just released v1.2.0 code this morning and we're working on getting our docker up shortly. There are some new workflow diagrams that I think will be useful for running CNV Radar. The ROI summaries are required for creating the normal reference.

Regards, Jeran

ExpressionAnalysis / CNV_Radar, issue #5