Separate normal pools for differential coverage calculation and variant filtering

gilhornung commented 6 years ago

Hi Christoffer,

Here is an unexpected question for you :P

As I understand it, superFreq uses the reference normal samples for two purposes: a) To calculate the differential coverage over capture regions: The multiple normals are important for estimating the coverage variance using limma-voom. b) To filter variants: If a certain variant is present in any of the normal it is flagged as "Nnc" or "Nnn".

I am working on a case in which I feel I need to use different pools of normals for these two purposes. I will briefly describe it:

I have data from different batches of FFPE samples, which all passed the same exome capture and sequencing protocol. Each batch has a different quality (possibly due to degradation etc.). When I try to use all normals for the limma-voom step, then I get terrible results with the estimation of fold-changes in coverage. It makes sense, because the large variation between batches distorts the variance estimation of limma-voom. The solution is to use only the normals from the same batch for the differential coverage step. However, since all samples underwent the same exome capturing and sequencing, I still want to use all the normal samples to flag suspicious variants.

Is there a way to use different sets for normals for the differential coverage and variant filtering?

Thanks!

Gil

ChristofferFlensburg commented 6 years ago

Haha yes! So for anyone reading, Gil sent me an email with this question and I asked him to post it here, as there is a solution.

There is support for different sets of reference normals, although it isn't very well documented, sorry. I added it mainly for myself in a similar case, so I’m happy to see that others use it as well. There are definitely cases where this is the best option, as the coverage reference normals ideally should have similar biases in as the studied samples in terms of coverage. So the coverage reference normals should ideally share all biases affecting coverage, such as PCR, fragment length, which lab, the person doing the library prep, capture kit, and so on. You may not always have many samples like this. The reference normals for the SNV calling are mainly picking up rare germline variants and alignment artifacts, which you can use a much wider range of samples for.

To do this in superFreq, use the "normalCoverageDirectory=“ option in the superFreq() function. If not entered it defaults to “normalDirectory=“ and it runs with the same reference normals for coverage and SNV filtering, but if entered (and they are different), then the “normalDirectory=“ will be used for SNV filtering and “normalCoverageDirectory=“ will be used for the coverage part in the CNA calling.

gilhornung commented 6 years ago

Quick question:

In the run script file I wrote:

normalCoverageDirectory = "/bio/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/normals_dir"
normalDirectory = "/bio/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/all_normals"

But when running superFreq I see it the normalCoverageDirectory and normalDirectory are switched:

Normal directory: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/all_normals
Normal coverage directory: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/all_normals
dbSNP directory: superFreqDbSNP
capture regions: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568-batch1/Batches1+2/capture_QC/target_files/sureselect/S04380219_Covered_mergeBed_nochr.bed
Plotting to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/plots/WS10
Saving R files to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/Rdata/WS10

What should I trust? which is the true normalCoverageDirectory and normalDirectory ?

Thanks,

Gil

ChristofferFlensburg commented 6 years ago

From the log file both directories are the normalDirectory you defined. Did you pass the normalCoverageDirectory to the superFreq() call? as In

normalCoverageDirectory = "/bio/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/normals_dir"
normalDirectory = "/bio/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/all_normals"

superFreq(allOtherOptions=allOtherOptions,
    normalDirectory=normalDirectory,
    normalCoverageDirectory =normalCoverageDirectory)

gilhornung commented 6 years ago

And now I am getting this error:

Counting normal reads over capture regions.
Error in featureCounts(externalNormalBams, annot.ext = captureAnnotation,  :
  No read files provided!
Error in featureCounts.
Input was
externalNormalBams:
captureAnnotation[1:10,]: WASH7P ? ? OR4F5 ENST00000434264 ENST00000434264 ENST00000434264 FAM87B LINC00115 LINC00115  14620  65509  65831  69481 721381 721530 721851 752916 762095 762280  14712  65625  65973  69600 721519 721806 721942 753035 762275 762414 * * * * * * * * * * 1 1 1 1 1 1 1 1 1 1
Error in runDE(bamFiles, names, externalNormalCoverageBams, captureRegions,  :
  Error in featureCounts of normals.
Calls: superFreq -> superFreq -> analyse -> runDE
Execution halted

gilhornung commented 6 years ago

Oops. I didn't pass the normalCoverageDirectory to the superFreq() call.

That's embarrassing....

ChristofferFlensburg commented 6 years ago

I'm happy I'm not the only one doing these mistakes. :D

gilhornung commented 6 years ago

Hi Christoffer,

Now I am getting this error:

Adding uncalled variants..Found 9786..Found 8507..Found 9042..Found 8777..done!
saving variants to  /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/Rdata/WS9/variants.WS9.Rdata ...done.
Importing dbSNP allele frequencies from superFreqDbSNP/dbAFnew.Rdata...done. Match against variants...done.
Importing exac allele frequencies from superFreqDbSNP/exac.Rdata...done. Match against variants...Saving variants..done.
Plotting frequency distributions to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/varscan_pipeline/4-superFreq/plots/WS9/diagnostics/frequencyDistribution/..WS9.N..WS9.C..WS9.T..WS9.R..done.
Normal variants from NA.
2017-12-19 05:34:12
Error in grepl("^chr", names(scanBamHeader(file)[[1]]$targets)) :
  error in evaluating the argument 'x' in selecting a method for function 'grepl': Error in BamFile(file, character(0)) :
  'file' must be character(1) and not NA
Calls: scanBamHeader ... scanBamHeader -> lapply -> lapply -> FUN -> open -> BamFile
Calls: superFreq ... newBamToVariants -> lapply -> lapply -> FUN -> grepl
Execution halted

ChristofferFlensburg commented 6 years ago

Seems like it can't find the reference normal bam files... Earlier in the log it lists the reference normal and reference coverage normal (search for "Normal bamfiles are") bam files found. Maybe check that they are there, and if not, check that there are bam files in the reference normal directories you point to.

gilhornung commented 6 years ago

indeed it can't find the bam files. Maybe it is because I am using soft links?

Normal bamfiles are:

Normal coverage bamfiles are:
/gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.subset1.bam
/gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.subset2.bam

ChristofferFlensburg commented 6 years ago

Soft links should work, they are resolved before superFreq reads from the files. I use them a lot myself. Not sure what can go wrong here. I guess you already double-checked for typos in the path and the files actually being there, and so on. Permissions in order? The part in the code finding the bams is

  externalNormalBams =
    normalizePath(c(list.files(path=paste0(normalDirectory), pattern = '*.bam$', full.names=T),
list.files(path=paste0(normalDirectory, '/bam'), pattern = '*.bam$', full.names=T)))

You can try to run that directly in R after you defined your normalDirectory, and see if your files turn up.

gilhornung commented 6 years ago

I called the subdir bams instead of bam.

Not my day...

ChristofferFlensburg commented 6 years ago

Ok, good that you found it! :)

SuperFreq should really give a more informative error when it cant find any reference normals though. Actually let's leave this ticket open for that purpose, and I'll add something after end of year holidays.

ChristofferFlensburg commented 6 years ago

Added a check with more informative error message when less than 2 reference normals. Will go online next version.

ChristofferFlensburg / superFreq

Separate normal pools for differential coverage calculation and variant filtering #29