ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License
109 stars 33 forks source link

superFreq counts too many var reads at a somatic position #14

Closed gilhornung closed 7 years ago

gilhornung commented 7 years ago

Hi Christoffer,

I have been reviewing some of the variants that superFreq outputs, and noticed the following discrepancy: In sample WS10-C I observe 5 var A alleles using IGV: igv I also verified this using samtools mpileup.

But superFreq reports 13 var reads:

chr start end reference variant inGene severity effect f cov ref var
14 106780634 106780634 G A IGHV4-28 10.816 missense_variant 0.121495 107 94 13

Here is a link to the bam file: https://owncloud.incpm.weizmann.ac.il/index.php/s/P9uVBlxp2JUbtEc

A link to the R dir: https://owncloud.incpm.weizmann.ac.il/index.php/s/n34Yc3rMyA1ODOl

A link to the plots dir: https://owncloud.incpm.weizmann.ac.il/index.php/s/z7QrJrrxwGe4Mqx

The links are password protected. I will send the password via e-mail

All the best,

Gil

ChristofferFlensburg commented 7 years ago

Hey Gil,

I think I have had the same issue myself. :)

IGV only shows reads with alignment score larger than 10 by default I think, and samtools mpileup also has some cuts on mapping and base quality by default. Once a variant has passed the filters, superFreq counts all reads (although downweighted by the phred probability from base and mapping quality). So probably there are a bunch of low mapping quality reads that don't show up in IGV. There is a setting somewhere in IGV that switches that of, and then the counts should mostly agree.

If that isn't the problem, I'll have to look closer.

cheers /Christoffer

gilhornung commented 7 years ago

Hi Christoffer,

I double checked and the mapq threthold in IGV was 0. Also in samtools:

samtools mpileup -Q 0 -r 14:106780634-106780634 sorted.ddup.bam | awk '{print $5}' | perl -ne '$count = () = $_=~/a/ig; print $count."\n"'
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
5
gilhornung commented 7 years ago

Another example, this time for WS4:

superFreq reports 7 variant reads, but I only see 2:

samtools mpileup -Q 0 -q 0 -r 12:120534008-120534008 sorted.ddup.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
12      120534008       N       15      tccCccCctcCccCC FFIFIIFIIIIIFFB
ChristofferFlensburg commented 7 years ago

Ok, so I finally got to fix this. Essentially solution is to run 0.9.19 or later, where the variant read counts are done through samtools mpileup instead of read from the .bam directly. This position reports 4 variant reads (two out of the 5 are on the two reads of the same fragment), so whatever the error was, it was in the no longer used part that read from the .bam.

It's a bit worrying that something was off in the older version, but I don't really have time to go and chase that down. We have run on a lot of samples that closely match other methods (like mutect) numbers, as well as wet-lab verification, so can't have been too widespread serious problem... And I'm sure there are still bugs in there for that matter. :P

Feel free to close this if the rerun gives desired counts.

gilhornung commented 7 years ago

Thanks, Much appreciated. I will test this when I get back.

Gil

On Mon, Aug 28, 2017 at 11:06 AM, Christoffer Flensburg < notifications@github.com> wrote:

Ok, so I finally got to fix this. Essentially solution is to run 0.9.19 or later, where the variant read counts are done through samtools mpileup instead of read from the .bam directly. This position reports 4 variant reads (two out of the 5 are on the two reads of the same fragment), so whatever the error was, it was in the no longer used part that read from the .bam.

It's a bit worrying that something was off in the older version, but I don't really have time to go and chase that down. We have run on a lot of samples that closely match other methods (like mutect) numbers, as well as wet-lab verification, so can't have been too widespread serious problem... And I'm sure there are still bugs in there for that matter. :P

Feel free to close this if the rerun gives desired counts.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/ChristofferFlensburg/superFreq/issues/14#issuecomment-325288452, or mute the thread https://github.com/notifications/unsubscribe-auth/AYRgKUqnp7kEwFvw51AgPnhbfplKJzBaks5scnUYgaJpZM4OoYOD .

gilhornung commented 7 years ago

Hi Christoffer,

I'm trying to run the new version of superFreq and I'm getting this error message: Importing dbSNP allele frequencies from superFreqDbSNP/dbAFnew.Rdata...Error in load(RsaveFile) : empty (zero-byte) input file

I think it is because the dbSNP file was not downloaded.

Here are the relevant error messages regarding downloading dbSNP:

Sat Sep 16 14:13:25 2017 : WS4 ...Creating superFreq dbSNP annotation directory superFreqDbSNP . --2017-09-16 14:13:25-- http://gitlab.wehi.edu.au/flensburg.c/superFreq/raw/master/dbSNP/dbAFnew.Rdata Resolving gitlab.wehi.edu.au... 128.250.252.207 Connecting to gitlab.wehi.edu.au|128.250.252.207|:80... connected. HTTP request sent, awaiting response... 301 Moved Permanently Location: https://gitlab.wehi.edu.au:443/flensburg.c/superFreq/raw/master/dbSNP/dbAFnew.Rdata [following] --2017-09-16 14:13:28-- https://gitlab.wehi.edu.au/flensburg.c/superFreq/raw/master/dbSNP/dbAFnew.Rdata Connecting to gitlab.wehi.edu.au|128.250.252.207|:443... connected. OpenSSL: error:100AE081:elliptic curve routines:EC_GROUP_new_by_curve_name:unknown group OpenSSL: error:1408D010:SSL routines:SSL3_GET_KEY_EXCHANGE:EC lib Unable to establish SSL connection.

Thanks,

Gil

gilhornung commented 7 years ago

I bypassed the dbSNP error by downloading the dbAFnew.Rdata and exac.Rdata files manually. However, now I have new error:

Variants for WS4
Loading saved variants for WS4
Importing dbSNP allele frequencies from superFreqDbSNP/dbAFnew.Rdata...done.
Match against variants...done.

Importing exac allele frequencies from superFreqDbSNP/exac.Rdata...done.
Match against variants...Error in .Method(..., na.last = na.last, decreasing = decreasing) :
  argument 1 is not a vector
ChristofferFlensburg commented 7 years ago

Hi Gil!

Hmm, we had some issues with the resource download acting up a couple weeks ago, but I thought that was resolved. I've changed from wget to libcurl which may be more robust... As you already got the file, I won't push it online immediately but it'll be in 0.9.22. Let me know if you want it now.

For the error, I think something is odd with your variants. Seems like the error is from a sort command, and the only sort command between your last output and the next one is a sort over the genome of the variants, which shouldn't depend on exac. Although it is a suspicious coincidence that you just had problems with exac and now suddenly you get an error in that part of the run.

So maybe you can try to rerun with forceRedoVariants and forceRedoNormalVariants (just to be sure)? If problem persist, I think I'll need to look closer at what is going on.

gilhornung commented 7 years ago

Hi,

I tried redoing with

forceRedoVariants = TRUE
forceRedoNormalVariants = TRUE```

But superFreq still fails. below is the output. I will send you an example of a vcf by e-mail.

For the sake of being organized, maybe we should move this discussion to a different issue?

> source("run_superfreq_WS4.R")
Splitting meta data into participants.
Loading sample meta data from file...done.
Standardised sample names to:
WS4.N
WS4.C
WS4.T
WS4.R

Planning to run over these participants:
WS4
Now running:
Mon Sep 18 21:44:58 2017 :  WS4 ...

 2017-09-18 21:44:59
######################################################################
Running superFreq version 0.9.21
Runtime tracking and QC information printed to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run2/R/WS4/runtimeTracking.log.
Starting run with input files:
sampleMetaDataFile: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run2/splitMetaData/WS4.tsv
vcfFiles:

Normal directory: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/normals_dir
Normal coverage directory: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/normals_dir
dbSNP directory: superFreqDbSNP
capture regions: /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568-batch1/Batches1+2/capture_QC/target_files/sureselect/S04380219_Covered_mergeBed_nochr.bed
Plotting to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run2/plots/WS4
Saving R files to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run2/R/WS4
Genome is hg19
Running on at most 24 cpus.

Parameters for this run are:
   maxCov:              150
   systematicVariance:  0.06
   cloneDistanceCut:  2.326348

Normal bamfiles are:
/gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.subset1.bam
/gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.subset2.bam
Normal coverage bamfiles are:
/gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.subset1.bam
/gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.subset2.bam
Loading capture regions..done.
Imported capture regions with 286754 regions and 22717 unique gene names.
Mean GC content is 0.472.
Loading sample meta data from file...done.
Deciding which pairs to scatter plot..done.
Deciding which time series to plot..done.
##################################################################################################

 2017-09-18 21:44:59
 Imported and sanity checked meta data. Looking good so far!
 metadata:
   BAM   VCF   INDIVIDUAL   NAME   TIMEPOINT   NORMAL
   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/sorted.ddup.bam   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-N/biallelic.vcf   WS4   WS4.N   N   YES
   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-C/sorted.ddup.bam   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_pair_data/WS4-C_vs_WS4-N/superFreq_input.vcf   WS4   WS4.C   C   NO
   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-T/sorted.ddup.bam   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_pair_data/WS4-T_vs_WS4-N/superFreq_input.vcf   WS4   WS4.T   T   NO
   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_sample_data/WS4-R/sorted.ddup.bam   /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run1/per_indiv/WS4/per_pair_data/WS4-R_vs_WS4-N/superFreq_input.vcf   WS4   WS4.R   R   NO

 timeSeries:
   WS4.N   WS4.C   WS4.T   WS4.R

##################################################################################################

Starting differential coverage analysis by sample.
Loading saved differential coverage results.
Loaded saved differential coverage results
Plotting volcanoes to /gpfs/units/bioinformatics/projects/Sheba/MayaDadiani/INCPMPM-1568/analysis/batch1/run2/plots/WS4/volcanoes/..done!
Using variants by individual.

Variants for WS4
Loading saved variants for WS4
Importing dbSNP allele frequencies from superFreqDbSNP/dbAFnew.Rdata...done. Match against variants...done.
Importing exac allele frequencies from superFreqDbSNP/exac.Rdata...done. Match against variants...Error in .Method(..., na.last = na.last, decreasing = decreasing) :
  argument 1 is not a vector
gilhornung commented 7 years ago

It seems that the new superfreq version solved the bug.

Below is a plot with the frequency of the somatic variants of one on the patients as it was calculated using mutect2, or the two versions of superfreq. The new version is much more consistent with mutect2, and also the case described in the posts above also seems OK in the new version.

image

I will close the issue.

Thanks!

Gil