broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License
33 stars 20 forks source link

Evaluation using GPC2 data. #299

Closed vruano closed 8 years ago

vruano commented 8 years ago

This is the e-mail correspondence on this data-set, it is self explanatory (top is latter, bottom is earlier {kinda-off}).

Hi Valentin,

Bob found 92 samples from the GPC2 cohort that have both WGS and exome data. This might make be another good set to look at XHMM vs Genome STRiP calls in. I don't know if the fact that the exomes are nextera would be an obstacle in terms of finding other exome BAMs to throw in to XHMM.

Let me know if you are interested in running XHMM on these and comparing to Genome STRiP results.

Chris


Hi, Chris, There are 92 gpc2 samples that have both exome and WGS sequencing (this is using the nextera capture). These might also be a useful resource for benchmarking XHMM. We have excellent WGS calls on these samples. -Bob


Hi Bob,

Ashley forwarded me your email regarding the GPC plate that had Whole Genomes and Whole Exomes. Sorry for the confusion, as that will soon apply to the PIC samples. ;)

Here is an email that should help you. All samples are African American. Attached is a file the BAM file paths for the exomes. The VCF for these exomes can be found here: /seq/dax/C1813/Exome/v1/C1813.vcf.gz

VCF for the Whole Genomes can be found here (this file path was from Jim): /psych/genetics_data/stored_data/seq/wgs/gpc/freeze5/UG_results/gpc_freeze5.UnifiedGenotyper.recalibrated.vcf.gz

I believe you already have access to the Whole Genome BAMs, but if you need file paths, just let me know.

Best, Kim


Yossi showed me once that you can find the target intervals in the

.hybrid_selection_metrics file in the /seq bam directory. For these it looks like it is: /seq/references/HybSelOligos/whole_exome_illumina_coding_v1/whole_exome_illumina_coding_v1.Homo_sapiens_assembly19.targets.interval_list --- You can find a decent GS callset for the GPC2 cohort (500 samples including these) here: /humgen/cnp04/sandbox/skashin/projects/gpc2/callset/gs_cnv.filtered.vcf Those calls are filtered to have a 3% FDR.
vruano commented 8 years ago

The List of Exome bams can be obtained from this spread-sheet here. The all seem to be accessible from the cluster servers.

Al the locations above seem to work.

vruano commented 8 years ago

@cwhelan any updates about GPC2? I guess we still are on the clear to use this dataset for evaluation purposes.

cwhelan commented 8 years ago

I think they are still fine to use. Do you need something from me to find them?

vruano commented 8 years ago

I had a quick look and seems that the paths are still alright

vruano commented 8 years ago

For @LeeTL1220,

This is my evaluation on the calls in (@cwhelan may be able to confirm whether this is the file to use)

/humgen/cnp04/studies/gpc2/cnv_output_AfAm/results/gs_cnv.genotypes.vcf.gz

Based on the following filters:

-minTruthQual 30 (Genome STRiP GQ >= 30 to consider a truth call for evaluation) 
-minCallQual 90 (GATK4 GQ >= 90, this is the "Some Quality" or SQ, to consider a positive call)
-minTruthLen 4 (At least 4 targets covered by the event for the truth event to be considered for evaluation)
-minCallLen 4 (At least 4 targets covered by the called segment to be considered a positive call)
-maxTruthFreq 0.02 (At most 2% across samples {effectively singletons) to consider a truth event in evaluations)
-maxCallFreq 0.02 (At most 2% across {effectively singletons} to consider a segment as a positivie call.
-applyMATFilter (disregard truth call in region that show both dup and dels across samples)
-applyMACFilter (do not regard as positive segment calls when there is samples that support both type of events).

TP = 46 FN = 30 FP = 18

Caveats:

  1. We need to refine the way we filter the truth as per @cwhelan advice we should consider to filter base on the GSCNQUAL Info annotation. Also the 30 threshold may need to a bit higher.
  2. At this moment we don't care what is the overlap between truth and called segment in terms of shared targets; it is rather expected that the boundaries between truth and call may not match perfectly due to noise, but at this point we simply accent even only a single target overlap as good to match truth and calls; these may share just a small number of targets (intersection) out of a large number (union).
  3. Some truth events close but not adjacent in genome coordinate space may actually be adjacent in target space. In such case we cannot expect the exome based tool to do any better than calling them all as part of a larger event. Currently we would count those truth event as independent TP cases but perhaps we should consider them as a single TP.

Looking at smaller events min. 1-target events we get:

TP = 86 FN = 227 FP = 18

So there is no added false positive, presumably the SQ >= 90 is good enough to prevent these to crop up. Reducing the threshold to SQ >= 30:

TP = 122 FN = 191 FP = 45

Back to SQ >= 90 with min 2 target events:

TP = 67 FN = 96 FP = 18

And min. 3 target events:

TP = 55 FN = 56 FP = 18

So add on figures from 4 targets down to 1 target are

(target count, dTP, dFN, dFP) (>4, 46, 30, 18) (3, 9, 26, 0) (2, 12, 40, 0) (1, 55, 95, 0)

Interestingly 1 target truth hav a better sensitivity than 2-target and 3-target truth; perhaps some of the caveats mentioned above has to do with that.

vruano commented 8 years ago

I pass this to @davidbenjamin. Please let me know if you need pointers to the data.

davidbenjamin commented 8 years ago

Done with original germline model; issue #541 is for new model.

vruano commented 8 years ago

@asmirnov239 This is the original issue/ticket for the analysis. Please read thru it to find out about the origin of the data... I will add more info in another comment.

vruano commented 8 years ago

@asmirnov239 My analysis directory is /dsde/working/valentin/germline-cnv/gpc2 which makes reference to another directory that contains links to the data /dsde/working/valentin/germline-cnv/data/gpc2.

Some file names are self-explanatory but some-other are not so, feel free to come to me if you get lost.