vruano closed this issue 8 years ago
The list of exome BAMs can be obtained from this spreadsheet. They all seem to be accessible from the cluster servers.
All the locations above seem to work.
@cwhelan any updates about GPC2? I guess we are still in the clear to use this dataset for evaluation purposes.
I think they are still fine to use. Do you need something from me to find them?
I had a quick look and it seems that the paths are still all right.
For @LeeTL1220,
This is my evaluation of the calls in the following file (@cwhelan may be able to confirm whether this is the file to use):
/humgen/cnp04/studies/gpc2/cnv_output_AfAm/results/gs_cnv.genotypes.vcf.gz
Based on the following filters:
-minTruthQual 30 (Genome STRiP GQ >= 30 to consider a truth call for evaluation)
-minCallQual 90 (GATK4 GQ >= 90, this is the "Some Quality" or SQ, to consider a positive call)
-minTruthLen 4 (At least 4 targets covered by the event for the truth event to be considered for evaluation)
-minCallLen 4 (At least 4 targets covered by the called segment to be considered a positive call)
-maxTruthFreq 0.02 (At most 2% across samples, effectively singletons, to consider a truth event in evaluations)
-maxCallFreq 0.02 (At most 2% across samples, effectively singletons, to consider a segment as a positive call)
-applyMATFilter (disregard truth calls in regions that show both dups and dels across samples)
-applyMACFilter (do not regard segment calls as positive when there are samples that support both types of events)
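To make the filter semantics concrete, here is a minimal Python sketch of how the thresholds above could be applied to a single event. The `Event` record, its field names, and the `passes` helper are hypothetical and are not part of the evaluation tool; only the threshold values come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical record for a truth call or a called segment."""
    quality: float      # GQ for Genome STRiP truth calls, SQ for GATK4 segments
    target_count: int   # number of exome targets covered by the event
    frequency: float    # fraction of samples with an event at this locus
    mixed_types: bool   # True if both dups and dels are seen across samples


def passes(event, min_qual, min_len, max_freq, drop_mixed=True):
    """Return True if the event survives all four filters."""
    if event.quality < min_qual:
        return False
    if event.target_count < min_len:
        return False
    if event.frequency > max_freq:
        return False
    if drop_mixed and event.mixed_types:
        return False
    return True


# Threshold values taken from the filter list above.
TRUTH_FILTERS = dict(min_qual=30, min_len=4, max_freq=0.02)  # + MAT filter
CALL_FILTERS = dict(min_qual=90, min_len=4, max_freq=0.02)   # + MAC filter

truth = Event(quality=45, target_count=6, frequency=0.01, mixed_types=False)
call = Event(quality=80, target_count=6, frequency=0.01, mixed_types=False)
print(passes(truth, **TRUTH_FILTERS))  # True: GQ 45 >= 30
print(passes(call, **CALL_FILTERS))    # False: SQ 80 < 90
```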
TP = 46 FN = 30 FP = 18
Caveats:
Looking at smaller events (min. 1-target events), we get:
TP = 86 FN = 227 FP = 18
So there are no added false positives; presumably SQ >= 90 is strict enough to prevent these from cropping up. Reducing the threshold to SQ >= 30:
TP = 122 FN = 191 FP = 45
Back to SQ >= 90 with min 2 target events:
TP = 67 FN = 96 FP = 18
And min. 3 target events:
TP = 55 FN = 56 FP = 18
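For easier comparison, the counts reported so far can be turned into sensitivity, TP / (TP + FN), and precision, TP / (TP + FP). The sketch below only re-expresses the numbers already quoted in this thread; the dictionary layout and function names are illustrative choices.

```python
def sensitivity(tp, fn):
    """Fraction of truth events that were recovered."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of positive calls that matched a truth event."""
    return tp / (tp + fp)


# (call quality threshold, min targets) -> (TP, FN, FP), as reported above.
results = {
    ("SQ>=90", 4): (46, 30, 18),
    ("SQ>=90", 3): (55, 56, 18),
    ("SQ>=90", 2): (67, 96, 18),
    ("SQ>=90", 1): (86, 227, 18),
    ("SQ>=30", 1): (122, 191, 45),
}

for (threshold, min_targets), (tp, fn, fp) in results.items():
    print(
        f"{threshold} min_targets={min_targets}: "
        f"sensitivity={sensitivity(tp, fn):.3f} "
        f"precision={precision(tp, fp):.3f}"
    )
```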
So the add-on figures going from min. 4 targets down to 1 target are:
(target count, dTP, dFN, dFP)
(>=4, 46, 30, 18)
(3, 9, 26, 0)
(2, 12, 40, 0)
(1, 55, 95, 0)
Interestingly, 1-target truth events have better sensitivity than 2-target and 3-target truth events; perhaps some of the caveats mentioned above have to do with that.
I am passing this to @davidbenjamin. Please let me know if you need pointers to the data.
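The add-on figures can be re-derived by differencing the cumulative SQ >= 90 counts quoted earlier; the sketch below does just that (variable names are illustrative). Differencing reproduces the 3-target and 2-target rows. For the 1-target row it gives (19, 131, 0), whereas the quoted dTP = 55 and dFN = 95 equal the SQ >= 30 min-1 totals (122, 191) minus the SQ >= 90 min-2 totals (67, 96), so that row may be worth double-checking before reading too much into the 1-target sensitivity.

```python
# Cumulative (TP, FN, FP) at SQ >= 90, keyed by minimum target count,
# as reported earlier in this thread.
cumulative = {4: (46, 30, 18), 3: (55, 56, 18), 2: (67, 96, 18), 1: (86, 227, 18)}

# Difference each configuration against the next stricter one to get the
# contribution of events covering exactly that many targets.
levels = sorted(cumulative, reverse=True)  # [4, 3, 2, 1]
deltas = {}
for stricter, looser in zip(levels, levels[1:]):
    deltas[looser] = tuple(
        b - a for a, b in zip(cumulative[stricter], cumulative[looser])
    )

for targets, (d_tp, d_fn, d_fp) in deltas.items():
    stratum_sensitivity = d_tp / (d_tp + d_fn)
    print(f"{targets} targets: dTP={d_tp} dFN={d_fn} dFP={d_fp} "
          f"sensitivity={stratum_sensitivity:.3f}")
```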
Done with original germline model; issue #541 is for new model.
@asmirnov239 This is the original issue/ticket for the analysis. Please read through it to find out about the origin of the data. I will add more info in another comment.
@asmirnov239 My analysis directory is /dsde/working/valentin/germline-cnv/gpc2 which makes reference to another directory that contains links to the data /dsde/working/valentin/germline-cnv/data/gpc2.
Some file names are self-explanatory but others are not, so feel free to come to me if you get lost.
This is the e-mail correspondence on this dataset; it is self-explanatory (top is later, bottom is earlier, kind of).