Illumina / hap.py

Haplotype VCF comparison tools
Other
406 stars 125 forks source link

happy treats variants that aren't fully within the confident regions as confident #34

Open depristo opened 6 years ago

depristo commented 6 years ago

In our deep dive into happy with DeepVariant we've noticed that hap.py includes in its confident calls for comparison any multi-base event (e.g., a deletion) that merely overlaps with the confident regions, rather than requiring that event to be full contained within the confident regions. Internally we require the event to be fully contained within the confident regions to be considered confident. We wanted to double check here that (a) hap.py intended this behavior and (b) check this is the recommended approach (overlaps, not contains) according to the GA4GH benchmarking recommendations?

pkrusche commented 6 years ago

I think we discussed this in the GA4GH benchmarking group and the conclusion was that this is more of a truthset rather than a comparison issue. Hap.py does stratify these out using the TS_contained (which requires the events to be fully contained) and TS_boundary tags in the VCF (and provides the corresponding precision/recall metrics in the .extended.csv file).

The reason we include both these event types in the total by default is the following:

depristo commented 6 years ago

Here's a concrete example:

candidate: no call is present
truth:
20      16755600        rs369843912     AAGAAAAAAG      A       50      PASS    .   GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0/1:228:130,71:145,77:366:0/1:.:.

happy:
20      16755600        .       AAGAAAAAAG      A       50      .       BS=16755600;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ

BED confident region:
20      16755608        16790996

Happy is tagging this variant as TS_contained but that's clearly not the case here.

Here's our exact command line:

sudo docker run -it \
-v "${ROOT}:${ROOT}" \
-v "${OUTPUT_DIR}:${OUTPUT_DIR}" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
  "${TRUTH_VCF}" \
  "${LABELED_VCF}" \
  --preprocess-truth \
  -f "${TRUTH_BED}" \
  -r "${REF%.gz}" \
  -o "${OUTPUT_DIR}/happy.output" \
  -l ${REGION}

Picture of the region (confident region BED is on the bottom): screen shot 2018-01-11 at 4 01 59 pm

This variant is from HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz and it's associated BED file.

pkrusche commented 6 years ago

This is probably the result of hap.py 'fixing' the confident regions after preprocessing the truth VCF (the corresponding command line options are --adjust-conf-regions and --no-adjust-conf-regions, the default is on to work around issues with older Platinum Genomes releases) to make sure that any truth variant that just overlaps the truth confident regions will be fully contained.

In the GA4GH benchmarking setting we'd typically not preprocess the truth calls (i.e. the truth calls are treated as the 'gold standard' which we shouldn't really change), but we also use this type of preprocessing for internal assessment / methods improvement -- unfortunately, it adds some corner cases just like the one above.

I think running with --no-adjust-conf-regions should at least label the call above correctly.