Closed SophieS9 closed 6 months ago
Hello,
Thank you for including the VCF with your ticket. It makes it so much easier for me to work on.
The VCF's format is compliant to VCF standards. However, the e.g. <STR20>
type annotations are not handled by truvari. Truvari assumes that variants are resolved whereas this VCF has CNV-like information. However, we can make this VCF truvari compatible by running a simple script that sets the SVTYPE/SVLEN.
import pysam
vcf = pysam.VariantFile("original_sample_split.vcf")
# Add the new header line to the VCF. SVTYPE is already in the VCF, though it is not used
header = vcf.header.copy()
header.add_line('##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Alternate length - Reference length">')
output = pysam.VariantFile("/dev/stdout", 'w', header=header)
for entry in vcf:
if entry.alts is None:
## Reference monozygotic seems to happen sometimes. We can't alter those
output.write(entry)
continue
# parse the ref/alt alleles
alt = entry.alts[0]
alt_copy_number = int(alt[len("<STR"):-1])
ref_copy_number = entry.info["REF"]
# The reference and alternate sequences *should* just be some number of motifs
motif = entry.info['RU']
motif_len = len(motif)
ref_len = ref_copy_number * motif_len
alt_len = alt_copy_number * motif_len
# We can set the length as the ALT - REF, and when length is < 0 it is a DEL, > 0 an INS
svlen = alt_len - ref_len
if svlen > 0:
svtype = "INS"
else:
svtype = "DEL"
# output
entry.translate(header)
entry.info['SVLEN'] = svlen
entry.info['SVTYPE'] = svtype
output.write(entry)
I made some assumptions here, so please check that this is correct behavior before committing to this script. The main point is that this style of VCF is not handled by Truvari and TR expansions/contractions can be represented as insertions/deletions just fine.
Now we just convert the VCF
python converter.py | bgzip > typed_trs.vcf.gz
tabix typed_trs.vcf.gz
For speed, we're just going to subset the benchmark to only the regions analyzed by the caller.
This is going to inflate the recall. The correct way to do this would be to intersect ExpansionHunter's bed file of
regions it analyzed with the benchmark's regions to create the subset.regions.bed
bedtools intersect -u -a GIABTR.HG002.benchmark.regions.bed.gz -b typed_trs.vcf.gz > subset.regions.bed
Finally, we run truvari with the recommended parameters from the benchmark's README. Note that we also added --pctseq 0
since these are not sequence resolved calls. This also means the variants cannot be run through truvari refine
truvari bench -b GIABTR.HG002.benchmark.vcf.gz --includebed subset.regions.bed \
-s 5 --pick ac --pctseq 0 -c typed_trs.vcf.gz -o bench_result
Results:
{
"TP-base": 18,
"TP-comp": 17,
"FP": 7,
"FN": 4,
"precision": 0.7083333333333334,
"recall": 0.8181818181818182,
"f1": 0.7593052109181142,
"base cnt": 22,
"comp cnt": 24,
"TP-comp_TP-gt": 14,
"TP-comp_FP-gt": 3,
"TP-base_TP-gt": 14,
"TP-base_FP-gt": 4,
"gt_concordance": 0.8235294117647058,
"gt_matrix": {
"(1, 0)": {
"(0, 1)": 3,
"(1, 0)": 2,
"(1, 1)": 1
},
"(0, 1)": {
"(0, 1)": 3,
"(1, 0)": 1,
"(1, 1)": 2
},
"(1, 1)": {
"(1, 1)": 5
},
"(1, None)": {
"(1, 1)": 1
}
}
}
Thanks so much for the detailed reply @ACEnglish! All makes sense!
Hi, I'm currently also looking at STRs from ExpansionHunter (DRAGEN version and the standalone). I am a bit confused about why it's necessary to intersect the ExpansionHunter BED file with the GIAB TR BED file. Is this not explicitely handled by truvari refine
? Basically I would expect the following two approaches generating the same results:
1:
truvari bench -b GIABTR.HG002.benchmark.vcf.gz --includebed GIABTR.HG002.benchmark.regions.bed.gz \
-s 5 --pick ac --pctseq 0 -c typed_trs.vcf.gz -o bench_result
truvari refine --use-original-vcfs --regions ExpansionHunter_Catalog.bed bench_result/
2:
bedtools intersect -u -a GIABTR.HG002.benchmark.regions.bed.gz -b ExpansionHunter_Catalog.bed > subset.regions.bed
truvari bench -b GIABTR.HG002.benchmark.vcf.gz --includebed subset.regions.bed \
-s 5 --pick ac --pctseq 0 -c typed_trs.vcf.gz -o bench_result
truvari refine --regions subset_regions.bed bench_result/
Is my understanding correct? Best, Raphael
Edit:
In the refine
wiki page, I found this paragraph:
Therefore, you can run bench on the full benchmark's regions (--includebed), and automatically subset to only the regions analyzed by the caller with refine --regions.
Yes, refine does as you've described. The subsetting I was describing was as it relates to the bench output.
HI @ACEnglish - thanks for the brilliant tool and the TR benchmarking work!
I have a query regarding the benchmarking files, Truvari and outputs from Expansion Hunter, specifically looking at STRs. We've sequenced HG002 and produced an alignment file using the Illumina DRAGEN platform. I've then run this through Expansion Hunter and I wanted to use Truvari and the TR benchmarking files to get some precision/recall metrics for this.
I've run as follows (I had to put --pctseq 0 and remove other parameters), after splitting multiallelic calls with bcftools norm:
As you can see, I'm getting TP as 0, but from eyeballing my VCF file from Expansion Hunter and from the benchmark VCF, I'd expect at least some TPs. Here's an example:
Is this a formatting issue as I've called STR but the benchmark VCF doesn't? I took a look and there are no SVTYPE=STR in the benchmark VCF, so I'm wondering if I'm even using it correctly?!! Is it just that I need to run truvari refine as well?
Any thoughts/advice are welcome! I've attached my VCF for reference (changed to txt just for upload purposes).
Thanks
Sophie
original_sample_split.txt