ACEnglish / truvari

Structural variant toolkit for VCFs
MIT License

Truvari, STRs and Expansion Hunter - Query #212

Closed SophieS9 closed 6 months ago

SophieS9 commented 6 months ago

Hi @ACEnglish - thanks for the brilliant tool and the TR benchmarking work!

I have a query regarding the benchmarking files, Truvari and outputs from Expansion Hunter, specifically looking at STRs. We've sequenced HG002 and produced an alignment file using the Illumina DRAGEN platform. I've then run this through Expansion Hunter and I wanted to use Truvari and the TR benchmarking files to get some precision/recall metrics for this.

I've run as follows (I had to put --pctseq 0 and remove other parameters), after splitting multiallelic calls with bcftools norm:

2024-04-24 14:50:43,875 [INFO] Truvari v4.2.2
2024-04-24 14:50:43,875 [INFO] Command /home/v.so257591/.conda/envs/truvari/bin/truvari bench --pctseq 0 -b ../HG002_GRCh38_TandemRepeats_v1.0.1.vcf.gz -c original_sample_split.vcf.gz --includebed ../HG002_GRCh38_TandemRepeats_v1.0.1_Tier1.bed.gz -o Original_Bench_Results/ -f /data/resources/human/references/GRCh38_masked/GRCh38_full_analysis_set_plus_decoy_hla_masked.fa
2024-04-24 14:50:43,875 [INFO] Params:
{
    "base": "/mnt/HomeApps/Home/v.so257591/STR_Validation/HG002_GRCh38_TandemRepeats_v1.0.1.vcf.gz",
    "comp": "/mnt/HomeApps/Home/v.so257591/STR_Validation/Version4/original_sample_split.vcf.gz",
    "output": "Original_Bench_Results/",
    "includebed": "/mnt/HomeApps/Home/v.so257591/STR_Validation/HG002_GRCh38_TandemRepeats_v1.0.1_Tier1.bed.gz",
    "extend": 0,
    "debug": false,
    "reference": "/data/resources/human/references/GRCh38_masked/GRCh38_full_analysis_set_plus_decoy_hla_masked.fa",
    "refdist": 500,
    "pctseq": 0.0,
    "minhaplen": 50,
    "pctsize": 0.7,
    "pctovl": 0.0,
    "typeignore": false,
    "chunksize": 1000,
    "bSample": "HG002",
    "cSample": "211028_A00748_0168_AHFYNFDSX2_20M05405-1",
    "dup_to_ins": false,
    "sizemin": 50,
    "sizefilt": 30,
    "sizemax": 50000,
    "passonly": false,
    "no_ref": false,
    "pick": "single",
    "check_monref": true,
    "check_multi": true
}
2024-04-24 14:51:34,743 [INFO] Including 1638106 bed regions
2024-04-24 14:52:07,773 [INFO] Found 9 chromosomes with overlapping regions
2024-04-24 14:53:06,778 [INFO] Zipped 1196641 variants Counter({'base': 1196606, 'comp': 35})
2024-04-24 14:53:06,778 [INFO] 11344 chunks of 1196641 variants Counter({'__filtered': 1180470, 'base': 16149, 'comp': 22})
2024-04-24 14:53:08,386 [INFO] Stats: {
    "TP-base": 0,
    "TP-comp": 0,
    "FP": 12,
    "FN": 16149,
    "precision": 0.0,
    "recall": 0.0,
    "f1": null,
    "base cnt": 16149,
    "comp cnt": 12,
    "TP-comp_TP-gt": 0,
    "TP-comp_FP-gt": 0,
    "TP-base_TP-gt": 0,
    "TP-base_FP-gt": 0,
    "gt_concordance": 0,
    "gt_matrix": {}
}
2024-04-24 14:53:08,386 [INFO] Finished bench

As you can see, I'm getting TP as 0, but from eyeballing my VCF file from Expansion Hunter and from the benchmark VCF, I'd expect at least some TPs. Here's an example:

#My Expansion hunter VCF
chr2    190880872       .       C       <STR8>  .       PASS    END=190880920;REF=16;RL=48;RU=GCA;VARID=GLS;REPID=GLS   GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC     1/0:SPANNING/SPANNING:8/14:8-8/14-14:15/16:11/14:0/0:33.8096
chr2    190880872       .       C       <STR14> .       PASS    END=190880920;REF=16;RL=48;RU=GCA;VARID=GLS;REPID=GLS   GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC     0/1:SPANNING/SPANNING:8/14:8-8/14-14:15/16:11/14:0/0:33.8096

#Benchmark VCF
chr2    190880872       .       CGCAGCA C       60      PASS    AN=2;AC=1;NS=86;AF=0.168605;MAF=0.168605;AC_Het=27;AC_Hom=2;AC_Hemi=0;HWE=0.446931;ExcHet=0.260788   GT:BPDP:FT:AD   0/1:1,1,1,1:PASS:1,1
chr2    190880872       .       CGCAGCAGCAGCAGCAGCAGCAGCA       C       60      PASS    SVTYPE=DEL;SVLEN=24;AN=2;AC=1;NS=86;AF=0.244186;MAF=0.244186;AC_Het=34;AC_Hom=8;AC_Hemi=0;HWE=0.769395;ExcHet=0.386269       GT:BPDP:FT:AD   1/0:1,1,1,1:PASS:1,1

#Benchmark BED
chr2    190880842       190880919       Tier1   TP_TP_TP        13      0.8795  24      6

Is this a formatting issue, since my calls use <STR> alleles but the benchmark VCF doesn't? I took a look and there are no SVTYPE=STR records in the benchmark VCF, so I'm wondering if I'm even using it correctly?! Is it just that I need to run truvari refine as well?

Any thoughts/advice are welcome! I've attached my VCF for reference (changed to txt just for upload purposes).

Thanks

Sophie

original_sample_split.txt

ACEnglish commented 6 months ago

Hello,

Thank you for including the VCF with your ticket. It makes it so much easier for me to work on.

The VCF's format is compliant with the VCF specification. However, symbolic allele annotations such as <STR20> are not handled by Truvari. Truvari assumes that variants are sequence-resolved, whereas this VCF carries CNV-like (copy-number) information. However, we can make this VCF Truvari-compatible by running a simple script that sets the SVTYPE/SVLEN.

converter.py

import pysam

vcf = pysam.VariantFile("original_sample_split.vcf")
# Add the new header line to the VCF. SVTYPE is already in the VCF, though it is not used
header = vcf.header.copy()
header.add_line('##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Alternate length - Reference length">')
output = pysam.VariantFile("/dev/stdout", 'w', header=header)

for entry in vcf:
    if entry.alts is None:
        # Monomorphic-reference records (no ALT allele) seem to happen sometimes. We can't alter those
        output.write(entry)
        continue

    # parse the ref/alt alleles
    alt = entry.alts[0]
    alt_copy_number = int(alt[len("<STR"):-1])
    ref_copy_number = entry.info["REF"]

    # The reference and alternate sequences *should* just be some number of motifs
    motif = entry.info['RU']
    motif_len = len(motif)
    ref_len = ref_copy_number * motif_len
    alt_len = alt_copy_number * motif_len

    # We can set the length as the ALT - REF, and when length is < 0 it is a DEL, > 0 an INS
    svlen = alt_len - ref_len

    if svlen > 0:
        svtype = "INS"
    else:
        svtype = "DEL"

    # output    
    entry.translate(header)
    entry.info['SVLEN'] = svlen
    entry.info['SVTYPE'] = svtype
    output.write(entry)

I made some assumptions here, so please check that this is the correct behavior before committing to this script. The main point is that this style of VCF is not handled by Truvari, and that TR expansions/contractions can be represented as insertions/deletions just fine.
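As a quick sanity check of the arithmetic, here is a standalone sketch (the helper function name is mine, not part of the script above) applied to the GLS records quoted in the question, where RU=GCA and REF=16:

```python
def str_allele_to_svlen(alt: str, ref_copy_number: int, motif: str) -> tuple[int, str]:
    """Convert a symbolic <STRn> allele into an (SVLEN, SVTYPE) pair."""
    alt_copy_number = int(alt[len("<STR"):-1])
    svlen = (alt_copy_number - ref_copy_number) * len(motif)
    svtype = "INS" if svlen > 0 else "DEL"
    return svlen, svtype

# <STR8> with 16 reference copies of GCA: (8 - 16) * 3 = -24, a 24 bp deletion,
# which lines up with the benchmark's SVTYPE=DEL;SVLEN=24 record at chr2:190880872.
print(str_allele_to_svlen("<STR8>", 16, "GCA"))   # (-24, 'DEL')
# <STR14>: (14 - 16) * 3 = -6, matching the benchmark's 6 bp CGCAGCA>C deletion.
print(str_allele_to_svlen("<STR14>", 16, "GCA"))  # (-6, 'DEL')
```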

Now we just convert the VCF

python converter.py | bgzip > typed_trs.vcf.gz
tabix typed_trs.vcf.gz

For speed, we're just going to subset the benchmark to only the regions where the caller produced a call. This is going to inflate the recall. The correct way to do this would be to intersect ExpansionHunter's BED file of all the regions it analyzed with the benchmark's regions to create the subset.regions.bed

bedtools intersect -u -a GIABTR.HG002.benchmark.regions.bed.gz -b typed_trs.vcf.gz > subset.regions.bed

Finally, we run truvari bench with the recommended parameters from the benchmark's README. Note that we also add --pctseq 0 since these are not sequence-resolved calls. This also means the variants cannot be run through truvari refine.

truvari bench -b GIABTR.HG002.benchmark.vcf.gz --includebed subset.regions.bed \
    -s 5 --pick ac --pctseq 0 -c typed_trs.vcf.gz -o bench_result

Results:

{
    "TP-base": 18,
    "TP-comp": 17,
    "FP": 7,
    "FN": 4,
    "precision": 0.7083333333333334,
    "recall": 0.8181818181818182,
    "f1": 0.7593052109181142,
    "base cnt": 22,
    "comp cnt": 24,
    "TP-comp_TP-gt": 14,
    "TP-comp_FP-gt": 3,
    "TP-base_TP-gt": 14,
    "TP-base_FP-gt": 4,
    "gt_concordance": 0.8235294117647058,
    "gt_matrix": {
        "(1, 0)": {
            "(0, 1)": 3,
            "(1, 0)": 2,
            "(1, 1)": 1
        },
        "(0, 1)": {
            "(0, 1)": 3,
            "(1, 0)": 1,
            "(1, 1)": 2
        },
        "(1, 1)": {
            "(1, 1)": 5
        },
        "(1, None)": {
            "(1, 1)": 1
        }
    }
}
SophieS9 commented 6 months ago

Thanks so much for the detailed reply @ACEnglish! All makes sense!

raphaelbetschart commented 1 day ago

Hi, I'm currently also looking at STRs from ExpansionHunter (the DRAGEN version and the standalone). I am a bit confused about why it's necessary to intersect the ExpansionHunter BED file with the GIAB TR BED file. Isn't this explicitly handled by truvari refine? Basically, I would expect the following two approaches to generate the same results:

1:

truvari bench -b GIABTR.HG002.benchmark.vcf.gz --includebed GIABTR.HG002.benchmark.regions.bed.gz \
    -s 5 --pick ac --pctseq 0 -c typed_trs.vcf.gz -o bench_result
truvari refine --use-original-vcfs --regions ExpansionHunter_Catalog.bed bench_result/

2:

bedtools intersect -u -a GIABTR.HG002.benchmark.regions.bed.gz -b  ExpansionHunter_Catalog.bed > subset.regions.bed
truvari bench -b GIABTR.HG002.benchmark.vcf.gz --includebed subset.regions.bed \
    -s 5 --pick ac --pctseq 0 -c typed_trs.vcf.gz -o bench_result
truvari refine --regions subset.regions.bed bench_result/

Is my understanding correct?

Best,
Raphael

Edit: On the refine wiki page, I found this paragraph: "Therefore, you can run bench on the full benchmark's regions (--includebed), and automatically subset to only the regions analyzed by the caller with refine --regions."

ACEnglish commented 1 day ago

Yes, refine does what you've described. The subsetting I was describing applied only to the bench output on its own, without refine.