genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

Truth_set information for benchmarking #27

Open poddarharsh15 opened 9 months ago

poddarharsh15 commented 9 months ago

Hi, I'm currently working on benchmarking VCF files generated from HG002 data (a test run with just one sample) for SV calling (Manta, Lumpy, GRIDSS, nf-core/sarek) against a truth set. I aligned the BAM files to GRCh38. Any ideas on which truth set I should use to benchmark my results effectively? One point of confusion: can I use the truth sets from SV_0.6/ to benchmark the VCF files (aligned to GRCh38) generated by the SV caller tools? I am using truvari and SVanalyzer for benchmarking. Thank you.

jzook commented 9 months ago

Hi - good question. You generally can only use v0.6 on GRCh37. For GRCh38, we have a published GIAB benchmark for challenging medically relevant genes that includes ~200 SVs (https://rdcu.be/cGwVA). We also have a preliminary draft benchmark from the HG002 T2T v0.9 assembly, which includes more challenging SVs, but we haven't evaluated it much yet, so we recommend curating some of the FPs and FNs: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. We are working on a new HG002-T2T assembly-based SV benchmark that we will evaluate in the coming months.
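If you do benchmark on GRCh37 against v0.6, a minimal truvari bench sketch would look something like the following (the truth file names follow the usual GIAB FTP naming for the Tier 1 v0.6 release, but treat them and the query VCF name as placeholders to adjust for your own download):

        # Tier 1 v0.6 truth set (GRCh37 only); file names assumed from the GIAB FTP layout
        truvari bench \
            -b HG002_SVs_Tier1_v0.6.vcf.gz \
            -c your_caller_GRCh37.vcf.gz \
            --includebed HG002_SVs_Tier1_v0.6.bed \
            --passonly \
            -o bench_v0.6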

poddarharsh15 commented 9 months ago

Hello again, and thank you for answering. I was curious to know whether I can benchmark my .vcf files against the v1.0 T2T-HG002 assembly that was recently published on the GIAB website.

nate-d-olson commented 9 months ago

Hi, the VCF at the provided link contains the changes that were made between v0.9 and v1.0 of the HG002 Q100 assembly, so it is not suitable for benchmarking. We have a new draft benchmark based on v1.0 of the assembly that we expect to post on the FTP site this week.

poddarharsh15 commented 9 months ago

Hi, thank you for clarifying my previous doubt. However, I'm now more confused. To help you help me, I've attached the results from my latest benchmark using this truth set: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. Could you please review the results? I seem to be encountering a high number of false negatives, and I'm wondering if there might be errors in my approach. delly_summary.json Manta_summary.json

nate-d-olson commented 9 months ago

I took a quick look at your results. Can you provide the truvari command and the specific files on the FTP site you used for benchmarking, to help with interpreting the results? Thanks!

poddarharsh15 commented 9 months ago

Hello, please find the command line and FTP link below.

        truvari bench \
            -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz \
            -c delly_sorted.vcf.gz \
            --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed \
            -o delly_cmrg

Benchmark VCF: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz

jzook commented 9 months ago

Hi @poddarharsh15 - thanks for testing out this draft benchmark. We haven't evaluated it extensively yet, but I do expect that standard short-read-based methods will have much lower recall on this benchmark, because it includes many more challenging SVs, many of them in long tandem repeat regions. You may also want to take a look at the FPs to see whether it makes sense to make the matching parameters in truvari more lenient.
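For example, a sketch of a more lenient run than the truvari defaults (roughly 0.7 sequence/size similarity and 500 bp reference distance in recent versions); the values below are arbitrary starting points rather than a recommendation, and --pctseq was called --pctsim in older truvari releases:

        # loosen sequence/size similarity and reference distance relative to the defaults
        truvari bench \
            -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz \
            -c delly_sorted.vcf.gz \
            --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed \
            --pctseq 0.5 \
            --pctsize 0.5 \
            --refdist 1000 \
            -o delly_lenient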

poddarharsh15 commented 9 months ago

Hi @jzook, I've benchmarked some data on the CMRG genes using truvari. The VCF file is available here: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz. The BAM files are aligned to GRCh38. Could you please provide insights into these results? Thank you. delly.json Manta.json

jzook commented 9 months ago

Hi @poddarharsh15 - The CMRG SVs include many in tandem repeats and some in segmental duplications, so I expect the high FN rate reflects this. I recommend you examine some of the FPs and FNs in IGV with your bam and vcf alongside the CMRG vcf and a long read bam file like https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/HG002.SequelII.merged_15kb_20kb.pbmm2.GRCh38.haplotag.10x.bam and assembly bam files like https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.012-20231107/dipcall_output/
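As a rough sketch, one way to pull a random subset of FN and FP loci out of the truvari output directory for IGV curation (assumes bcftools and the fn.vcf.gz/fp.vcf.gz files that recent truvari versions write; delly_cmrg here stands in for your own output directory):

        # sample 20 false negatives and 20 false positives to inspect in IGV
        bcftools query -f '%CHROM\t%POS\n' delly_cmrg/fn.vcf.gz | shuf -n 20 > fn_to_curate.txt
        bcftools query -f '%CHROM\t%POS\n' delly_cmrg/fp.vcf.gz | shuf -n 20 > fp_to_curate.txt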

poddarharsh15 commented 8 months ago

Hello @jzook, thank you for your previous clarification. I have another question regarding benchmarking. This time, I am working with NA12878/HG001 data at 20-30X coverage. I downloaded multiple FASTQ files, aligned them to the GRCh37 reference genome using bwa-mem, and then used several pipelines to generate my .vcf files. For benchmarking, I'm using truvari bench with the HG002 truth set file. However, I'm consistently seeing very low recall in my results. Could you please provide any advice or suggestions to improve this? Please find a log file attached to this message for reference.

log.txt

nate-d-olson commented 8 months ago

Hi @poddarharsh15, HG001 and HG002 are two separate individuals/genomes. You will want to use VCFs generated from HG002 fastqs/bams when using an HG002 benchmark. Here is a link to HG002 fastqs comparable to the ones you used in your HG001 analysis: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/. Let us know if you have any further questions.

Best! Nate

poddarharsh15 commented 8 months ago

Thank you for your response. I had a bit of confusion regarding the truth files. Since I am benchmarking pipelines designed specifically for structural variants (SVs), not for small indels and SNVs, my understanding is that I can only use HG002 data for this purpose. Is that correct?

nate-d-olson commented 8 months ago

That is correct. We currently only have SV benchmarks for HG002.

poddarharsh15 commented 8 months ago

Thank you so much.

poddarharsh15 commented 8 months ago

Hello, good afternoon. I am currently working on benchmarking insertions (INS) and deletions (DEL) using the HG002 truth set files. However, I am having difficulty benchmarking deletions and insertions separately. Could you kindly advise whether it's worthwhile to benchmark them separately to achieve better precision and recall? I have attached the log file output generated by truvari bench for your reference. Thanks in advance.

Log files from the dysgu pipeline: All_svs.txt DEL_log.txt INS_log.txt

Log files from the Manta pipeline: All_sv.txt DEL.txt INS.txt

nate-d-olson commented 7 months ago

Hi, you will want to benchmark insertions and deletions together, then parse the truvari output to get separate performance metrics for insertions and deletions. You might also want to try truvari refine (https://github.com/ACEnglish/truvari/wiki/refine), which helps account for differences in variant representation and provides more accurate benchmarking results. Truvari refine is compute-intensive (see https://github.com/ACEnglish/truvari/issues/182 and https://github.com/ACEnglish/truvari/issues/175), so it can run for a while or hang. Another option is hap-eval from Sentieon (https://github.com/Sentieon/hap-eval), which seems to better handle comparing complex SVs with different representations. We also find it helpful to manually curate (view in IGV along with bams from multiple technologies) a random subset of FPs and FNs to see whether a discrepancy is an error in the query or truth callset or an artifact of the benchmarking method.
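For illustration, a rough sketch of getting per-type metrics from a single truvari run by splitting the output VCFs on SVTYPE afterwards (assumes bcftools, that SVTYPE is annotated in the records, and truvari 4.x output names; older versions write tp-call.vcf.gz instead of tp-comp.vcf.gz):

        OUT=truvari_out   # directory written by truvari bench

        for TYPE in DEL INS; do
            TPB=$(bcftools view -H -i "INFO/SVTYPE=\"${TYPE}\"" ${OUT}/tp-base.vcf.gz | wc -l)
            TPC=$(bcftools view -H -i "INFO/SVTYPE=\"${TYPE}\"" ${OUT}/tp-comp.vcf.gz | wc -l)
            FN=$(bcftools view -H -i "INFO/SVTYPE=\"${TYPE}\"" ${OUT}/fn.vcf.gz | wc -l)
            FP=$(bcftools view -H -i "INFO/SVTYPE=\"${TYPE}\"" ${OUT}/fp.vcf.gz | wc -l)
            # recall = TP(base)/(TP(base)+FN); precision = TP(comp)/(TP(comp)+FP)
            echo "${TYPE} recall=$(echo "scale=3; ${TPB}/(${TPB}+${FN})" | bc)" \
                 "precision=$(echo "scale=3; ${TPC}/(${TPC}+${FP})" | bc)"
        done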

Hope this helps.

poddarharsh15 commented 7 months ago

Hello, thank you for your response. As I am relatively new to these studies, I am a bit confused about what you mean by getting "separate performance metrics." Does this involve separating deletions and insertions (using bcftools) from the base and call files and then benchmarking them individually? For your reference, I have attached a table generated using several SV callers. I am also using truvari refine to improve the benchmark results on HG002.

Any insights you can provide would be greatly appreciated. bench_GIABv0.6.ods

poddarharsh15 commented 3 months ago

Hi @nate-d-olson, I am running a new benchmarking test using the CMRG truth set with several SV callers and aligners. I have a question about how SVTYPE is described in the .vcf files of the CMRG truth set. Is it necessary to use any special parameters with benchmarking tools such as Truvari and SV-bench? Thank you for your assistance.

(attachment CMRG_1: INFO field lines from the CMRG truth set VCF)

nate-d-olson commented 3 months ago

@poddarharsh15 I have not used SV-bench. I don't use any specific truvari arguments or parameters for the SVTYPE annotation, but here is the truvari bench command I use.

        truvari bench \
            -b {benchmark variants} \
            -c {comparison vcf} \
            --includebed {benchmark regions} \
            --dup-to-ins \
            --pick ac \
            -o {output_dir} 

When comparing the benchmark set to phased variant callsets, truvari refine does a nice job comparing complex variants with different representations. Note that this step can be compute-intensive and slow for whole-genome benchmark sets. It will run on unphased VCFs, but this is not advised, as it is unclear whether it correctly accounts for differences in variant representation.

        truvari refine \
            --recount \
            --use-region-coords \
            --use-original-vcfs \
            --reference {ref} \
            --regions {candidate refine bed (generated by truvari bench)} \
            {output_dir}
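Since refine is most informative on phased callsets, a quick way to check how many genotypes in the comparison VCF are actually phased (a sketch assuming bcftools; comparison.vcf.gz is a placeholder name):

        # count phased ('|') vs unphased ('/') genotype lines for the sample(s)
        bcftools query -f '[%GT\n]' comparison.vcf.gz | grep -c '|'
        bcftools query -f '[%GT\n]' comparison.vcf.gz | grep -c '/'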

We are working on a new assembly-based SV benchmark set. Are you willing to share any of the HG002 SV callsets you have generated for us to use as part of our internal evaluations of the new benchmark set?

poddarharsh15 commented 3 months ago

Thank you @nate-d-olson for your response. I used a similar Truvari command line to the one you mentioned, except that I did not run Truvari refine due to the extended runtime and multiple MAFFT errors. Could you please clarify what you mean by HG002 SV callsets? Are you referring to the .vcf files generated by the SV callers I used in my study?

nate-d-olson commented 3 months ago

Sorry, I was unclear. Yes, I mean the vcfs generated by the different SV-callers you have used. Feel free to email me at nolson@nist.gov to take this offline. Best! Nate

poddarharsh15 commented 3 months ago

Thank you for clarifying. I would be happy to share the .vcf files, but I need to consult with my supervisor first. I will get back to you as soon as possible. Further info: I ran the SV callers on lower-coverage WGS data (150 bp and 250 bp reads), so it will be around 42 VCF files :(