poddarharsh15 opened 9 months ago
Hi - good question. You generally can only use v0.6 on GRCh37. For GRCh38, we have a published GIAB benchmark for challenging medically relevant genes that includes ~200 SVs (https://rdcu.be/cGwVA). We also have a preliminary draft benchmark from the HG002 T2Tv0.9 assembly, which includes more challenging SVs, but we haven't evaluated it much yet, so we recommend curating some of the FPs and FNs: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. We are working on a new HG002-T2T assembly-based SV benchmark that we will evaluate in the coming months.
Hello again, and thank you for answering. I was curious to know whether I can benchmark .vcf files against the v1.0 T2T-HG002 assembly that was recently published on the GIAB website.
Hi, The vcf at the provided link contains the changes that were made between v0.9 and v1.0 of the HG002 Q100 assembly, so it is not suitable for benchmarking. We have a new draft benchmark using v1.0 of the assembly that we expect to post on the ftp site this week.
Hi, Thank you for clarifying my previous doubt. However, I'm now more confused. To help you help me, I've attached the results from my latest benchmark using this truth set: [https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/]. Could you please review the results? I seem to be encountering a high number of false negatives, and I'm wondering if there might be errors in my approach. delly_summary.json Manta_summary.json
I took a quick look at your results. Can you provide the truvari command and the specific files on the ftp site you used for benchmarking, to help with interpreting the results? Thanks!
truvari bench -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz -c delly_sorted.vcf.gz --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed -o delly_cmrg
Hello, Please find the command line and ftp link.
Hi @poddarharsh15 - thanks for testing out this draft benchmark. We've not evaluated it extensively yet, but I do expect that standard short-read-based methods will have much lower recall for this benchmark, because it includes many more challenging SVs, many of them in long tandem repeat regions. You also may want to take a look at the FPs to see if it makes sense to make truvari's matching parameters more lenient.
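For reference, loosening the matching parameters might look something like the command below. The file names mirror the earlier command in this thread, and the threshold values are illustrative assumptions to explore, not recommendations; check truvari bench --help for the defaults in your installed version.

```
# Sketch: re-run truvari bench with more lenient matching thresholds.
# --pctseq and --pctsize default to 0.7, --refdist to 500 (truvari 4.x);
# the looser values below are examples only.
truvari bench \
    -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz \
    -c delly_sorted.vcf.gz \
    --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed \
    --pctseq 0.5 \
    --pctsize 0.5 \
    --refdist 1000 \
    -o delly_cmrg_lenient
```

Comparing the TP/FP/FN counts between the default and lenient runs shows how much of the discrepancy is representation differences rather than missed calls.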
delly.json Manta.json Hi @jzook, I've benchmarked some data on CMRG genes using truvari. The VCF file is available here: (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz) The BAM files are aligned to GRCh38. Could you please provide insights into these results? Thank you.
Hi @poddarharsh15 - The CMRG SVs include many in tandem repeats and some in segmental duplications, so I expect the high FN rate reflects this. I recommend you examine some of the FPs and FNs in IGV with your bam and vcf alongside the CMRG vcf and a long read bam file like https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/HG002.SequelII.merged_15kb_20kb.pbmm2.GRCh38.haplotag.10x.bam and assembly bam files like https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.012-20231107/dipcall_output/
Hello @jzook, Thank you for your previous clarification. I have another question regarding benchmarking. This time, I am working with NA12878/HG001 data with 20X-30X coverage. I downloaded multiple fastq files, aligned them to the GRCh37 reference genome using bwa-mem, and then used several pipelines to generate my .vcf files. For benchmarking, I'm using truvari bench with the HG002 truth set file. However, I'm consistently seeing very low recall in my results. Could you please provide any advice or suggestions to improve this? Please find a log file attached to this message for reference.
Hi @poddarharsh15, HG001 and HG002 are two separate individuals/genomes. You will want to use vcfs generated from HG002 fastqs/bams when using an HG002 benchmark. Here is a link to HG002 fastqs comparable to the ones you used in your HG001 analysis: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/. Let us know if you have any further questions.
Best! Nate
Thank you for your response. I had a bit of confusion regarding the truth files. Since I am benchmarking pipelines designed specifically for structural variants (SVs), not for small INDELs and SNVs, my understanding is that I can only use HG002 data for this purpose. Is that correct?
That is correct. We currently only have SV benchmarks for HG002.
Thank you so much.
Hello, good afternoon. I am currently benchmarking insertions (INS) and deletions (DEL) using the HG002 truth set files. However, I am having difficulty benchmarking deletions and insertions separately. Could you kindly advise whether benchmarking them separately is advisable to achieve better precision and recall? I have attached a log file output generated by truvari bench for your reference. Thanks in advance.
Log files from the dysgu pipeline: All_svs.txt DEL_log.txt INS_log.txt
Log files from the Manta pipeline
Hi, You will want to benchmark insertions and deletions together, then parse the truvari output to get separate performance metrics for insertions and deletions. You might also want to try Truvari refine, https://github.com/ACEnglish/truvari/wiki/refine, which helps account for differences in variant representation and provides more accurate benchmarking results. Truvari refine is compute intensive (see https://github.com/ACEnglish/truvari/issues/182 and https://github.com/ACEnglish/truvari/issues/175), so it can run for a while or hang. Another option is hap-eval from Sentieon, https://github.com/Sentieon/hap-eval, which seems to better handle comparing complex SVs with different representations. We also find it helpful to manually curate (view in IGV along with bams from multiple technologies) a random subset of FPs and FNs to see whether a discrepancy is an error in the query or truth callset or an artifact of the benchmarking method.
Hope this helps.
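As a sketch of that post-processing: truvari bench writes tp-base, tp-comp, fn, and fp VCFs into its output directory, and per-SVTYPE recall and precision can be computed by counting records in each. The minimal parser below is an illustration under that assumption, not a full VCF reader.

```python
import re
from collections import Counter

def svtype_counts(vcf_lines):
    """Count records per INFO/SVTYPE in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th column
        m = re.search(r"(?:^|;)SVTYPE=([^;]+)", info)
        if m:
            counts[m.group(1)] += 1
    return counts

def per_type_metrics(tp_base, tp_comp, fn, fp):
    """Recall/precision per SVTYPE from truvari's four output VCFs (as line iterables)."""
    tpb, tpc = svtype_counts(tp_base), svtype_counts(tp_comp)
    fnc, fpc = svtype_counts(fn), svtype_counts(fp)
    metrics = {}
    for svtype in set(tpb) | set(tpc) | set(fnc) | set(fpc):
        base_total = tpb[svtype] + fnc[svtype]   # truth records of this type
        comp_total = tpc[svtype] + fpc[svtype]   # query records of this type
        metrics[svtype] = {
            "recall": tpb[svtype] / base_total if base_total else 0.0,
            "precision": tpc[svtype] / comp_total if comp_total else 0.0,
        }
    return metrics
```

In practice each argument would be an open file handle, e.g. gzip.open("delly_cmrg/tp-base.vcf.gz", "rt"), rather than splitting the VCFs with bcftools before benchmarking.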
Hello, Thank you for your response. As I am relatively new to these studies, I am a bit confused about what you mean by obtaining "separate performance metrics." Does this involve separating deletions and insertions (using bcftools) from the base and call files and then benchmarking them individually? For your reference, I have attached a table generated using several SV callers. I am also using truvari refine to improve benchmark results on HG002.
Any insights you can provide would be greatly appreciated. bench_GIABv0.6.ods
Hi @nate-d-olson, I am running a new benchmarking test using the CMRG truth set with several SV callers and aligners. I have a question about how SVTYPE is described in the .VCF files of the CMRG truth set. Is it necessary to use any special parameters with benchmarking tools such as Truvari and SV-bench? Thank you for your assistance.
@poddarharsh15 I have not used SV-bench. I don't use any specific truvari arguments or parameters for the SVTYPE annotation, but here is the truvari bench command I use.
truvari bench \
-b {benchmark variants} \
-c {comparison vcf} \
--includebed {benchmark regions} \
--dup-to-ins \
--pick ac \
-o {output_dir}
When comparing the benchmark set to phased variant callsets, truvari refine does a nice job comparing complex variants with different representations. Note that this step can be compute intensive and slow for whole-genome benchmark sets. It will run on unphased vcfs, but this is not advised, as it is unclear whether it correctly accounts for differences in variant representation without phasing.
truvari refine \
--recount \
--use-region-coords \
--use-original-vcfs \
--reference {ref} \
--regions {candidate refine bed (generated by truvari bench)} \
{output_dir}
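When comparing runs from several callers, the headline numbers can be pulled from the summary.json that truvari bench writes into its output directory. A minimal sketch; the key names ("precision", "recall", "f1") match truvari 4.x output, which is worth verifying against your installed version.

```python
import json
from pathlib import Path

def summarize(output_dir):
    """Read precision/recall/F1 from a truvari bench output directory."""
    s = json.loads((Path(output_dir) / "summary.json").read_text())
    # Key names as produced by truvari 4.x summary.json (verify for your version).
    return {"precision": s["precision"], "recall": s["recall"], "f1": s["f1"]}

# Example usage, assuming one truvari output directory per caller:
# for d in ["delly_cmrg", "manta_cmrg"]:
#     print(d, summarize(d))
```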
We are working on a new assembly-based SV benchmark set. Are you willing to share any of the HG002 SV callsets you have generated for us to use as part of our internal evaluations of the new benchmark set?
Thank you @nate-d-olson for your response. I used a similar Truvari command line to the one you mentioned, except that I did not run Truvari refine, due to the extended runtime and multiple MAFFT errors. Could you please clarify what you mean by HG002 SV callsets? Are you referring to the .VCF files generated by the SV callers I used in my study?
Sorry, I was unclear. Yes, I mean the vcfs generated by the different SV-callers you have used. Feel free to email me at nolson@nist.gov to take this offline. Best! Nate
Thank you for clarifying. I would be happy to share the .VCF files, but I need to consult with my supervisor first. I will get back to you as soon as possible. Further info: I ran the SV callers on lower-coverage WGS (150 bp and 250 bp) data, so it will be around 42 vcf files :(
Hi, I'm currently benchmarking VCF files generated from HG002 data (a test run with just one sample) by several SV callers (Manta, Lumpy, GRIDSS, nf-core/sarek) against a truth set. I aligned the BAM files to GRCh38. Any ideas on how to effectively benchmark my results, and against which truth set? One point of confusion: can I use the truth sets from SV_0.6/ to benchmark the vcf files (aligned to GRCh38) generated by the SV caller tools? I am using truvari and SVanalyzer for benchmarking. Thank you.