PacificBiosciences / HiFiCNV

Copy number variant caller and depth visualization utility for PacBio HiFi reads

question about benchmark comparison #20

Closed crazysummerW closed 7 months ago

crazysummerW commented 1 year ago

Hello, I am currently using HiFiCNV to test the HG002 sample. Your documentation provides benchmark comparison results, but not the specific analysis commands or the benchmark files. I would like to run a benchmark comparison on my own HG002 results. Do you have any suggestions for this?

Looking forward to your reply. Thanks.

ctsa commented 1 year ago

Assessing HiFiCNV on samples containing large known CNVs is a good way to start an evaluation. Many of these are listed in the Gross et al. CNV paper referenced in our own benchmarking exercise, and additional large CNV cell lines are available (e.g., https://www.coriell.org/0/sections/Search/Sample_Detail.aspx?Ref=GM06918&PgId=166). HG002 contains only a few deletions between 100 and 150 kb, so it isn't an ideal case for benchmarking typical clinical CNV patterns.

The recommended protocol for running HiFiCNV is described in the quickstart guide here:

https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/quickstart.md

As outlined there, an important suggestion is to use an excluded region track -- we provide the regions excluded from the benchmarking exercise in the file cnv.excluded_regions.common_50.hg38.bed.gz and a description of how it was generated here:

https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md#pre-computed-excluded-regions-files

crazysummerW commented 1 year ago

Hi @ctsa, thank you for your response. I may not have expressed myself clearly. I have already run HiFiCNV on HG002 data from the PacBio Revio system, using the hs37d5.fa reference genome. I noticed that you provide results evaluating the performance of HiFiCNV (https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/performance.md). Since my sequencing platform and reference genome differ from yours, I would like to evaluate my own HiFiCNV results on HG002 in terms of recall, precision, and F1 score. Do you have any detailed pipelines or software to share for this purpose?

holtjma commented 1 year ago

@crazysummerW Some of what you're looking for is listed here: https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/performance.md#benchmark-comparison

Here are some templates I've copied out of our snakemake testing pipeline that may provide further details:

Truvari v3.5.0 was used via bioconda, with this command template:

truvari bench \
    --pctsim 0.0 \
    --pctsize 0.5 \
    --pctovl 0.5 \
    --refdist 1000000000 \
    --sizemax 1000000000 \
    --chunksize 1000000000 \
    -b {input.truth_vcf} \
    -c {input.vcf} \
    -f {params.reference} \
    -o {output.out_folder}
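
For intuition about what the --pctsize 0.5 and --pctovl 0.5 thresholds ask of a matched call, here is a minimal Python sketch of the two checks. It is not Truvari's code: the function names and the example coordinates are made up for illustration, and it assumes half-open start/end coordinates on the same chromosome. Setting --pctsim 0.0 turns off sequence-similarity matching, which presumably suits CNV calls that are reported without explicit sequence.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), half-open; an assumption of this sketch


def size_similarity(len_a: int, len_b: int) -> float:
    """Ratio of the smaller to the larger variant length, in [0, 1]."""
    if len_a <= 0 or len_b <= 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Shared bases divided by the longer of the two intervals, in [0, 1]."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])


def passes_thresholds(truth: Interval, call: Interval,
                      pctsize: float = 0.5, pctovl: float = 0.5) -> bool:
    """True if the call satisfies both the size and the overlap threshold."""
    size_ok = size_similarity(truth[1] - truth[0], call[1] - call[0]) >= pctsize
    overlap_ok = reciprocal_overlap(truth, call) >= pctovl
    return size_ok and overlap_ok


if __name__ == "__main__":
    truth = (1_000_000, 1_120_000)         # hypothetical 120 kb deletion
    shifted_call = (1_030_000, 1_150_000)  # same size, shifted by 30 kb
    distant_call = (1_100_000, 1_220_000)  # same size, shifted by 100 kb
    print(passes_thresholds(truth, shifted_call))
    print(passes_thresholds(truth, distant_call))
```

With both thresholds at 0.5, a call has to be within a factor of two of the truth variant's size and share at least half of the longer interval with it.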

Witty.er was run using the docker image docker://jjxu/wittyer with this template:

/opt/Wittyer/Wittyer \
    --truthVcf {input.truth_vcf} \
    --inputVcf {input.vcf} \
    --outputDirectory {output.out_folder} \
    --bpd 10000
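
To get the recall, precision, and F1 asked about above from the resulting match counts, the arithmetic can be sketched as follows. This is a hypothetical helper, not part of either tool, and the counts in the usage lines are invented. Both tools write their own summary statistics to the output folder, so this is mainly useful for cross-checking or for re-aggregating counts across runs. It assumes the convention where true positives are counted separately on the truth side (for recall) and on the call side (for precision).

```python
from typing import Dict


def benchmark_metrics(tp_base: int, tp_call: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from benchmark match counts.

    tp_base: truth variants that were matched by a call
    tp_call: calls that matched a truth variant
    fp:      calls with no matching truth variant
    fn:      truth variants with no matching call
    """
    n_calls = tp_call + fp
    n_truth = tp_base + fn
    precision = tp_call / n_calls if n_calls else 0.0
    recall = tp_base / n_truth if n_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Invented counts, for illustration only.
    metrics = benchmark_metrics(tp_base=8, tp_call=8, fp=2, fn=2)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With a truth set as small as the handful of large HG002 deletions mentioned earlier in this thread, a single extra or missed call moves these numbers a lot, so they are best read alongside the raw counts.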

I think this answers the questions, but let us know if there are some details we missed!

holtjma commented 7 months ago

Closing due to inactivity, feel free to re-open if there are further questions!