Parsoa / SVDSS

Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)
MIT License

Precision in HG002 #36

Open aysegokce opened 4 weeks ago

aysegokce commented 4 weeks ago

Hello,

We’re benchmarking against HG002 within Tier 1 regions using the HiFi sequel data from HPRC. We’re getting good recall but low precision. Do you have any parameter suggestions on how to improve precision?

Here is our result:

| TP | FN | FP | Precision | Recall | F1 |
| -- | -- | -- | -- | -- | -- |
| 9387 | 259 | 2890 | 0.76460047 | 0.97314949 | 0.8563609 |

We are using the latest conda version and running it with default parameters as follows:

```
SVDSS smooth --reference human_g1k_v37.fasta --bam HG002_hg19.bam --threads 32 > smoothed.bam
SVDSS search --index human_g1k_v37.fmd --bam smoothed.bam --threads 32 > specifics.txt
SVDSS call --reference human_g1k_v37.fasta --bam smoothed.bam --sfs specifics.txt --threads 32 > svdss.vcf
```

We are using truvari (v4.1.0) for the benchmarking:

```
truvari bench -c svdss.vcf.gz -b HG002_SVs_Tier1_v0.6.vcf.gz --typeignore --dup-to-ins -p 0 -s 50 -S 0 --sizemax 100000000 -o truvari/svdss --passonly --includebed HG002_SVs_Tier1_v0.6.bed
```

Best,
Ayse
ldenti commented 3 weeks ago

Hi, the first thing that comes to mind is to change `--min-cluster-weight` when calling SVs. By default it's 2. You could try increasing it a bit (4 was good, IIRC, when coverage was 30x, but I'm not sure). It should increase precision while lowering recall (though not by much, I'd expect).
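Concretely, only the `call` step needs to be rerun with the stricter threshold; the smoothed BAM and the specific strings from `search` can be reused. A sketch, assuming the same inputs as above (the output name `svdss.w4.vcf` and the value 4 are just illustrative):

```shell
# Rerun only the calling step with a higher minimum cluster weight
# (default is 2); smoothed.bam and specifics.txt are reused as-is.
SVDSS call \
    --reference human_g1k_v37.fasta \
    --bam smoothed.bam \
    --sfs specifics.txt \
    --min-cluster-weight 4 \
    --threads 32 > svdss.w4.vcf
```

Then re-benchmark the new VCF with the same truvari command to compare precision/recall against the default run.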

Which data did you use from this page? If you point me to it, I could give it a try and let you know.

Best, Luca

aysegokce commented 3 weeks ago

Thank you, I'll try that. For the HiFi data, we merged the 15kb and 20kb datasets and used GRCh37 as the reference.

Best,
Ayse