Parsoa / SVDSS

Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)
MIT License
42 stars 4 forks source link

SVDSS2 output nothing #34

Closed coopsor closed 1 month ago

coopsor commented 1 month ago

Hello, I encountered some bugs while running SVDSS2 according to the Snakefile. The first two steps seem to completed successfully, generating the hg37.fmd(3Gb), smoothed_reads.txt(90Mb), and smoothed.selective.bam(65Gb) files. I then sorted and indexed the smoothed.selective.bam file using samtools. However, when executing the SVDSS search command, it produced an empty file.

The smooth command I used is as follows: SVDSS smooth --reference ./ref/hs37d5.fa --bam ./hifi.bam > ./smoothed.bam

SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: smooth [I] . [I] Loading reference genome from ./ref/hs37d5.fa.. [I] Extracted 1 with 249250621 bases. [I] Extracted 2 with 243199373 bases. [I] Extracted 3 with 198022430 bases. [I] Extracted 4 with 191154276 bases. [I] Extracted 5 with 180915260 bases. [I] Extracted 6 with 171115067 bases. [I] Extracted 7 with 159138663 bases. [I] Extracted 8 with 146364022 bases. [I] Extracted 9 with 141213431 bases. [I] Extracted 10 with 135534747 bases. [I] Extracted 11 with 135006516 bases. [I] Extracted 12 with 133851895 bases. [I] Extracted 13 with 115169878 bases. [I] Extracted 14 with 107349540 bases. [I] Extracted 15 with 102531392 bases. [I] Extracted 16 with 90354753 bases. [I] Extracted 17 with 81195210 bases. [I] Extracted 18 with 78077248 bases. [I] Extracted 19 with 59128983 bases. [I] Extracted 20 with 63025520 bases. [I] Extracted 21 with 48129895 bases. [I] Extracted 22 with 51304566 bases. [I] Extracted X with 155270560 bases. [I] Extracted Y with 59373566 bases. [I] Extracted MT with 16569 bases. [I] Extracted GL000207.1 with 4262 bases. [I] Extracted GL000226.1 with 15008 bases. [I] Extracted GL000229.1 with 19913 bases. [I] Extracted GL000231.1 with 27386 bases. [I] Extracted GL000210.1 with 27682 bases. [I] Extracted GL000239.1 with 33824 bases. [I] Extracted GL000235.1 with 34474 bases. [I] Extracted GL000201.1 with 36148 bases. [I] Extracted GL000247.1 with 36422 bases. [I] Extracted GL000245.1 with 36651 bases. [I] Extracted GL000197.1 with 37175 bases. [I] Extracted GL000203.1 with 37498 bases. [I] Extracted GL000246.1 with 38154 bases. [I] Extracted GL000249.1 with 38502 bases. [I] Extracted GL000196.1 with 38914 bases. [I] Extracted GL000248.1 with 39786 bases. [I] Extracted GL000244.1 with 39929 bases. [I] Extracted GL000238.1 with 39939 bases. [I] Extracted GL000202.1 with 40103 bases. [I] Extracted GL000234.1 with 40531 bases. [I] Extracted GL000232.1 with 40652 bases. [I] Extracted GL000206.1 with 41001 bases. [I] Extracted GL000240.1 with 41933 bases. [I] Extracted GL000236.1 with 41934 bases. [I] Extracted GL000241.1 with 42152 bases. [I] Extracted GL000243.1 with 43341 bases. [I] Extracted GL000242.1 with 43523 bases. [I] Extracted GL000230.1 with 43691 bases. [I] Extracted GL000237.1 with 45867 bases. [I] Extracted GL000233.1 with 45941 bases. [I] Extracted GL000204.1 with 81310 bases. [I] Extracted GL000198.1 with 90085 bases. [I] Extracted GL000208.1 with 92689 bases. [I] Extracted GL000191.1 with 106433 bases. [I] Extracted GL000227.1 with 128374 bases. [I] Extracted GL000228.1 with 129120 bases. [I] Extracted GL000214.1 with 137718 bases. [I] Extracted GL000221.1 with 155397 bases. [I] Extracted GL000209.1 with 159169 bases. [I] Extracted GL000218.1 with 161147 bases. [I] Extracted GL000220.1 with 161802 bases. [I] Extracted GL000213.1 with 164239 bases. [I] Extracted GL000211.1 with 166566 bases. [I] Extracted GL000199.1 with 169874 bases. [I] Extracted GL000217.1 with 172149 bases. [I] Extracted GL000216.1 with 172294 bases. [I] Extracted GL000215.1 with 172545 bases. [I] Extracted GL000205.1 with 174588 bases. [I] Extracted GL000219.1 with 179198 bases. [I] Extracted GL000224.1 with 179693 bases. [I] Extracted GL000223.1 with 180455 bases. [I] Extracted GL000195.1 with 182896 bases. [I] Extracted GL000212.1 with 186858 bases. [I] Extracted GL000222.1 with 186861 bases. [I] Extracted GL000200.1 with 187035 bases. [I] Extracted GL000193.1 with 189789 bases. [I] Extracted GL000194.1 with 191469 bases. [I] Extracted GL000225.1 with 211173 bases. [I] Extracted GL000192.1 with 547496 bases. [I] Extracted NC_007605 with 171823 bases. [I] Extracted hs37d5 with 35477943 bases. [I] Loading first batch.. [I] Dumping smoothed read ids.. [I] Loaded 9894103 reads. [I] Wrote 9031732 reads. [I] Complete. Runtime: 808 seconds.

The search command I used is as follows: SVDSS search --index ./hg37.fmd --bam ./smoothed.selective.bam --threads 16 > ./output.sfs I received the following output:

SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: search [I] . [I] Restoring index.. [I] Done. [I] BAM input: ./smoothed.selective.bam [I] Putative SFS extraction enabled. [I] Loading smoothed read ids.. [I] Loaded 6615068 ignored read ids. [I] Loaded 0 smoothed read ids. [I] Loading first batch.. [I] Extracting SFS strings on 16 threads.. [I] Processed batch 1. Reads so far 20000. Reads per second: 20000. SFS extracted so far: 0. Batches exported: 0. [I] Processed batch 2. Reads so far 30000. Reads per second: 30000. SFS extracted so far: 0. Batches exported: 0. [I] Processed batch 3. Reads so far 40000. Reads per second: 20000. SFS extracted so far: 0. Batches exported: 0. [I] Processed batch 4. Reads so far 50000. Reads per second: 25000. SFS extracted so far: 0. Batches exported: 0. [I] Processed batch 5. Reads so far 60000. Reads per second: 20000. SFS extracted so far: 0. Batches exported: 0. ...

The resulting output.sfs file only contains "0 processed." line, and I encountered a similar issue when using the SVDSS v1.5 downloaded via conda.

I look forward to your reply. Thank you!

ldenti commented 1 month ago

Hi, I think you are mixing the SVDSS v1 binary with the command to run the v2. Indeed, the command you are using are for v2 but the log you sent is from v1. For instance, the command to run smoother v2 is:

./SVDSS smooth --reference 22.fa --bam 22.bam --threads 2 > smoothed.bam
samtools index smoothed.bam

and the log should be something like:

[2024-08-26 14:29:46.712] [stderr] [info] Loading reference genome from input/22.fa..
[2024-08-26 14:29:47.060] [stderr] [info] Smoothing alignments on 2 threads..
Alignments processed so far: 23765. Alignments processed per second: 5941. Time: 41
[2024-08-26 14:29:51.762] [stderr] [info] All done! Runtime: 5 seconds 

Just to be sure, can you please rerun SVDSS on the example data? You can find the instruction here. If everything goes right, you should see different log messages (similar to those I sent you).

So the first thing I'd do is to be sure that the SVDSS binary (e.g., $(which SVDSS) --version if you have it in your path) is the v2 one. How did you install SVDSS v2? Please download the binary from the release page) and try again the commands you sent.

Let me know, Luca

coopsor commented 1 month ago

Hi, I think you are mixing the SVDSS v1 binary with the command to run the v2. Indeed, the command you are using are for v2 but the log you sent is from v1. For instance, the command to run smoother v2 is:

./SVDSS smooth --reference 22.fa --bam 22.bam --threads 2 > smoothed.bam
samtools index smoothed.bam

and the log should be something like:

[2024-08-26 14:29:46.712] [stderr] [info] Loading reference genome from input/22.fa..
[2024-08-26 14:29:47.060] [stderr] [info] Smoothing alignments on 2 threads..
Alignments processed so far: 23765. Alignments processed per second: 5941. Time: 41
[2024-08-26 14:29:51.762] [stderr] [info] All done! Runtime: 5 seconds 

Just to be sure, can you please rerun SVDSS on the example data? You can find the instruction here. If everything goes right, you should see different log messages (similar to those I sent you).

So the first thing I'd do is to be sure that the SVDSS binary (e.g., $(which SVDSS) --version if you have it in your path) is the v2 one. How did you install SVDSS v2? Please download the binary from the release page) and try again the commands you sent.

Let me know, Luca

Hi, I think you are mixing the SVDSS v1 binary with the command to run the v2. Indeed, the command you are using are for v2 but the log you sent is from v1. For instance, the command to run smoother v2 is:

./SVDSS smooth --reference 22.fa --bam 22.bam --threads 2 > smoothed.bam
samtools index smoothed.bam

and the log should be something like:

[2024-08-26 14:29:46.712] [stderr] [info] Loading reference genome from input/22.fa..
[2024-08-26 14:29:47.060] [stderr] [info] Smoothing alignments on 2 threads..
Alignments processed so far: 23765. Alignments processed per second: 5941. Time: 41
[2024-08-26 14:29:51.762] [stderr] [info] All done! Runtime: 5 seconds 

Just to be sure, can you please rerun SVDSS on the example data? You can find the instruction here. If everything goes right, you should see different log messages (similar to those I sent you).

So the first thing I'd do is to be sure that the SVDSS binary (e.g., $(which SVDSS) --version if you have it in your path) is the v2 one. How did you install SVDSS v2? Please download the binary from the release page) and try again the commands you sent.

Let me know, Luca

First of all, thank you for your reply, it was very critical. I indeed used the v1.5 version of SVDSS in the virtual environment by default, which led to the creation of an empty SFS file, although I'm not sure why. Now, I have successfully called SVs for the HG002 sample using SVDSS2. However, when comparing with the dipcall-truthset, I encountered the following truvari error:

File "/home/miniconda3/envs/bio/lib/python3.10/site-packages/truvari/bench.py", line 523, in annotate_entry entry.info["MatchId"] = match.matid File "pysam/libcbcf.pyx", line 2621, in pysam.libcbcf.VariantRecordInfo.setitem File "pysam/libcbcf.pyx", line 704, in pysam.libcbcf.bcf_info_set_value File "pysam/libcbcf.pyx", line 609, in pysam.libcbcf.bcf_check_values TypeError: invalid value for Integer format.

I checked the HG002-dipcall-hg19.vcf.gz file in the truthset and found it different from CHM13-dipcall-hg38.vcf.gz. How can I obtain a truthset format similar to CHM13-dipcall-hg38.vcf.gz to facilitate my validation?

ldenti commented 1 month ago

Hi, since I created those VCFs merging the TPs and FNs output by truvari, I suspect that if you run truvari (a second time) on those, truvari is trying to add the same field twice (hence it won't be an integer anymore as indicated in the VCF header).

You can preprocess the VCF you are interested in with:

cat <(zgrep "^#" HG002-dipcall-hg19.vcf.gz | grep -v "##INFO=") <(zgrep -v "^#" HG002-dipcall-hg19.vcf.gz | awk -F"\t" '{OFS=FS}{ $8="." ; print   }') | bgzip > HG002-dipcall-hg19.clean.vcf.gz
tabix -p vcf HG002-dipcall-hg19.clean.vcf.gz

Then you should be able to run truvari on it.

Let me know,

coopsor commented 1 month ago

According your suggestions, I successfully ran SVDSS2 and verified the results. Thank you very much for your kind helps and suggestions.