Parsoa / SVDSS

Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)
MIT License
42 stars 4 forks source link

Empty reads and sfs file #10

Open mnshgl0110 opened 2 years ago

mnshgl0110 commented 2 years ago

Hi,

I am trying to use SVDSS, but it seems that it is not working for me.

refgen=/<FOLDER_PATH>/data/humans/GRCh38_latest_genomic.fna.gz
SVDSS index --threads 50 --fastq $refgen --index refgen.bwt &
SVDSS smooth --bam ../../reads10kb.wm2.merged.bam --workdir . --reference $refgen --threads 50

Here, ../../reads10kb.wm2.merged.bam is sorted and indexed.

The SVDSS smooth command finishes without any obvious error message and I get a smoothed.selective.bam file but the smoothed_reads.txt and the ignored_reads.txt files are empty. Running :

SVDSS search --threads 50 --assemble --index refgen.bwt --bam smoothed.selective.bam --workdir .

produces an empty solution_batch_0.assembled.sfs file. Finally, running:

SVDSS call --threads 50 --reference $refgen --bam smoothed.selective.bam --workdir . --batches 0

results in segmentation fault. Could you please let me know how to fix this?

Best Manish

ldenti commented 2 years ago

Hi, have you sorted and indexed also the smoothed.selective.bam? If not, this can create some issues. But I'm not sure if sorting and indexing the .bam will solve yours.. Try it and let me know,

Moreover, have you run the example successfully?

Best, Luca

mnshgl0110 commented 2 years ago

Hi Luca,

The test data finished without errors.

I re-sorted and re-indexed the smoothed.selective.bam file, but the SVDSS search command is still not finding any SFS and output the log message: Processed batch 351. Reads so far 4110758. Reads per second: 17950. SFS extracted so far: 0. Batches exported: 0.

Also, as I mentioned in the original post, the SVDSS smooth command is producing empty *_reads.txt files. Maybe that provides some hint on what exactly is happening?

Best Manish

ldenti commented 2 years ago

I see. Can you please provide the full log of the smoothing and searching steps?

Moreover, can you please check and report here the file size of the original bam and the one produced by the smoothing procedure?

Finally, in case I cannot figure out the issue, would it be possible to you sharing the data somehow?

Best, Luca

mnshgl0110 commented 2 years ago

Hi Luca,

Here is the output from the smoothing and searching step:

(mgpy3.8) 18:18 goel@dell-node-12 netscratch:svdss$ SVDSS smooth --bam sample.bam --workdir . --reference GRCh38_latest_genomic.fna.gz --threads 60
Ping-pong, comparative genome analysis using sample-specific string detection in accurate long reads.
[I] . 
[I] Loading reference genome from GRCh38_latest_genomic.fna.gz.. 
[I] Loading first batch.. 
[I] Processed batch 11. Reads so far 86747. Reads per second: 14457. Time: 6
[I] Processed bases: 1, num mismatch: 0, mismatch rate: 0, ignored reads: 0
[I] Done. 
[I] Dumping smoothed read ids.. 
[I] Loaded 86747 reads. 
[I] Wrote 86747 reads. 
[I] Complete. Runtime: 35 seconds. 

(mgpy3.8) 18:27 goel@dell-node-12 netscratch:svdss$ SVDSS search --threads 60 --assemble --index refgen.bwt --bam smoothed.selective.sorted.bam --workdir .
Ping-pong, comparative genome analysis using sample-specific string detection in accurate long reads.
[I] . 
[I] Restoring index.. 
[I] Done. 
[I] BAM input: smoothed.selective.sorted.bam 
[I] Putative SFS extraction enabled. 
[I] Loading smoothed read ids.. 
[I] Loaded 0 ignored read ids. 
[I] Loaded 0 smoothed read ids. 
[I] Loading first batch.. 
[I] Extracting SFS strings on 60 threads.. 
[I] Processed batch 9. Reads so far 86747. Reads per second: 21686. SFS extracted so far: 0. Batches exported: 1. Time: 4
[I] Done. 
0 processed.
[I] Complete. Runtime: 8 seconds. 

BAM size:

(mgpy3.8) 18:22 goel@dell-node-12 netscratch:svdss$ ll *bam
-rwxrwx--- 1 goel grp_schneeberger 639M Jun  9 12:20 sample.bam
-rwxrwx--- 1 goel grp_schneeberger 632M Jun 13 18:19 smoothed.selective.bam

Link to download the sample bam file: https://websafe.mpipz.mpg.de/d/0O5BZFBg6X/sample.bam

The reads are for human HG002.

Thanks Manish

ldenti commented 2 years ago

Dear Manish, I fixed the issue. You can find the updated code in the new branch chrnamefix. Please try it and let me know.

Just a note: we had a check to filter unwanted contigs and retain only chromosomes sequences. But it works only with chr1/1 naming convention and not with RefSeq naming convention.. Therefore, without that filter all input contigs (i.e., alt or random contigs) are considered and may interfere with the analysis (in case some reads are mapped to them).

Best, Luca

ldenti commented 2 years ago

Branch merged to master (check v1.0.4).

Closing since it works (at least on data I downloaded from ncbi following RefSeq naming conventions). Reopen please if error persists.

riasc commented 1 year ago

Hi, I'm kinda having the same problem. Testdata works fine. But not my PacBio dataset. Followed the snakefile script. Smoothed BAM is sorted and indexed. As I understand, this is most likely a naming mismatch? Thanks

ldenti commented 1 year ago

Hi, could be :)

Is the smoothed bam non-empty? In the working directory you should find several files named solution_batch_{i}.assembled.sfs, do you have those? Moreover, there should also be a file named smoothed_reads.txt, is it empty? Can you please send the log of the snakemake execution (it should contain the stderr from SVDSS) - you should find it in the .snakemake directory.

Just to be sure, can you also please check the (smoothed) .bam header and the reference .fai first column and send them here?

samtools view -H [.bam]
cut -f1 [.fai]
riasc commented 1 year ago

Hi, sorry for the late reply. This was a problem with my data. However, now it runs through but there is no .vcf file generated. What am I missing? Thanks for your help

SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: call [I] /projects/results/ [I] PingPong SV Caller running on 30 threads.. [I] Loading reference genome from /GRCh38_chr1-4.fa.. [I] Extracted chr1 with 230481012 bases. [I] Extracted chr2 with 240548228 bases. [I] Extracted chr3 with 198100135 bases. [I] Extracted chr4 with 189752667 bases. [I] Loaded all chromosomes. [I] Loading assmbled SFS.. [I] Loaded 1212 SFS strings on 124 reads. [I] Extending superstrings on 30 threads.. [I] Processed batch 1. Alignments so far: 9990. Alignments per second: 9990. Time: 1 [I] Done. [I] Extension complete. 249 clipped SFS.

ldenti commented 1 year ago

It's strange since the log is incomplete.. This is an example of what you should see:

SVDSS, Structural Variant Discovery from Sample-specific Strings.
Mode: call
[I] svdss-output 
[I] PingPong SV Caller running on 4 threads.. 
[I] Loading reference genome from input/22.fa.. 
[I] Extracted chr22 with 50818468 bases. 
[I] Loaded all chromosomes. 
[I] Loading assmbled SFS.. 
[I] Loaded 20884 SFS strings on 5402 reads. 
[I] Extending superstrings on 4 threads.. 
[I] Processed batch 1. Alignments so far: 10000. Alignments per second: 10000. Time: 1
[I] Done. 
[I] Extension complete. 2194  clipped SFS. 
[I] Maximum extended-SFS length: 8800 bp. Using separation distance 9680. 
[I] Retrieved 850  intervals. 
[I] Flattening interval clusters.. 
[I] Analyzing 1567 clusters.. 
[I] Calling SVs from 1044 clusters.. 
[I] Extracted 957 SVs. 
[I] 957 SVs before chain filtering. 
[I] 956 SVs after chain filtering. 
[I] Predicted 956 SVs from extended SFS. 
[I] Outputting POA alignments to svdss-output/poa.sam.. 
[I] Exporting 956 SV calls to svdss-output/svs_poa.vcf.. 
[I] Complete. Runtime: 4 seconds.

Did it crash? If so, I'd expect some sort of error message.. But without that, it's not easy to troubleshoot the problem.. There is no .vcf in the /projects/results/ folder?

I see that your log says

Processed batch 1. Alignments so far: 9990. Alignments per second: 9990. Time: 1

and not 10000.. but I'm not sure if that caused the problem.

I tried with a very small input and I still got 10000:

...
[I] Loaded all chromosomes. 
[I] Loading assmbled SFS.. 
[I] Loaded 193 SFS strings on 28 reads. 
[I] Extending superstrings on 4 threads.. 
[I] Processed batch 1. Alignments so far: 10000. Alignments per second: 10000. Time: 1
[I] Done.
...

Would it be possible to share the data you are using? This will help me a lot.