empty SFS with smoothed bam file

lixiang117423 commented 1 year ago

Thank you very much for developing such an excellent software. I am doing rice pan-genome related work and I am using SVDSS to detect structural variation in rice for the past two days. I encountered a very strange thing: when I extract SFS with bam file after smooth, I can't extract SFS successfully, but when I extract SFS with fastq file directly, I can get a lot of SFS. is this step of smooth necessary? Or can be omitted? The version of SVDSS is v1.0.5 (from Conda), the comparison software is minimap2 (version 2.26-r1175), and samtools version is version 1.17. Thanks!

lixiang117423 commented 1 year ago

Thank you very much for developing such an excellent software. I am doing rice pan-genome related work and I am using SVDSS to detect structural variation in rice for the past two days. I encountered a very strange thing: when I extract SFS with bam file after smooth, I can't extract SFS successfully, but when I extract SFS with fastq file directly, I can get a lot of SFS. is this step of smooth necessary? Or can be omitted? The version of SVDSS is v1.0.5 (from Conda), the comparison software is minimap2 (version 2.26-r1175), and samtools version is version 1.17. Thanks!

The smooth log ~/software/svdss/SVDSS_linux_x86-64 search --assemble --threads 70 --index refer.genome.fmd --bam smoothed.selective.bam SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: search [I] . [I] Restoring index.. [I] Done. [I] BAM input: smoothed.selective.bam [I] Putative SFS extraction enabled. [I] Loading smoothed read ids.. [I] Loaded 1847747 ignored read ids. [I] Loaded 1899 smoothed read ids. [I] Loading first batch.. [I] Extracting SFS strings on 70 threads.. [I] Processed batch 189. Reads so far 1849646. Reads per second: 17126. SFS extracted so far: 1092666. Batches exported: 1. Time: 108 [I] Done. 0 processed. [I] Complete. Runtime: 110 seconds.

ldenti commented 1 year ago

Hi, that's quite strange indeed. I don't see anything wrong in the log you sent, and I'd assume that it worked correctly and you should have a single .sfs file since it exported 1 batch (with 1092666 SFSs). This is the line:

[I] Processed batch 189. Reads so far 1849646. Reads per second: 17126. SFS extracted so far: 1092666. Batches exported: 1. Time: 108

The issue I see is: a lot of alignments have been tagged as "ignored" by the smoother and not used for SFS extraction

Loaded 1847747 ignored read ids.

This may happen for two reasons:

alignments are very dirty (e.g., ONT reads) - What kind of data (sequencing technology) are you using?
alignments do not exhibit any SV - how many SVs do you expect?

Honestly, I don't know if this may be an issue or not - it mainly depends on what you expect in your settings.

Anyway, since it's quite fast, I'd ask you to rerun the search step on the smoothed bam setting the work directory and then check again the .sfs file created in that folder - just to see if I'm correct in saying that you should have at least few SFS:

~/software/svdss/SVDSS_linux_x86-64 search --assemble --threads 70 --index refer.genome.fmd --bam smoothed.selective.bam --workdir GIT-TEST
# ...
ll -h GIT-TEST/*.sfs
wc -l GIT-TEST/*.sfs

On a side note, if you have a BAM, searching SFS from it is the way to go since smoothing will remove a lot of "useless" SFS and will increase precision and make everything faster.

Let me know,

lixiang117423 commented 1 year ago

I tested with both PacBio Hi-Fi data and ONT data, neither of which successfully extracted SFS and SV. If I use fastq file to extract SFS directly, I can get about 300 SFS, and I can get about 2,000 SVs. However, the SVs in the article I referenced are negotiated at about 15,000 (Ref: Qin P, Lu H, Du H, et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations[J]. Cell, 2021, 184(13): 3542-3558. e16.) I also ran the code as you suggested and got a batch file, but the file is empty and has no information. I'm not sure if it's a problem with my data, I'll try again next week when I get my own sequencing data (PacBio Revio) and let you know the results. Thank you and your team very much.

ldenti commented 1 year ago

I see.. If you got a batch, but it's empty, I assume that there is a bug somewhere.. Things that you can try, if you have time:

run the search step with 1 thread only
try the new alpha version (both smooth and search step)

Thanks

lixiang117423 commented 1 year ago

this is the output log for 1 threads: SVDSS search --threads 1 --index refer.genome.fmd --bam smoothed.selective.bam SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: search [I] . [I] Restoring index.. [I] Done. [I] BAM input: smoothed.selective.bam [I] Putative SFS extraction enabled. [I] Loading smoothed read ids.. [I] Loaded 1847747 ignored read ids. [I] Loaded 1899 smoothed read ids. [I] Loading first batch.. [I] Extracting SFS strings on 1 threads.. [I] Processed batch 187. Reads so far 1849646. Reads per second: 20782. SFS extracted so far: 1091680. Batches exported: 1. Time: 89 [I] Done. 0 processed. [I] Complete. Runtime: 92 seconds.

and the solution_batch_0.sfs file is not empty, about 15M.

And I have used the alpha version to all steps. At the call step, 3 SVs were callled. the log is: [2023-10-06 10:08:56.792] [stderr] [info] Loading reference genome from nip.t2t.fa.. [2023-10-06 10:08:58.524] [stderr] [info] Loading SFSs from specifics.txt.. [2023-10-06 10:08:58.528] [stderr] [info] Loaded 4067 SFSs from 1837 reads. [2023-10-06 10:08:58.539] [stderr] [info] Placing SFSs on reference genome [2023-10-06 10:09:15.425] [stderr] [info] 12/1834/1741 unplaced SFSs. 0 erroneus SFSs. 0 clipped SFSs. [2023-10-06 10:09:15.425] [stderr] [info] Clustering 407 SFSs.. [2023-10-06 10:09:15.425] [stderr] [info] Maximum extended SFS length: 151bp. Using separation distance: 166bp. [2023-10-06 10:09:15.426] [stderr] [info] Extending 405 clusters.. [2023-10-06 10:09:16.146] [stderr] [info] Filtered 0 SFSs. Filtered 287 clusters. Filtered 0 global clusters. [2023-10-06 10:09:16.146] [stderr] [info] Calling SVs from 405 clusters.. [2023-10-06 10:09:16.147] [stderr] [info] 3 SVs before chain filtering. [2023-10-06 10:09:16.147] [stderr] [info] Writing 3 SVs. [2023-10-06 10:09:16.452] [stderr] [info] All done! Runtime: 20 seconds

And a lot of warning messages like: [2023-10-06 10:20:31.767] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:32.246] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:34.114] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:34.550] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:37.154] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:37.408] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check

Does this mean there is something wrong with my bam file? The raw fastq file is produced by PacBio Hi-Fi.

lixiang117423 commented 1 year ago

By the way, I did a test with human genome data and all steps work fine.

The rice genome I used is http://www.ricesuperpir.com/uploads/common/gene_annotation/NIP-T2T.gff3.gz, and the example is ftp://download.big.ac.cn/gsa/CRA004087/CRR279366/CRR279366.fastq.gz.

Please try it if you have the time. Thanks.

lixiang117423 commented 1 year ago

I will try more data of plant genome and PacBio file, and let you konw.

ldenti commented 1 year ago

I remember I added that if with that print but I never encountered it.. Thanks for the data, I'll definitely check them out

ldenti commented 1 year ago

Just to be sure, is this the reference genome you are using: http://www.ricesuperpir.com/uploads/common/genome_sequence/NIP-T2T.fa.gz ?

lixiang117423 commented 1 year ago

Yes, the rice T2T genome.

ldenti commented 1 year ago

Hi, I had some time to check the sample you were using (CRR279366). I noticed that the sample comes from a Sequel system and error rate is much higher than HiFi reads we tested SVDSS on (e.g. Sequel II).

Here some numbers:

33/1364850      with min_acc=0.02
123647/1241296  with min_acc=0.05
1145688/219255  with min_acc=0.075 
1360109/4834    with min_acc=0.1

This means that only 33 alignments have less than 2% of errors (computed in the smoothing step as number of mismatches over number of matches and mismatches). There is a parameter (--min-acc) to tweak this threshold and retain more alignments for SVs calling, but I would not trust these calls since SVDSS was tested on more accurate reads. If you manage to get some PacBio Revio data and test SVDSS on those, it would be great.

Moreover, I noticed that a lot of alignments have low MAPQ (most of them have MAPQ=0, I used minimap2 to align the sample). Is this somehow expected?

lixiang117423 commented 1 year ago

Hey, I had the PacBio Revio data last week, and I have tried the SVDSS pipeline.

Firstly, pnmm2 was used to align the bam file produced by the Revio system to the reference genome and then sort the bam file. Then, I tried the SVDSS v1.0.5 fix, but it did not work. There were no SFS files produced. But the V2 version is OK. I have finally gotten the SVs from my samples.

Great software and pipeline, thanks!

Parsoa / SVDSS

empty SFS with smoothed bam file #29