Closed lixiang117423 closed 1 year ago
Thank you very much for developing such an excellent software. I am doing rice pan-genome related work and I am using SVDSS to detect structural variation in rice for the past two days. I encountered a very strange thing: when I extract SFS with bam file after smooth, I can't extract SFS successfully, but when I extract SFS with fastq file directly, I can get a lot of SFS. is this step of smooth necessary? Or can be omitted? The version of SVDSS is v1.0.5 (from Conda), the comparison software is minimap2 (version 2.26-r1175), and samtools version is version 1.17. Thanks!
The smooth log ~/software/svdss/SVDSS_linux_x86-64 search --assemble --threads 70 --index refer.genome.fmd --bam smoothed.selective.bam SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: search [I] . [I] Restoring index.. [I] Done. [I] BAM input: smoothed.selective.bam [I] Putative SFS extraction enabled. [I] Loading smoothed read ids.. [I] Loaded 1847747 ignored read ids. [I] Loaded 1899 smoothed read ids. [I] Loading first batch.. [I] Extracting SFS strings on 70 threads.. [I] Processed batch 189. Reads so far 1849646. Reads per second: 17126. SFS extracted so far: 1092666. Batches exported: 1. Time: 108 [I] Done. 0 processed. [I] Complete. Runtime: 110 seconds.
Hi,
that's quite strange indeed. I don't see anything wrong in the log you sent, and I'd assume that it worked correctly and you should have a single .sfs
file since it exported 1 batch (with 1092666 SFSs). This is the line:
[I] Processed batch 189. Reads so far 1849646. Reads per second: 17126. SFS extracted so far: 1092666. Batches exported: 1. Time: 108
The issue I see is: a lot of alignments have been tagged as "ignored" by the smoother and not used for SFS extraction
Loaded 1847747 ignored read ids.
This may happen for two reasons:
Honestly, I don't know if this may be an issue or not - it mainly depends on what you expect in your settings.
Anyway, since it's quite fast, I'd ask you to rerun the search step on the smoothed bam setting the work directory and then check again the .sfs
file created in that folder - just to see if I'm correct in saying that you should have at least few SFS:
~/software/svdss/SVDSS_linux_x86-64 search --assemble --threads 70 --index refer.genome.fmd --bam smoothed.selective.bam --workdir GIT-TEST
# ...
ll -h GIT-TEST/*.sfs
wc -l GIT-TEST/*.sfs
On a side note, if you have a BAM, searching SFS from it is the way to go since smoothing will remove a lot of "useless" SFS and will increase precision and make everything faster.
Let me know,
I tested with both PacBio Hi-Fi data and ONT data, neither of which successfully extracted SFS and SV. If I use fastq file to extract SFS directly, I can get about 300 SFS, and I can get about 2,000 SVs. However, the SVs in the article I referenced are negotiated at about 15,000 (Ref: Qin P, Lu H, Du H, et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations[J]. Cell, 2021, 184(13): 3542-3558. e16.) I also ran the code as you suggested and got a batch file, but the file is empty and has no information. I'm not sure if it's a problem with my data, I'll try again next week when I get my own sequencing data (PacBio Revio) and let you know the results. Thank you and your team very much.
I see.. If you got a batch, but it's empty, I assume that there is a bug somewhere.. Things that you can try, if you have time:
Thanks
this is the output log for 1 threads: SVDSS search --threads 1 --index refer.genome.fmd --bam smoothed.selective.bam SVDSS, Structural Variant Discovery from Sample-specific Strings. Mode: search [I] . [I] Restoring index.. [I] Done. [I] BAM input: smoothed.selective.bam [I] Putative SFS extraction enabled. [I] Loading smoothed read ids.. [I] Loaded 1847747 ignored read ids. [I] Loaded 1899 smoothed read ids. [I] Loading first batch.. [I] Extracting SFS strings on 1 threads.. [I] Processed batch 187. Reads so far 1849646. Reads per second: 20782. SFS extracted so far: 1091680. Batches exported: 1. Time: 89 [I] Done. 0 processed. [I] Complete. Runtime: 92 seconds.
and the solution_batch_0.sfs file is not empty, about 15M.
And I have used the alpha version to all steps. At the call step, 3 SVs were callled. the log is: [2023-10-06 10:08:56.792] [stderr] [info] Loading reference genome from nip.t2t.fa.. [2023-10-06 10:08:58.524] [stderr] [info] Loading SFSs from specifics.txt.. [2023-10-06 10:08:58.528] [stderr] [info] Loaded 4067 SFSs from 1837 reads. [2023-10-06 10:08:58.539] [stderr] [info] Placing SFSs on reference genome [2023-10-06 10:09:15.425] [stderr] [info] 12/1834/1741 unplaced SFSs. 0 erroneus SFSs. 0 clipped SFSs. [2023-10-06 10:09:15.425] [stderr] [info] Clustering 407 SFSs.. [2023-10-06 10:09:15.425] [stderr] [info] Maximum extended SFS length: 151bp. Using separation distance: 166bp. [2023-10-06 10:09:15.426] [stderr] [info] Extending 405 clusters.. [2023-10-06 10:09:16.146] [stderr] [info] Filtered 0 SFSs. Filtered 287 clusters. Filtered 0 global clusters. [2023-10-06 10:09:16.146] [stderr] [info] Calling SVs from 405 clusters.. [2023-10-06 10:09:16.147] [stderr] [info] 3 SVs before chain filtering. [2023-10-06 10:09:16.147] [stderr] [info] Writing 3 SVs. [2023-10-06 10:09:16.452] [stderr] [info] All done! Runtime: 20 seconds
And a lot of warning messages like: [2023-10-06 10:20:31.767] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:32.246] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:34.114] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:34.550] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:37.154] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check [2023-10-06 10:20:37.408] [stderr] [warning] Alignment filtered due to l_qseq. Why are we here? Please check
Does this mean there is something wrong with my bam file? The raw fastq file is produced by PacBio Hi-Fi.
By the way, I did a test with human genome data and all steps work fine.
The rice genome I used is http://www.ricesuperpir.com/uploads/common/gene_annotation/NIP-T2T.gff3.gz, and the example is ftp://download.big.ac.cn/gsa/CRA004087/CRR279366/CRR279366.fastq.gz.
Please try it if you have the time. Thanks.
I will try more data of plant genome and PacBio file, and let you konw.
I remember I added that if with that print but I never encountered it.. Thanks for the data, I'll definitely check them out
Just to be sure, is this the reference genome you are using: http://www.ricesuperpir.com/uploads/common/genome_sequence/NIP-T2T.fa.gz ?
Yes, the rice T2T genome.
Hi, I had some time to check the sample you were using (CRR279366). I noticed that the sample comes from a Sequel system and error rate is much higher than HiFi reads we tested SVDSS on (e.g. Sequel II).
Here some numbers:
33/1364850 with min_acc=0.02
123647/1241296 with min_acc=0.05
1145688/219255 with min_acc=0.075
1360109/4834 with min_acc=0.1
This means that only 33 alignments have less than 2% of errors (computed in the smoothing step as number of mismatches over number of matches and mismatches). There is a parameter (--min-acc
) to tweak this threshold and retain more alignments for SVs calling, but I would not trust these calls since SVDSS was tested on more accurate reads. If you manage to get some PacBio Revio data and test SVDSS on those, it would be great.
Moreover, I noticed that a lot of alignments have low MAPQ (most of them have MAPQ=0, I used minimap2 to align the sample). Is this somehow expected?
Hey, I had the PacBio Revio data last week, and I have tried the SVDSS pipeline.
Firstly, pnmm2 was used to align the bam file produced by the Revio system to the reference genome and then sort the bam file. Then, I tried the SVDSS v1.0.5 fix, but it did not work. There were no SFS files produced. But the V2 version is OK. I have finally gotten the SVs from my samples.
Great software and pipeline, thanks!
Thank you very much for developing such an excellent software. I am doing rice pan-genome related work and I am using SVDSS to detect structural variation in rice for the past two days. I encountered a very strange thing: when I extract SFS with bam file after smooth, I can't extract SFS successfully, but when I extract SFS with fastq file directly, I can get a lot of SFS. is this step of smooth necessary? Or can be omitted? The version of SVDSS is v1.0.5 (from Conda), the comparison software is minimap2 (version 2.26-r1175), and samtools version is version 1.17. Thanks!