Open stat-lab opened 1 year ago
Hi, I think that most of the steps of SVDSS can theoretically work with error-prone reads but they are not optimized for them.. Especially the smoothing step (first one), by default it expects cleaner reads (ie hifi reads).. But in any case, I'd not blindly trust the results of SVDSS on ONT or CLR reads since we haven't tested it.
How many calls do you have on chr1? Can you please check the file size of smoothed.selective.bam
in the working directory of SVDSS? Is it roughly the same of the input reads .bam
? My suspect (maybe wrong) is that the smoother is filtering a lot of reads since they do not pass the "accuracy check".
Thank you for your response. For NA12878 ONT reads, the file size of the original bam file used as input is 148.5 GB, and the file size of the smoothed.selective.bam after the smoothing step is 99.2 GB. The number of SV calls in the final output svs_poa.vcf was 1,439, which was only for chr1. Although the smoothing step reduced the reads, the likely problematic step might be the final call step. I show the last part of the log of the call step.
[I] Loaded all chromosomes. [I] Loading assmbled SFS.. [I] Loaded 1016408 SFS strings on 240455 reads. [I] Extending superstrings on 4 threads.. [I] Processed batch 1. Alignments so far: 10000. Alignments per second: 10000. Time: 1 [I] Processed batch 2. Alignments so far: 20000. Alignments per second: 20000. Time: 1 [I] Processed batch 3. Alignments so far: 30000. Alignments per second: 30000. Time: 1 [I] Processed batch 4. Alignments so far: 40000. Alignments per second: 40000. Time: 1 [I] Processed batch 5. Alignments so far: 50000. Alignments per second: 50000. Time: 1 [I] Processed batch 6. Alignments so far: 60000. Alignments per second: 60000. Time: 1 [I] Processed batch 7. Alignments so far: 70000. Alignments per second: 35000. Time: 2 [I] Processed batch 8. Alignments so far: 80000. Alignments per second: 40000. Time: 2 [I] Processed batch 9. Alignments so far: 90000. Alignments per second: 45000. Time: 2 [I] Processed batch 10. Alignments so far: 100000. Alignments per second: 50000. Time: 2 [I] Processed batch 11. Alignments so far: 110000. Alignments per second: 36666. Time: 3 [I] Processed batch 12. Alignments so far: 120000. Alignments per second: 40000. Time: 3 [I] Processed batch 13. Alignments so far: 130000. Alignments per second: 43333. Time: 3 [I] Processed batch 14. Alignments so far: 140000. Alignments per second: 46666. Time: 3 [I] Processed batch 15. Alignments so far: 150000. Alignments per second: 37500. Time: 4 [I] Processed batch 16. Alignments so far: 160000. Alignments per second: 40000. Time: 4 [I] Processed batch 17. Alignments so far: 170000. Alignments per second: 42500. Time: 4 [I] Processed batch 18. Alignments so far: 180000. Alignments per second: 45000. Time: 4 [I] Processed batch 19. Alignments so far: 190000. Alignments per second: 38000. Time: 5 [I] Processed batch 20. Alignments so far: 200000. Alignments per second: 40000. Time: 5 [I] Processed batch 21. Alignments so far: 210000. Alignments per second: 42000. Time: 5 [I] Processed batch 22. Alignments so far: 220000. Alignments per second: 36666. Time: 6 [I] Processed batch 23. Alignments so far: 230000. Alignments per second: 38333. Time: 6 [I] Processed batch 24. Alignments so far: 240000. Alignments per second: 4528. Time: 53 [I] Processed batch 25. Alignments so far: 250000. Alignments per second: 4629. Time: 54 [I] Done. [I] Extension complete. 433658 clipped SFS. [I] Maximum extended-SFS length: 19754 bp. Using separation distance 21729. [I] Retrieved 4965 intervals. [I] Flattening interval clusters.. [I] Analyzing 18506 clusters.. [I] Calling SVs from 4201 clusters.. [I] Extracted 1446 SVs. [I] 1446 SVs before chain filtering. [I] 1439 SVs after chain filtering. [I] Predicted 1439 SVs from extended SFS. [I] Outputting POA alignments to /share/project/sg/kosugi/SV_calls/SVDSS/NA12878_ONT/poa.sam.. [I] Exporting 1439 SV calls to /share/project/sg/kosugi/SV_calls/SVDSS/NA12878_ONT/svs_poa.vcf.. [I] Complete. Runtime: 78 seconds.
The batch number of NA12878 ONT was significantly smaller than that for NA12878 HiFi, as is shown below.
. . [I] Processed batch 264. Alignments so far: 2640000. Alignments per second: 40000. Time: 66 [I] Processed batch 265. Alignments so far: 2650000. Alignments per second: 40151. Time: 66 [I] Processed batch 266. Alignments so far: 2660000. Alignments per second: 39701. Time: 67 [I] Processed batch 267. Alignments so far: 2670000. Alignments per second: 39264. Time: 68 [I] Processed batch 268. Alignments so far: 2680000. Alignments per second: 39411. Time: 68 [I] Processed batch 269. Alignments so far: 2690000. Alignments per second: 38985. Time: 69 [I] Processed batch 270. Alignments so far: 2700000. Alignments per second: 38571. Time: 70 [I] Done. [I] Extension complete. 2250130 clipped SFS. [I] Maximum extended-SFS length: 21382 bp. Using separation distance 23520. [I] Retrieved 42241 intervals. [I] Flattening interval clusters.. [I] Analyzing 78209 clusters.. [I] Processed 30336 clusters so far. Clusters per second: 978. Time: 31 [I] Processed 60076 clusters so far. Clusters per second: 968. Time: 62 [I] Calling SVs from 51586 clusters.. [I] Processed 23300 clusters so far. Cluster per second: 751. Time: 31 [I] Processed 41320 clusters so far. Cluster per second: 666. Time: 62 [I] Extracted 30086 SVs. [I] 30086 SVs before chain filtering. [I] 30013 SVs after chain filtering. [I] Predicted 30013 SVs from extended SFS. [I] Outputting POA alignments to /share/project/sg/kosugi/SV_calls/SVDSS/NA12878_HiFi_2/poa.sam.. [I] Exporting 30013 SV calls to /share/project/sg/kosugi/SV_calls/SVDSS/NA12878_HiFi_2/svs_poa.vcf.. [I] Complete. Runtime: 301 seconds.
I'd say that if you have less batches, the reasons could be:
I don't think that the issue is in the call step since batches are the output of previous steps.. It will be easier for me to check what is going on by running SVDSS by myself. Since you are analyzing the NA12878, I suspect that the sample is publicly available. Is this correct? In case, can you please provide me the link to the ONT dataset?
Thanks
The fastq file of NA12878 ONT reads was obtained from the ENA site, https://www.ebi.ac.uk/ena/browser/view/SRR15058167, where SRR15058167_1.fastq.gz is linked. Please let me know what is the cause of this trouble if you could know.
Thank you
Hi, I ran SVDSS on that sample and the final VCF has calls for all chromosomes. I got 93 batches in the working directory. Can you please check how many solution_batch_*.assembled.sfs
do you have in the working directory?
These the commands I ran:
SVDSS smooth --reference /data/hg38/hg38.chroms.fa --bam /data/NA12878/SRR15058167_1.bam --workdir /data/NA12878/SRR15058167_1.svdss --threads 32
SVDSS search --index /data/hg38/hg38.chroms.fa.fmd --bam /data/NA12878/SRR15058167_1.svdss/smoothed.selective.bam --threads 32 --workdir /data/NA12878/SRR15058167_1.svdss/ --assemble
SVDSS call --reference /data/hg38/hg38.chroms.fa --bam /data/NA12878/SRR15058167_1.svdss/smoothed.selective.bam --threads 32 --workdir /data/NA12878/SRR15058167_1.svdss/ --batches 93
This is the log of SVDSS search
:
...
[I] Putative SFS extraction enabled.
[I] Loading smoothed read ids..
[I] Loaded 4748459 ignored read ids.
[I] Loaded 4284787 smoothed read ids.
[I] Loading first batch..
[I] Extracting SFS strings on 32 threads.. [I] Processed batch 907. Reads so far 9033246. Reads per second: 1217. SFS extracted so far: 974340180. Batches exported: 93. Time: 7421
[I] Done.
And this is the log of SVDSS call
...
[I] Loaded all chromosomes.
[I] Loading assmbled SFS..
[I] Loaded 17856362 SFS strings on 4284778 reads.
[I] Extending superstrings on 32 threads..
[I] Processed batch 430. Alignments so far: 4293120. Alignments per second: 31336. Time: 137
[I] Done.
[I] Extension complete. 7715460 clipped SFS.
[I] Maximum extended-SFS length: 29880 bp. Using separation distance 32868.
[I] Retrieved 66856 intervals.
[I] Flattening interval clusters..
[I] Analyzing 326322 clusters..
[I] Processed 255008 clusters so far. Clusters per second: 2742. Time: 93
[I] Calling SVs from 79061 clusters..
[I] Extracted 63627 SVs. ers so far. Cluster per second: 456. Time: 120
[I] 63627 SVs before chain filtering.
[I] 63355 SVs after chain filtering.
[I] Predicted 63355 SVs from extended SFS.
[I] Outputting POA alignments to /data/NA12878/SRR15058167_1.svdss//poa.sam..
[I] Exporting 63355 SV calls to /data/NA12878/SRR15058167_1.svdss//svs_poa.vcf..
[I] Complete. Runtime: 507 seconds.
Let me know
The number of my batch sfs files was 92. I used GRCh37 for reference and did not --batches option in the call command. My commands: SVDSS smooth --reference hs37d5.fa --bam NA12878_nanopore_30x.minimap2.24.bam --workdir /work/NA12878_ONT --threads 4 SVDSS search --index hs37d5.svdss.index --bam smoothed.selective.bam --workdir /work/NA12878_ONT --threads 4 --assemble SVDSS call --reference hs37d5.fa --bam smoothed.selective.bam --workdir /work/NA12878_ONT --threads 4 --min-sv-length 50 --min-cluster-weight 2
My log of SVDSS search: [I] Restoring index.. [I] Done. [I] BAM input: smoothed.selective.bam [I] Putative SFS extraction enabled. [I] Loading smoothed read ids.. [I] Loaded 4759116 ignored read ids. [I] Loaded 4308315 smoothed read ids. [I] Loading first batch.. [I] Extracting SFS strings on 4 threads.. . . [I] Done. [I] Complete. Runtime: 14882 seconds.
How many calls did you get?
Thank you
I ran it on hg38 so this could be the reason of the different batches number. I got 63355 calls, but I'd not fully trust all these calls (I repeat: I've never tested SVDSS on ONT so I don't have any results on its accuracy).
So the issue is the missing --batches
option. By default, SVDSS call
checks only 4 batches and the user must set the number of batches using that option. So, try to run:
SVDSS call --reference hs37d5.fa --bam smoothed.selective.bam --workdir /work/NA12878_ONT --threads 4 --min-sv-length 50 --min-cluster-weight 2 --batches 92
and it should call SVs from each chromosome.
Let me know if this fixes the issue.
Note to myself: improve CLI and remove batches option.
According to your suggestion, I conducted SVDSS call command with the option --batches 92. The run was successfully completed to yield about 29,000 SVs (>= 50 bp). As you suggested, these SVs may not confident ones, but I will try to examine the accuracy, etc. Thank you very much for your kind helps and suggestions.
Thank you for providing the excellent tool. I tried to use HG002 HiFi and ONT long reads for running SVDSS. Although SVDSS output a complete set of SVs when using HiFi reads, SVDSS using ONT reads resulted in generating SVs for only chr1 without any error messages. Can SVDSS use error-containing long reads such as ONT and PacBio CLR reads?