Open LYC-vio opened 1 year ago
Hi, unfortunately I don't have any statistics on this. I'd say that time complexity should be linear, but you never know until you try it 😄 I could run some tests and get back to you, or maybe you can check the numbers on this table (ropebwt2 rows).
You may also find more information in our previous paper: in the supplementary material I see that it needed 20 hours to index a ~30x PacBio HiFi sample.
Just a note: the number of threads used by the indexing step is fixed at 4 and cannot be changed. Moreover, when you index a reference genome (so limited number of entries in the .fasta), I think that the current version uses a single thread (indeed, you should see a warning like "Turn off parallelization for this batch as too few strings are left.")
Thank you!
Do you mean that `--threads` does not change the actual number of threads SVDSS uses for the index step? Or are you referring to the index thread settings in PingPong?
May I also ask why the thread number is fixed to 4 for the index step? Is that due to memory limit or something else?
> you can check the numbers on this table (ropebwt2 rows).
Thanks! I'll check it out
Really appreciate your timely responses
It is somehow related to the ropebwt2 implementation we use. I tried to dig into that some time ago and I found out that it was using 4 additional threads (check here: https://github.com/lh3/ropebwt2/blob/bd8dbd3db2e9e3cff74acc2907c0742c9ebbf033/mrope.c#L287). I don't recall the details now, but it was something like one thread per nucleotide. Take this with a grain of salt, though.
Don't know if this may help, I just simulated some 100bp-long read samples and these are the results:
#reads | File Size | Time (s) | RAM | Index Size |
---|---|---|---|---|
131 072 | 32M | 2 | 65M | 13M |
524 288 | 125M | 8 | 239M | 44M |
1 048 576 | 249M | 15 | 454M | 81M |
4 194 304 | 998M | 55 | 1.6G | 244M |
8 388 608 | 2.0G | 120 | 3.1G | 409M |
33 554 432 | 7.9G | 586 | 12.1G | 1.4G |
Growth seems almost linear but I don't know if we can fully trust these results 😃
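To make the "almost linear" reading a bit more concrete, here is a rough sketch that computes seconds per G for each row of the table and fits a line through the origin. The function names are made up for illustration, sizes in "M" are converted assuming 1G = 1024M, and the 200G figure is only a naive extrapolation far outside the measured range (the per-G cost creeps up on the largest sample, and these are simulated 100bp reads), so treat it as an optimistic guess rather than a prediction.

```python
# Measurements copied from the table above: (input size in G, indexing time in s).
# Assumption: sizes reported in "M" are converted with 1G = 1024M.
SAMPLES = [
    (32 / 1024, 2),
    (125 / 1024, 8),
    (249 / 1024, 15),
    (998 / 1024, 55),
    (2.0, 120),
    (7.9, 586),
]


def seconds_per_gb(samples):
    """Per-row indexing cost; roughly constant values suggest linear growth."""
    return [time_s / size_gb for size_gb, time_s in samples]


def fit_through_origin(samples):
    """Least-squares slope for the model time = slope * size (no intercept)."""
    num = sum(size_gb * time_s for size_gb, time_s in samples)
    den = sum(size_gb * size_gb for size_gb, _ in samples)
    return num / den


def estimate_hours(size_gb, slope):
    """Naive linear extrapolation of indexing time, in hours."""
    return size_gb * slope / 3600


slope = fit_through_origin(SAMPLES)
for (size_gb, time_s), rate in zip(SAMPLES, seconds_per_gb(SAMPLES)):
    print(f"{size_gb:6.2f}G  {time_s:4d}s  {rate:5.1f} s/G")
print(f"fitted slope: {slope:.1f} s/G")
print(f"naive estimate for 200G: {estimate_hours(200, slope):.1f} h")
```

If the per-G cost keeps rising with input size, the real number will be higher than what this prints. RAM in the table also grows with input size, so memory may be a limit before time is.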
Please, let me know once you have the index, how long it took (if you have that info)
Hi
Sorry for submitting a bunch of new issues at once. I'm curious about the running time SVDSS needs to index and search datasets of different sizes, and how much the input size affects the time cost. For example, I've run the index on a 3G reference genome with 10 threads and it took around 40 min; how much time would I need to index short-read data of ~200G?
I've read the corresponding SVDSS paper but did not find evaluations about the time cost, sorry if I missed something.
Thank you