Parsoa / SVDSS

Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)

MIT License

Run time estimation? #23

Open LYC-vio opened 1 year ago

LYC-vio commented 1 year ago

Sorry for submitting a bunch of new issues at once. I'm curious about how much time SVDSS needs to index and search datasets of different sizes, or how much the input size affects the running time. For example, I ran the index step on a 3G reference genome with 10 threads and it took around 40 min; how much time would I need to index ~200G of short-read data?

I've read the corresponding SVDSS paper but did not find evaluations about the time cost, sorry if I missed something.

Thank you

ldenti commented 1 year ago

Hi, unfortunately I don't have any statistics on this. I'd say that time complexity should be linear, but you never know until you try it 😄 I could run some tests and get back to you, or maybe you can check the numbers on this table (ropebwt2 rows).

You may also find more information in our previous paper: in the supplementary material I see that it needed 20 hours to index a ~30x PacBio HiFi sample.

Just a note: the number of threads used by the indexing step is fixed at 4 and cannot be changed. Moreover, when you index a reference genome (so a limited number of entries in the .fasta), I think that the current version uses a single thread (indeed, you should see a warning like "Turn off parallelization for this batch as too few strings are left.").

LYC-vio commented 1 year ago

Thank you!

Do you mean that --threads does not change the actual number of threads SVDSS uses for the index step? Or are you referring to the index thread settings in PingPong?

May I also ask why the thread number is fixed at 4 for the index step? Is that due to a memory limit or something else?

> you can check the numbers on this table (ropebwt2 rows).

Thanks! I'll check it out

Really appreciate your timely responses

ldenti commented 1 year ago

It is related to the ropebwt2 implementation we use. I tried to dig into that some time ago and found out that it was using 4 additional threads (check here: https://github.com/lh3/ropebwt2/blob/bd8dbd3db2e9e3cff74acc2907c0742c9ebbf033/mrope.c#L287). I don't recall the details now, but it was something like one thread per nucleotide. Take this with a grain of salt, though.

ldenti commented 1 year ago

Don't know if this may help, but I just simulated some samples of 100bp-long reads and these are the results:

#reads	File Size	Time (s)	RAM	Index Size
131 072	32M	2	65M	13M
524 288	125M	8	239M	44M
1 048 576	249M	15	454M	81M
4 194 304	998M	55	1.6G	244M
8 388 608	2.0G	120	3.1G	409M
33 554 432	7.9G	586	12.1G	1.4G

Growth seems almost linear but I don't know if we can fully trust these results 😃
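
One rough way to check the "almost linear" impression is to fit a power law, time ≈ a · reads^b, to the rows above and look at the exponent b (b = 1 would be exactly linear). The snippet below is only a sketch under that assumption: the function names `fit_power_law` and `predict_seconds` are made up for illustration and are not part of SVDSS, and the numbers are simply copied from the table.

```python
import math

# (number of reads, indexing time in seconds), copied from the table above
measurements = [
    (131_072, 2),
    (524_288, 8),
    (1_048_576, 15),
    (4_194_304, 55),
    (8_388_608, 120),
    (33_554_432, 586),
]


def fit_power_law(points):
    """Least-squares fit of log(time) = log(a) + b * log(reads); returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log space = log(a)
    return a, b


def predict_seconds(n_reads, a, b):
    """Time predicted by the fitted power law (hypothetical helper)."""
    return a * n_reads ** b


a, b = fit_power_law(measurements)
print(f"fitted exponent: {b:.2f}")
for n, t in measurements:
    fitted = predict_seconds(n, a, b)
    print(f"{n:>11,d} reads: measured {t:>4d} s, fitted {fitted:7.1f} s")
```

An exponent close to 1 would be consistent with linear growth, but this is six points from simulated 100bp reads on one machine, so using `predict_seconds` far outside the measured range (or for other read types) is a guess rather than an estimate.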

Please let me know how long it took once you have the index (if you have that info).