gamcil / clinker

Gene cluster comparison figure generator
MIT License

prohibitively slow performance #21

Open marade opened 3 years ago

marade commented 3 years ago

Testing for #9: the good news is that using the --compliant switch for PROKKA apparently allows the script to continue beyond where it would previously crash. However, clinker then engages in slow, single-threaded pairwise alignment clustering that does not scale well, making it too slow to use for more than a few genomes. A couple of simple changes that would alleviate this:

Thanks!

gamcil commented 3 years ago

Nice, I'm glad that the PROKKA flag worked.

Yeah, the performance is slow. Honestly, it's just because I never really envisioned people wanting to use it for entire genomes, so I didn't think to do any parallelisation. I'd like to add it in the future, but even then, as you say, memory usage will still be a problem. I think adding an --alignment flag or something like your second suggestion is probably the best option. What does a multi alignment typically look like for you? Could you format it to e.g. Sequence One - Sequence Two - Identity - Similarity?

marade commented 3 years ago

I think what you have in mind might better be termed an all-to-all alignment, where every sequence is compared to every other one and an identity / sim value is assigned to each pair? If that's the case, you might want to consider using distance matrices as input. Then you could take input from lots of different programs, e.g.

https://github.com/kdmurray91/kwip

https://github.com/burkhard-morgenstern/FSWM

https://alurulab.cc.gatech.edu/phylo

gamcil commented 3 years ago

Cool, all-to-all alignment sounds right. Similarity isn't so important, so it would realistically just need a distance matrix of identity or some other 0-1 score. I'll just have to have a look at common formats, though I think a simple newline-separated one-two-score type file will probably be the easiest way.
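For illustration, a "one-two-score" file of the kind described above could be read along these lines. This is only a sketch: the tab-separated layout, the `#` comment convention, and the function name `read_pair_scores` are assumptions made for the example, not a format that clinker defines.

```python
import csv
import io


def read_pair_scores(handle):
    """Read tab-separated 'query<TAB>target<TAB>score' rows into a dict.

    The three-column layout is an assumed format, chosen for illustration.
    """
    scores = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, target, score = row[0], row[1], float(row[2])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {query}/{target}: {score}")
        # Order-independent key, so A/B and B/A refer to the same pair.
        scores[frozenset((query, target))] = score
    return scores


example = io.StringIO("geneA\tgeneB\t0.82\ngeneA\tgeneC\t0.41\n")
pairs = read_pair_scores(example)
```

Keying on a `frozenset` means a lookup works regardless of which sequence is listed first, which suits a symmetric identity score.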

Also, hopefully I can merge https://github.com/gamcil/clinker/pull/22 soon, which will add multiprocessing for alignments within clinker itself.

marade commented 3 years ago

Distance matrices standardly look like this:

                sample1 sample2 sample3
  sample1            0    2.32    3.32
  sample2        3.45       0    1.24
  sample3        3.33    6.32       0

So you always have zeros (or 1s) on the diagonal. There are libraries for these:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
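As a rough sketch, a labelled matrix like the one above could be flattened into pairwise records with plain Python. The function name `matrix_to_pairs` and the whitespace-delimited layout are assumptions for this example; real tools may emit other delimiters or triangular matrices.

```python
def matrix_to_pairs(text):
    """Convert a whitespace-delimited, labelled square matrix into
    (row_label, column_label, value) tuples, skipping the diagonal."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    pairs = []
    for row in rows:
        label, values = row[0], row[1:]
        if len(values) != len(header):
            raise ValueError(
                f"row {label!r} has {len(values)} values, expected {len(header)}"
            )
        for column, value in zip(header, values):
            if column != label:  # the diagonal carries no pairwise information
                pairs.append((label, column, float(value)))
    return pairs


matrix = """
        sample1 sample2 sample3
sample1       0    2.32    3.32
sample2    3.45       0    1.24
sample3    3.33    6.32       0
"""
records = matrix_to_pairs(matrix)
```

Both directions of each pair are kept, because the example matrix is not symmetric (sample1/sample2 is 2.32 while sample2/sample1 is 3.45).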

marade commented 3 years ago

It sounds like you are making good progress. Looking forward to the results.

gamcil commented 3 years ago

Multiprocessing has been added in v0.0.10 via the -j/--jobs argument. I'm still thinking about the distance matrix part, but it didn't make it into this release.
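For readers unfamiliar with the general idea, spreading all-vs-all comparisons over several processes might look roughly like the sketch below. It is not clinker's implementation: `toy_identity` is a hypothetical stand-in for a real pairwise alignment, and `compare_all` with its `jobs` parameter is a name invented for this example.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations


def toy_identity(pair):
    """Stand-in for a real alignment: fraction of matching positions."""
    (name_a, seq_a), (name_b, seq_b) = pair
    length = max(len(seq_a), len(seq_b))
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    score = matches / length if length else 0.0
    return name_a, name_b, score


def compare_all(sequences, jobs=1):
    """Score every unordered pair of sequences, optionally across processes."""
    pairs = list(combinations(sorted(sequences.items()), 2))
    if jobs <= 1:
        return [toy_identity(pair) for pair in pairs]  # serial path
    # Each worker process receives chunks of pairs to score independently.
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(toy_identity, pairs, chunksize=64))


results = compare_all({"a": "ACGT", "b": "ACGA", "c": "TTTT"}, jobs=1)
```

With n sequences there are n(n-1)/2 pairs, so extra processes reduce wall-clock time but not the quadratic growth in work. When passing `jobs` greater than 1 in a script, the call should sit under an `if __name__ == "__main__":` guard.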

matrs commented 3 years ago

I think that adding some information about the scope/intentions and performance of clinker to the main README.md could save time for people potentially interested in using this tool.

arghya1611 commented 2 years ago

I have to add here that unfortunately I am finding the performance of Clinker to be extremely poor. A pairwise alignment between two genomes with 8 CPUs and more than 250GB of RAM available should not take 15 hours (and it's still not done). There are multiple issues with the tool, including a poor README/doc, but the performance is truly prohibitive of its use. Would have really liked to use it but can't!

hyphaltip commented 2 years ago

@arghya1611 If you want to do whole genome alignment and visualization, you should likely use a different tool. As the title describes, it is a gene cluster visualization tool, not necessarily a whole genome synteny program.

As this is an open source, unfunded tool, contributions to code are always valued!