Open marade opened 3 years ago
Nice, I'm glad that the PROKKA flag worked.
Yeah, the performance is slow - honestly it's just because I never really envisioned people wanting to use it for entire genomes so I didn't think to do any parallelisation. I'd like to add it in the future, but even then, as you say, memory usage will still be a problem. I think adding an --alignment flag or something like your second suggestion is probably the best option. What does a multi alignment typically look like for you? Could you format it to e.g. Sequence One - Sequence Two - Identity - Similarity?
I think what you have in mind might better be termed an all-to-all alignment, where every sequence is compared to every other one and an identity / sim value is assigned to each pair? If that's the case, you might want to consider using distance matrices as input. Then you could take input from lots of different programs, e.g.
https://github.com/kdmurray91/kwip
Cool, all-to-all alignment sounds right. Similarity isn't so important, so would realistically just need a distance matrix of identity or some other 0-1 score. I'll just have to have a look at common formats, though I think a simple newline-separated one-two-score type file will probably be the easiest way.
Also, hopefully I can merge https://github.com/gamcil/clinker/pull/22 soon, which will add multiprocessing for alignments within clinker itself.
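For the one-two-score file idea mentioned above, here is a minimal sketch of how such a file could be read into a symmetric identity matrix. It assumes tab-separated columns and scores in the 0-1 range; the function name `read_pairwise_scores` and the file layout are hypothetical and not part of clinker.

```python
import io


def read_pairwise_scores(handle, default=0.0):
    """Parse 'name_a<TAB>name_b<TAB>score' lines into a symmetric matrix.

    Hypothetical helper for illustration only; not clinker's API.
    Pairs missing from the file get `default`; the diagonal is set to 1.0.
    """
    names = []   # sequence names in order of first appearance
    scores = {}  # unordered pair -> score
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name_a, name_b, value = line.split("\t")
        for name in (name_a, name_b):
            if name not in names:
                names.append(name)
        # frozenset makes (a, b) and (b, a) the same key
        scores[frozenset((name_a, name_b))] = float(value)

    matrix = [
        [1.0 if a == b else scores.get(frozenset((a, b)), default) for b in names]
        for a in names
    ]
    return names, matrix


# Small in-memory example standing in for a real file
example = io.StringIO(
    "seq1\tseq2\t0.85\n"
    "seq1\tseq3\t0.40\n"
    "seq2\tseq3\t0.62\n"
)
names, matrix = read_pairwise_scores(example)
```

Each pair only needs to appear once in the file, since the lookup ignores the order of the two names.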
Distance matrices standardly look like this:
         sample1  sample2  sample3
sample1  0        2.32     3.32
sample2  3.45     0        1.24
sample3  3.33     6.32     0
So you always have zeros (or 1s) on the diagonal. There are libraries for these:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
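As a rough illustration, a labelled matrix in the layout shown above could be parsed with plain Python along these lines. This is only a sketch: it assumes whitespace-separated values and a header row of sample names, the function name `parse_distance_matrix` is hypothetical, and it does not enforce symmetry (the example values above are not symmetric).

```python
def parse_distance_matrix(text):
    """Parse a whitespace-separated, labelled square matrix.

    Hypothetical helper for illustration. Returns (labels, matrix), where
    matrix[row_label][column_label] is the value as a float.
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    labels = rows[0]  # header row: column labels only
    matrix = {}
    for row in rows[1:]:
        name, values = row[0], row[1:]
        if len(values) != len(labels):
            raise ValueError(f"row {name!r} has {len(values)} values, expected {len(labels)}")
        matrix[name] = dict(zip(labels, (float(v) for v in values)))
    if list(matrix) != labels:
        raise ValueError("row labels do not match column labels")
    return labels, matrix


# The example matrix from this thread
text = """
sample1 sample2 sample3
sample1 0 2.32 3.32
sample2 3.45 0 1.24
sample3 3.33 6.32 0
"""
labels, matrix = parse_distance_matrix(text)
```

Values are then looked up by name, e.g. `matrix["sample1"]["sample2"]`.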
It sounds like you are making good progress. Looking forward to the results.
Multiprocessing has been added in v0.0.10 via the -j/--jobs argument. I'm still thinking about the distance matrix input, but it didn't make it into this release.
I think that adding some information about the scope/intentions and performance of clinker to the main README.md could save time for people potentially interested in using this tool.
I have to add here that unfortunately I am finding the performance of Clinker to be extremely poor. A pairwise alignment between two genomes with 8 CPUs and more than 250GB of RAM available should not take 15 hours (and it's still not done). There are multiple issues with the tool, including a poor README/doc, but the performance is truly prohibitive. Would have really liked to use it but can't!
@arghya1611 If you want to do whole genome alignment and visualization, you should probably use a different tool. As the title describes, it is a gene cluster visualization tool, not necessarily a whole genome synteny program.
As this is an open source, unfunded tool, contributions to code are always valued!
Testing for #9, the good news is that using the --compliant switch for PROKKA apparently allows the script to continue beyond where it would previously crash. But then clinker engages in slow, single-threaded, pairwise alignment clustering that does not scale well, making it too slow to use for more than a few genomes. A couple of simple changes that would alleviate this:
multiprocessing - This alone would vastly improve the performance, though memory is a concern as I see a single thread eating over 8GB RAM when I run hundreds of genomes.
allow users to start with their own multi-alignments - I reckon most of us don't need (or want) clinker to do the alignments for us, because we either already have alignments, or have a faster way to run them. If you don't want to make that possible, I guess I might fork it myself.
Thanks!