OndrejSladky / kmercamel

KmerCamel🐫 provides implementations of several algorithms for efficiently representing a set of k-mers as a masked superstring.
MIT License
11 stars 2 forks source link

Suboptimal C++ string handling degrades the performance of Ahomaton-based algs #35

Closed karel-brinda closed 11 months ago

karel-brinda commented 1 year ago

It seems to me that the GreedyAC alg somehow became much slower than before.

In the current version, I'm getting

$ time kmercamel/kmers -k 15  -p <(xzcat spneumoniae.fa.xz) -a greedyAC > /dev/null

real    0m36.825s
user    0m9.733s
sys 0m21.072s

If I use canonical k-mers (the -c option, it becomes even worse

$ time kmercamel/kmers -c -k 15  -p <(xzcat spneumoniae.fa.xz) -a greedyAC > /dev/null

real    9m23.860s
user    0m29.113s
sys 2m14.312s

At least the non-canonical version used to be much faster (eg with k-mers @ fda2f7e), see https://github.com/karel-brinda/superstring-kmers-paper/blob/ed981bf861c6ab7efe020ca5b49d5d4507d0dd7f/experiments/05_kmer_camel_comparison_v1/results/immortalite.time.tsv#L159

karel-brinda commented 1 year ago

In the non-canonical case, it's really weird that the user time is so much smaller than the real time.

This could suggest eg a bug in FASTA parsing or an incorrect use of some C++ STL containers (eg doing frequent re-allocation of memory).

karel-brinda commented 1 year ago

Interestingly, when I re-ran it now, the results were much better. Not sure why...

$ time kmercamel/kmers -k 15  -p <(xzcat spneumoniae.fa.xz) -a greedyAC > /dev/null 

real    0m8.708s
user    0m8.158s
sys 0m0.513s

$ time kmercamel/kmers -k 15 -c  -p <(xzcat spneumoniae.fa.xz) -a greedyAC > /dev/null 

real    0m22.984s
user    0m21.210s
sys 0m1.464s
karel-brinda commented 1 year ago

I have some additional observations. This phenomenon happens when Snakemake runs multiple instances of kmer-camel at the same time and each of them takes eg 5 GB (I have 16 GB RAM so it should comfortably fit in).

Seems that kmer-camel does some weird memory reallocations – it probably involves frequent moves of large memory blocks. Such things can happen eg when C++ strings are used and the space is not properly allocated right from the beginning and if short strings are frequently appended, it has too allocate larger space and move the string there.

In result, performance is significantly downgraded, see the highlighted lines:

image

To avoid this, I'm using only 1 Snakemake thread for everything.

karel-brinda commented 1 year ago

The use of process substitutions vs regular files seems to have no impact on this (note in that in both timing experiments the computational interrupted in the middle to see just the ratio real vs usr, not necessarily he overall time):

immortalite:10_kmer_camel_comparison karel$ time kmercamel/kmers -c -k 26 -d 3 -p <(xzcat spneumoniae.fa.xz | seqtk seq -l 80) -a pseudosimplitigsAC > res1
^C

real    0m54.626s
user    0m30.148s
sys 0m20.689s
$ time kmercamel/kmers -c -k 26 -d 3 -p spneumoniae.fa -a pseudosimplitigsAC > res2
^C

real    0m50.862s
user    0m27.309s
sys 0m20.258s

Also, in both cases, no other instance of kmer-camel was running.

karel-brinda commented 1 year ago

Another update: The bottleneck is unlikely to be in FASTA parsing, but it's probably something AC-specific. When I run the evaluation without AC, it goes actually quite fast.

karel-brinda commented 1 year ago

Copying here the additional info from our last emails the AC performance degradation:

How it looks like in htop

image

Examples of lines that might be causing this

Zkusil jsem novou verzi a vypadá to, že problém trvá. Kouknul jsem se do zdrojáků a mám pár příkladů míst, kde bych tipoval, že to bude strašně zápasit s přesuny v paměti:

Velice časté rozšiřování napravo (možná lepší by byl seznam a spojit to na konec): https://github.com/GordonHoklinder/kmercamel/blob/f3f36f1e989d25e808696893b0a7d20b3fc6b62e/generalized_simplitigs_ac.h#L121

Trochu tady: https://github.com/GordonHoklinder/kmercamel/blob/f3f36f1e989d25e808696893b0a7d20b3fc6b62e/generalized_simplitigs_ac.h#L100

Tady si myslím, že to může být skoro kritické: https://github.com/GordonHoklinder/kmercamel/blob/f3f36f1e989d25e808696893b0a7d20b3fc6b62e/generalized_simplitigs_ac.h#L114 Obávám se, že tady se musí při každém rozšíření napravo dělat skoro neuvěřitelné alokace & přesuny – v podstatě se při každém rozšíření i o jedno písmenko podle mě kopíruje celý pseudosimplitig.

OndrejSladky commented 11 months ago

This should already be solved via #38 .