different runs return different number of unitigs and distinct color subsets

jermp commented 1 year ago

Hello @Guilucand and @alexandrutomescu. Let me first congratulate with you for this excellent algorithm!

We are currently using GGCAT for building Fulgor, a colored k-mer index based on SSHash (same functionalities offered by Themisto).

I noticed that, when running the same command multiple times, I get small differences in the number of unitigs/color subsets.

For example, if I run:

./fulgor build -l ../test_data/salmonella_10_filenames.txt -o ../test_data/salmonella_10 -k 31 -m 19 -d tmp_dir --verbose --check -t 16

I sometimes get 171 color subsets and 86630 unitigs, or 172 and 86632, or 173 and 86648.

I'm inclined to think this might be an issue with multithreading since I did not notice this if I use 1 thread only. Or, is it just something that can happen because unitigs are not required to be maximal anyway?

(The correctness of Fulgor is not broken as long as the number of kmers stays the same, as it seems to be. But I just wanted you to know this potential issue. Ideally, I'd expect to always get the same number across different runs.)

Thank you very much!

CC: @rob-p.

jermp commented 1 year ago

I'm CCing also @jnalanko who might be interested. Have you also notice this?

I can confirm that different runs of themisto may report different number of color sets, e.g., 4,236,355 vs. 4,236,354, for the same issue.

jnalanko commented 1 year ago

Hi, @jermp. No, I have not noticed this. Thankfully, the correctness of Themisto should not be affected either, as long as the unitigs cover all k-mers at least once, and the colors of the unitigs not wrong.

jermp commented 1 year ago

Yes exactly, both Themisto and Fulgor are correct anyway since the num. of kmers always seem to be correct and consistent. Just letting you know.

Guilucand commented 1 year ago

Hi @jermp, the difference in color subsets is due to a (partially wanted, for performance) race condition when updating the hash table of the colors.

For efficiency and technical reasons, a new color is first written to disk and then added to an in-memory hashmap that is used to deduplicate equal colors, without holding a global lock. So when two kmers of the same colors (never seen before) are processed at the same time, they can get different color indexes (that are in practice referring to the same set of colors) and they can write them before noticing that the other kmer had the same color. To avoid this problem ggcat should get a global lock each time it writes a new color to disk and hold it until the in-memory hashmap is updated (with the index returned by the disk write), slowing down parallel insertions.

In practice, this should not lead to incorrect results, unless you check for color equality only by comparing the color subset index without reading the colormap, also the increase in space for storing (few) duplicate colors is very small, and the computed compacted graph has always the same set of maximal unitigs.

The difference in unitigs counts you're seeing is also due to the problem above since when two adjacent kmers in the same maximal unitig have a different color index (even if it refers to the same color set) they are split into multiple unitigs.

Also, this problem tends to be more visible with very small datasets, where there is a high chance that two kmers processed by different threads at the exact same time share the same color.

If you need the color sets to be always distinct I can try to see if there is a way to ensure that, maybe putting this requirement under an optional feature flag if it hurts performance.

jermp commented 1 year ago

Hi @Guilucand and thank you for confirming. I do not think it is a hard requirement. Indeed correctness is not impacted if we have some duplicated color. Any thoughts here @rob-p?

rob-p commented 1 year ago

I agree it's not a hard requirement, but it would be very nice to have consistency between runs and to know the true number of colors. Differences are small in our example, but maybe could grow larger for very similar pangenomes and tons of threads.

How is color equality checked, currently? As map contention is very likely to be low anyway, perhaps you could try a sharded map like DashMap to avoid almost all lock contention?

Also, thanks for the quick response!

algbio / ggcat

different runs return different number of unitigs and distinct color subsets #24