Fix sources of nondeterminism in egglog

oflatt commented 1 month ago

@ezrosent, I ended up having to use indexmap in generic join, which might slow down egglog. Should we just merge it for now? I would love to add testing to egglog that somehow catches nondeterminism as well.

yihozhang commented 1 month ago

Can we instead just use a build-time flag which users can use to opt in for determinism?

oflatt commented 1 month ago

That's a good idea, but perhaps determinism should be opt-out?

yihozhang commented 1 month ago

Some thoughts:

In general, when performance and determinism are in conflict, I think performance should be the default option since most users don't care about determinism (e.g., they don't run snapshot tests), and users who need determinism know that they need it.
That being said, in this case, IndexMap may actually help performance since it's faster for iterating through! A quick run of cargo build --release && time target/release/egglog tests/eggcc-extraction.egg 2> /dev/null vs cargo build --release -F nondeterministic && time target/release/egglog tests/eggcc-extraction.egg 2> /dev/null shows that the deterministic version is slightly faster. If this result is consistent then we should just merge this PR without a separate compiler option!
- However, if in the future performance and determinism are in conflict, I'd still prefer the default build to prioritize performance over determinism.
Related to the last one, I noticed that by changing TypeInfo::sorts from HashMap to IndexMap, the time it takes to run eggcc-extraction.egg goes down from 16.6s to 11.8s, which is huge!! This is because one of our bottleneck, as shown by cargo flamegraph, is in eval_lit, which linearly scans the TypeInfo::sorts map. A better implementation should not do a linear scan.

oflatt commented 1 month ago

Reverted the feature, so this should be good to go! Excited to see better performance testing in the future.

Alex-Fischman commented 1 month ago

@oflatt Update to main to get benchmarking!

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #439 will degrade performances by 14.5%

_{Comparing oflatt-nondeterminism-fix (1b550ab) with main (60342af)}

Summary

❌ 6 regressions ✅ 2 untouched benchmarks

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`main`	`oflatt-nondeterminism-fix`	Change
❌	`eggcc-extraction`	4.1 s	4.4 s	-7.45%
❌	`herbie`	282.6 ms	304.8 ms	-7.28%
❌	`lambda`	145.1 ms	159.2 ms	-8.89%
❌	`math-microbenchmark`	3.6 s	4.2 s	-14.5%
❌	`python_array_optimize`	6.1 s	6.9 s	-11.25%
❌	`typeinfer`	408.7 ms	433.5 ms	-5.72%

saulshanabrook commented 1 month ago

I imagine some of the changes on shorter running benchmarks might not be significant - FWIW I just opened a PR to reduce our benchmarks to just the longer more meaningful ones (https://github.com/egraphs-good/egglog/pull/444). Otherwise, it seems like there is too much uncertainty introduced by the indeterminacy of the memory allocator.

The slowdown in the math-microbenchmark is possibly more relevant.

oflatt commented 4 weeks ago

We should try:

using a hash function with a default seed
forbid using hash maps except the one defined in a util file

saulshanabrook commented 4 weeks ago

reference to ahash as fast hash map with ability to set seed https://docs.rs/ahash/latest/ahash/

thaliaarchi commented 4 weeks ago

reference to ahash as fast hash map with ability to set seed https://docs.rs/ahash/latest/ahash/

hashbrown moved from ahash to foldhash in the past release, so that seems to be the new consensus. foldhash has a fixed seed initializer too: foldhash::fast::FixedState::with_seed(0). For comparison, I switched symbol_table (used by egglog) to use this.

oflatt commented 3 weeks ago

Not sure why, but things are still non-deterministic with this change https://github.com/egraphs-good/eggcc/actions/runs/11525853271/job/32088922445?pr=637

thaliaarchi commented 3 weeks ago

Since I updated the hash function in PR#445 Update hashbrown, I'd recommend you rebase onto main. My goal there was to update to the latest versions of the hashing dependencies.

I didn't notice that egglog also uses rustc-hash, so that ought to be updated, since they recently did significant improvements improving both performance and hash strength. It's no longer the Firefox-derived hash function; its collision weaknesses, which were problematic for rustc, have been fixed. Now, rustc-hash is a mildly stripped down version of foldhash, making tradeoffs for usage in a compiler (I've written on the differences between them in my notes). It's probably only worth using one of the two.

Update: fixed in PR#456 Update symbol_table. Max's symbol_table has some upstream changes which haven't been released to crates.io, so the version used by egglog is still on an old version of hashbrown and is also missing performance improvements contributed by someone else. For that, we should either ping him again or switch to a Git dependency on it in Cargo.toml.

ezrosent commented 3 weeks ago

One thought:

Another source of nondeterminism that can happen if you're running more than one instance of egglog in a single process is symbol_table. The numbers symbol_table hands out will not be deterministic (there are various process-global counters being incremented and the like. Different numbers will cause hashmap to store strings or symbols in different spots, hence iteration order will be different too.

oflatt commented 3 weeks ago

@thaliaarchi thanks for the suggestions

@ezrosent hmm yeah, that sounds like it would do it. I assumed wrongly that symbol gen was deterministic.

Sounds like I should make a "nondeterministic" feature flag for now, and we can revisit this question at a later date. The best option is probably to somehow sort the resulting serialized egraph, since that would work even with parallel egglog.

oflatt commented 2 weeks ago

Regression is caused by the swap from hashmap to index maps, which have a slower insertion and lookup speed. To recover the performance, use the nondeterministic flag.

Looking to the future, we'd like to remove the dependency on symbol gen library and use the fixed seed solution.

egraphs-good / egglog