AnyDSL / thorin2

The Higher ORder INtermediate representation - next gen
https://anydsl.github.io/thorin2/
MIT License

optimization exhibits non-deterministic behavior #143

Open NeuralCoder3 opened 1 year ago

NeuralCoder3 commented 1 year ago

Sometimes, the behavior of the optimization pipeline seems to be non-deterministic.

Example: ./build/bin/thorin -d mem -o - lit/mem/no_mem.thorin -VVVV in https://github.com/NeuralCoder3/thorin2/tree/ad_ptr_merge 702d848

The issue might be due to the add_mem optimization, the pipeline builder, or an underlying bug in thorin.

This behavior might also be a side effect of the earlier (not yet merged) changes to mem and clos conv, whose far-reaching impact has not manifested until now.

leissa commented 1 year ago

Yes, this is super annoying. Another source is this:

world.app(emit1(), emit2());

The order in which emit1() and emit2() are evaluated is unspecified, so this code can behave differently on different compilers and operating systems.
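A minimal sketch of the usual workaround, assuming the fix is simply to bind each call to a named local before passing it on. The functions `emit1`, `emit2`, and `app` below are hypothetical stand-ins, not thorin's actual API; they only log the order in which they run.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for emit1()/emit2(): each logs that it ran.
static std::vector<std::string> call_log;

int emit1() { call_log.push_back("emit1"); return 1; }
int emit2() { call_log.push_back("emit2"); return 2; }

// Hypothetical stand-in for world.app(callee, arg).
int app(int callee, int arg) { return callee * 10 + arg; }

// Both arguments are evaluated before the call, but the compiler is free
// to run emit1() and emit2() in either order.
int app_unsequenced() { return app(emit1(), emit2()); }

// Each declaration is a full expression, so emit1() always runs before emit2().
int app_sequenced() {
    int callee = emit1();
    int arg    = emit2();
    return app(callee, arg);
}
```

With `app_sequenced()`, the log always reads `emit1`, `emit2`; with `app_unsequenced()`, either order is permitted, even though the returned value is the same.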

I have implemented the --trace-gids switch that we could somehow use to test for this in our CI.

NeuralCoder3 commented 1 year ago

The issue occurs only sometimes, with the same executable on the same computer under the same circumstances. Therefore, timing issues or randomness might be the cause.

Probably related issue: ./build/bin/thorin -d matrix -d affine lit/matrix/mapReduce_mult.thorin -o - -VVVV in matrix_dialect f3a3def sometimes generates thorin code and sometimes prints the following error:

:4294967295: error: cannot pass argument 
  '(__806508#2:(.Idx 3), ‹__806508#2:(.Idx 3); .Idx 4294967296›, 0)' of type 
  '[.Nat, «__806508#2:(.Idx 3); ★», .Nat]' to 
  '%mem.lea' of domain 
  '[n_834521: .Nat, _834535: «n_836768; ★», _834540: .Nat]'

which seems odd to me, as the arguments are of the form

(n, <n; T>, 0)

which should have the type

[n:.Nat, <<n; *>>, .Nat]

which should agree with the domain of lea.

leissa commented 1 year ago

I was fighting this issue in #184, where a Debug build produced different output than the Release build.

As mentioned above, --trace-gids and --reeval-breakpoints helped me track down the problem. We could probably write a test case with some non-trivial code, run it with --trace-gids, and double-check in our CI that all builds produce the same output.
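One way such a check could look, as a rough sketch: capture the output of two runs (or two builds) as strings and report the first line at which they diverge. The helper `first_difference` is a hypothetical name, and how the outputs are captured and what --trace-gids actually prints are left open here.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Returns the 1-based number of the first line at which the two outputs
// differ, or 0 if they match line for line. A missing line on one side
// counts as a difference.
std::size_t first_difference(const std::string& a, const std::string& b) {
    std::istringstream sa(a), sb(b);
    std::string la, lb;
    std::size_t line = 0;
    while (true) {
        bool has_a = static_cast<bool>(std::getline(sa, la));
        bool has_b = static_cast<bool>(std::getline(sb, lb));
        ++line;
        if (!has_a && !has_b) return 0;               // both exhausted: no difference
        if (has_a != has_b || la != lb) return line;  // length or content mismatch
    }
}
```

A CI job could fail whenever the result is non-zero and print the offending line number, which gives a starting point for narrowing down where the runs diverge.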

leissa commented 1 year ago

While #185 fixes part of this problem, there are still some odd things happening and we need a test case to test for this.