Open maxim-saplin opened 1 year ago
I notice that the first three rounds have a different checksum on JIT too, suggesting that the code is susceptible to differences in rounding of intermediate values.
That's a little worrisome, since the code appears like it should be deterministic. Every operation is a SIMD 32-bit floating point addition or multiplication, or a comparison.
Getting different results anyway suggests either not using the same SIMD operations before and after JIT optimization, having different rounding behavior, not using SIMD before optimizing at all, or just something being bugged.
I get the same slowdown on Windows. For JIT run, I get:
0 Execution Time: 126 87666736
1 Execution Time: 125 87662921
2 Execution Time: 126 87661033
3 Execution Time: 119 87667671
...
It's stable from there.
When compiled with dart compile exe
, I get:
0 Execution Time: 7556 78513692
...
stable all the way through, compiled wioth
(Dart SDK version: 3.2.0-202.0.dev (dev) (Tue Sep 26 21:06:42 2023 -0700) on "windows_x64")
That's a nice 64 times slowdown, and getting the same sum as you. So AOT is predictable and deterministic, and very slow.
Here're a SO thread, a few guys checked assembly and told it is horrible (check comments): https://stackoverflow.com/questions/77201568/dart-simd-extensions-int32x4-float32x4-going-crazy-slow-in-aot-different-re
Regarding the the correctness of the results, what is not OK is the result on Intel, it is way to far from what I would expect (10%). JIC, I've tested same problem with 8 different languages (22 configs) they all differ slightly in precision and check sum: https://github.com/maxim-saplin/mandelbrot/tree/main
And performance in AoT is another problem, relevant for both Intel and Apple Silicon.
The difference in sum is around ~1M, which is on the order of one iteration per pixel. That is a very big difference, bigger than what (I think) can reasonably be explained by rounding errors. Every iteration does a squaring, which should quench low-bit noise rather than amplify it. (If you multiply two 24 bit integers, then keep the top 24 bits of the result, that should be less affected by low-bit changes than addition.)
Since the result changes when the JIT optimizes, it seems the value problem could be the optimization. And indeed, if I run without optimization, I get the expected result. Very slowly.
>dart --optimization_counter_threshold=-1 mandelo.dart
0 Execution Time: 9639 78513692
1 Execution Time: 9654 78513692
...
And ... the bug seems to be in greaterThan
. If I change the comparison line to do it in the other direction:
Int32x4 breakCondition = sum.lessThan(escapeThreshold); // (escapeThreshold).greaterThan(sum);
then the result becomes 78513692 on JIT as well.
It seems to be NaN-related. If I look at the produced values, it turns out that some of the iterations end up with a value of NaN in the comparison, also in the "working" versions.
It seems zzx
and zzy
both overflow to infinity, likely because of continuing to iterate on one coordinate even after it leaves the 2.0-circle, as long as another coordinate is still inside. Then zzx - zzy
becomes NaN at that coordinate.
That propagates into sum
in the next iteration, and (4.0...).greaterThan(NaN...)
should give false (0), but in optimized code it gives true (-1), in the compare mask.
The swapped lessThan
gives the correct result (false).
(Our Float32x4.greaterThan
method is undocumented. Let's assume the same as the corresponding JavaScript method, which returns Bool32x4
. After trying to read that spec, I'm fairly sure that comparing to NaN should give false. We don't give Bool32x4
, instead we give a Int32x4
usable as a bit-mask, so true is all-ones and false is all-zeros. So the result of that greaterThan
should be zero, and it is all-ones in optimized code.)
//cc @rmacnak-google @alexmarkov
The correctness issue has been fixed, the performance issue is still open. https://github.com/dart-lang/sdk/issues/31487 is tracking the need for const constructors so Flutter code could use SIMD.
Hi everyone,
Are there any updates on this issue?
I'm experiencing the same problem: my library, which heavily utilizes SIMD, is extremely slow in AoT mode.
STEPS:
dart mandelbrot.dart
and remember the results (execution time and check sum)dart compile exe mandelbrot.dart
, run and remember the resultEXPECTED:
ACTUAL:
ENVIRONMENTS: