dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.29k stars 1.59k forks source link

SIMD is broken (Int32x4 and Float32x4) #53662

Open maxim-saplin opened 1 year ago

maxim-saplin commented 1 year ago

STEPS:

  1. Copy the file: https://github.com/maxim-saplin/mandelbrot/blob/main/_optimized/mandelbrot.dart
  2. RUN it in VM dart mandelbrot.dart and remember the results (execution time and check sum)
  3. Build it with AoT dart compile exe mandelbrot.dart, run and remember the result

EXPECTED:

ACTUAL:

// M1 Pro, sum 78513692
// dart mandelbrot.dart - Avg: 93.4ms, StdDev: 1.6119%
// dart compile exe mandelbrot.dart - Avg: 4038.5ms, StdDev: 0.6437%
// Intel
// dart mandelbrot.dart - !! sum 87667671 Avg: 162.9ms, StdDev: 7.5598%
// dart compile exe mandelbrot.dart - sum 78513692, Avg: 8806.0ms, StdDev: 4.4871%

ENVIRONMENTS:

lrhn commented 1 year ago

I notice that the first three rounds have a different checksum on JIT too, suggesting that the code is susceptible to differences in rounding of intermediate values.

That's a little worrisome, since the code appears like it should be deterministic. Every operation is a SIMD 32-bit floating point addition or multiplication, or a comparison.

Getting different results anyway suggests either not using the same SIMD operations before and after JIT optimization, having different rounding behavior, not using SIMD before optimizing at all, or just something being bugged.

I get the same slowdown on Windows. For JIT run, I get:

0         Execution Time: 126                       87666736
1         Execution Time: 125                       87662921
2         Execution Time: 126                       87661033
3         Execution Time: 119                       87667671
...

It's stable from there.

When compiled with dart compile exe, I get:

0         Execution Time: 7556                       78513692
...

stable all the way through, compiled wioth

(Dart SDK version: 3.2.0-202.0.dev (dev) (Tue Sep 26 21:06:42 2023 -0700) on "windows_x64")

That's a nice 64 times slowdown, and getting the same sum as you. So AOT is predictable and deterministic, and very slow.

maxim-saplin commented 1 year ago

Here're a SO thread, a few guys checked assembly and told it is horrible (check comments): https://stackoverflow.com/questions/77201568/dart-simd-extensions-int32x4-float32x4-going-crazy-slow-in-aot-different-re

Regarding the the correctness of the results, what is not OK is the result on Intel, it is way to far from what I would expect (10%). JIC, I've tested same problem with 8 different languages (22 configs) they all differ slightly in precision and check sum: https://github.com/maxim-saplin/mandelbrot/tree/main

And performance in AoT is another problem, relevant for both Intel and Apple Silicon.

lrhn commented 1 year ago

The difference in sum is around ~1M, which is on the order of one iteration per pixel. That is a very big difference, bigger than what (I think) can reasonably be explained by rounding errors. Every iteration does a squaring, which should quench low-bit noise rather than amplify it. (If you multiply two 24 bit integers, then keep the top 24 bits of the result, that should be less affected by low-bit changes than addition.)

Since the result changes when the JIT optimizes, it seems the value problem could be the optimization. And indeed, if I run without optimization, I get the expected result. Very slowly.

>dart --optimization_counter_threshold=-1 mandelo.dart
0         Execution Time: 9639                       78513692
1         Execution Time: 9654                       78513692
...

And ... the bug seems to be in greaterThan. If I change the comparison line to do it in the other direction:

        Int32x4 breakCondition = sum.lessThan(escapeThreshold); // (escapeThreshold).greaterThan(sum);

then the result becomes 78513692 on JIT as well.

It seems to be NaN-related. If I look at the produced values, it turns out that some of the iterations end up with a value of NaN in the comparison, also in the "working" versions.

It seems zzx and zzy both overflow to infinity, likely because of continuing to iterate on one coordinate even after it leaves the 2.0-circle, as long as another coordinate is still inside. Then zzx - zzy becomes NaN at that coordinate. That propagates into sum in the next iteration, and (4.0...).greaterThan(NaN...) should give false (0), but in optimized code it gives true (-1), in the compare mask.

The swapped lessThan gives the correct result (false).

(Our Float32x4.greaterThan method is undocumented. Let's assume the same as the corresponding JavaScript method, which returns Bool32x4. After trying to read that spec, I'm fairly sure that comparing to NaN should give false. We don't give Bool32x4, instead we give a Int32x4 usable as a bit-mask, so true is all-ones and false is all-zeros. So the result of that greaterThan should be zero, and it is all-ones in optimized code.)

a-siva commented 1 year ago

//cc @rmacnak-google @alexmarkov

a-siva commented 1 year ago

The correctness issue has been fixed, the performance issue is still open. https://github.com/dart-lang/sdk/issues/31487 is tracking the need for const constructors so Flutter code could use SIMD.

gyrdym commented 6 months ago

Hi everyone,

Are there any updates on this issue?

I'm experiencing the same problem: my library, which heavily utilizes SIMD, is extremely slow in AoT mode.