karbiv opened this issue 2 months ago
Summary: The user reports slow performance of Dart FFI code in a microbenchmark that converts an image to grayscale. The benchmark takes 13.6 seconds in Dart, significantly slower than other languages such as LuaJIT, Nim, and Go. The issue appears to be related to indexing of Pointer<RGBA> objects: removing the intermediate pix variable and accessing img[i] directly results in even slower performance (69 seconds).
A similar issue was closed 4 years ago: https://github.com/dart-lang/sdk/issues/43142
A recent commit with optimizations seems not directly related and doesn't solve it: https://github.com/dart-lang/sdk/commit/af82cadcec3d2b5a487b6b09244c6afe481f8519
@karbiv can you provide a Go or LuaJIT program for comparison?
Most of the time is actually spent in the toInt (which for some legacy reasons produces a boxed integer?) and not in the FFI:
;; v88 <- DoubleToInteger:10(v22 T{_Double}) [-9223372036854775808, 9223372036854775807] T{int}
0x7f0d7610a1fa f2480f2cd9 cvttsd2siq rbx,xmm1
0x7f0d7610a1ff 4889df movq rdi,rbx
0x7f0d7610a202 48d1e7 shlq rdi,1
0x7f0d7610a205 0f803c000000 jo 0x00007f0d7610a247
0x7f0d7610a20b 4803db addq rbx,rbx
;; ParallelMove rbx <- rbx
;; v232 <- UnboxInt64([non-speculative], v88 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x7f0d7610a20e 48d1fb sarq rbx,1
0x7f0d7610a211 7305 jnc 0x00007f0d7610a218
0x7f0d7610a213 488b5c1b08 movq rbx,[rbx+rbx*1+0x8]
2.16 │ cvtsi2sd %rbx,%xmm0
2.26 │ movsd 0x5507(%r15),%xmm2
2.49 │ mulsd %xmm0,%xmm2
0.08 │ addsd %xmm2,%xmm1
10.67 │ cvttsd2si %xmm1,%rbx
0.36 │ mov %rbx,%rdi
10.29 │ shl $1,%rdi
10.49 │ ↓ jo c7
0.54 │ add %rbx,%rbx
1.47 │8e: sar $1,%rbx
2.97 │ ↓ jae 98
│ mov 0x8(%rbx,%rbx,1),%rbx
6.26 │98: mov %rbx,%rax
can you provide a Go or LuaJIT program for comparison?
LuaJIT:
local ffi = require("ffi")

ffi.cdef [[
typedef struct { uint8_t red, green, blue, alpha; } rgba_pixel;
]]

local function image_ramp_green(n)
  local img = ffi.new("rgba_pixel[?]", n)
  local f = 255/(n - 1)
  for i = 0, n - 1 do
    img[i].green = i * f
    img[i].alpha = 255
  end
  return img
end

local function image_to_grey(img, n)
  for i = 0, n - 1 do
    local pix = img[i]
    local y = 0.3 * pix.red + 0.59 * pix.green + 0.11 * pix.blue
    pix.red = y
    pix.green = y
    pix.blue = y
  end
end

local function main()
  local start_time = os.clock()
  local N = 1920 * 1080
  local img = image_ramp_green(N)
  for _ = 1, 1000 do
    image_to_grey(img, N)
  end
  local elapsed_time = os.clock() - start_time
  print(elapsed_time)
end

main()
The Go version doesn't use FFI because Go has a native uint8 type.
package main

import (
    "fmt"
    "time"
)

type rgba_pixel struct {
    red, green, blue, alpha uint8
}

func image_to_grey(img []rgba_pixel, N int) {
    for i := range N {
        y := uint8(0.3*float32(img[i].red) + 0.59*float32(img[i].green) + 0.11*float32(img[i].blue))
        img[i].red = y
        img[i].green = y
        img[i].blue = y
    }
}

func image_ramp_green(N int) []rgba_pixel {
    img := make([]rgba_pixel, N)
    f := float32(255) / float32(N-1)
    for i := range N {
        img[i].green = uint8(float32(i) * f)
        img[i].alpha = 255
    }
    return img
}

func main() {
    start := time.Now()
    N := 1920 * 1080
    img := image_ramp_green(N)
    for range 1000 {
        image_to_grey(img, N)
    }
    took := time.Since(start)
    fmt.Printf("%d ms.\n", took.Nanoseconds()/1e6)
}
@lrhn When I look at the VM's double.toInt, I see that it clamps into the [kMinInt64, kMaxInt64] range. However, I don't see this behavior guaranteed by the API docs for toInt or truncate. Is this intentional? Can we break it?
IntegerPtr DoubleToInteger(Zone* zone, double val) {
  if (isinf(val) || isnan(val)) {
    const Array& args = Array::Handle(zone, Array::New(1));
    args.SetAt(0, String::Handle(zone, String::New("Infinity or NaN toInt")));
    Exceptions::ThrowByType(Exceptions::kUnsupported, args);
  }
  int64_t ival = 0;
  if (val <= static_cast<double>(kMinInt64)) {
    ival = kMinInt64;
  } else if (val >= static_cast<double>(kMaxInt64)) {
    ival = kMaxInt64;
  } else {  // Representable in int64_t.
    ival = static_cast<int64_t>(val);
  }
  return Integer::New(ival);
}
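For reference, this clamping is observable from Dart itself on the native VM. A minimal illustration, with the expected values taken from the kMinInt64/kMaxInt64 branches above:

void main() {
  const big = 1e300;
  print(big.toInt());    // 9223372036854775807, i.e. kMaxInt64
  print((-big).toInt()); // -9223372036854775808, i.e. kMinInt64
  // double.infinity.toInt() and double.nan.toInt() throw UnsupportedError,
  // matching the isinf/isnan branch above.
}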
The behavior is documented in num.truncate, which num.toInt refers to:

If the value is greater than the highest representable positive integer, the result is that highest positive integer. If the value is smaller than the highest representable negative integer, the result is that highest negative integer.

Whether we can change it depends on what breaks if we do. I don't know what may depend on the current behavior. I know that I have written code that did, using the max-int value as a sentinel rather than checking for large values before doing .toInt().
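A hypothetical sketch of that kind of dependency (not the actual code being referred to), where the clamped maximum acts as an overflow sentinel after the conversion instead of range-checking the double first:

// Hypothetical: relies on the documented clamping of toInt().
const int overflowSentinel = 9223372036854775807; // the clamp value, 2^63 - 1

int? toIntOrNull(double value) {
  final result = value.toInt();
  // Large positive doubles clamp to overflowSentinel, so this comparison
  // stands in for an explicit range check on the double beforehand.
  return result == overflowSentinel ? null : result;
}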
If we change it, what would we change it to?
The most efficient change on Intel platforms would be "whatever cvttsd2si does."
(That's -2^63 if the value is not in the -2^63..(2^63)-1 range, which differs from the current behavior in allowing non-finite values and in giving a different value for large positive numbers.)
I think that would be a reasonable change. It's still using a specific marker value for overflow, it's just the same one for positive and negative overflow, and it works for non-finites too, so there are fewer cases. (It's just "if the unbiased exponent is >= 63, then the special value".)
(I'm more wary about trying to "truncate to the low 64 bits", since doubles are not two's complement to begin with. I'd want the actual value if it's in the representable range, but if it's not, it's not clear what the "lower 64 bits" of a large negative double would be. Or maybe it is clear, if we just imagine infinite-bit two's complement numbers, like Dart's integers used to be.)
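A rough Dart sketch of the proposed semantics, purely as an illustration of the idea (the names are made up, and this is not existing or planned behavior): every non-finite or out-of-range input maps to the single marker value -2^63, which is what cvttsd2si produces.

// Sketch of the proposed behavior; names here are made up for illustration.
const int overflowMarker = -9223372036854775808; // -2^63

int proposedToInt(double value) {
  if (value.isNaN ||
      value >= 9223372036854775808.0 || // >= 2^63
      value < -9223372036854775808.0) { // < -2^63
    return overflowMarker;
  }
  return value.toInt(); // in range: plain truncation, same as today
}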
Is the performance problem here the clamping, or is it calling into C code to begin with? If I change the function to:
IntegerPtr DoubleToInteger(Zone* zone, double val) {
  return Integer::New(static_cast<int64_t>(val));
}
it doesn't change the timing at all, so it's not the computation that is costly.
Could we have an intrinsic for this code, or is that impractical given the complexity?
(If we did a simple cvttsd2si on the value, and only called into C++ code if the result was min-int, then that should handle all the common cases. And even the overflow cases are not that bad: I guess min-int/max-int don't fit in Smis, so we may want to have pre-allocated values for them.)
Most of the time is actually spent in the toInt

I thought at first that toInt is used internally somehow.
Here the var pix = img[i]; line is an optimization. Without pix at all, this benchmark takes 69 seconds; each added img[i] increases the time linearly.
imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var pix = img[i]; // this index access is the slow part
    var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt();
    pix.r = y;
    pix.g = y;
    pix.b = y;
  }
}
The line var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt(); takes only ~0.5 seconds of the total time, provided it doesn't contain img[i], of course.
@karbiv Ah, this seems to only happen in JIT mode, which I did not even bother to measure originally. I measured only in AOT (and on HEAD of the SDK). In AOT, pix is used to cache img[i], and the remaining time is dominated by toInt.
In JIT we hit an interesting issue: a field guard on _Compound._typedDataBase inhibits allocation sinking of the pointer. You can try running with --no-use-field-guards and the problem will go away. I will take a look and fix this.
Indeed. I was under the impression that the exe is just a wrapper for JIT-generated code, or erroneously assumed that dart run uses an AOT target.
But still, LuaJIT is significantly faster even than AOT. So toInt() it is.
The abnormal part in JIT is fixed, so the issue can be deemed solved.
This microbenchmark uses Dart as a replacement for a C loop, which is not Dart's intended purpose. The time difference compared with other GC languages is at most 35%, which is acceptable. The ergonomics of the FFI tooling (ffigen) matter more.
@karbiv I would like to keep this open, because it allowed me to find and fix a number of code quality issues in the backend (see commits above). Some of these I have encountered before but never had time to fix. I have a few more in the pipeline:
- C * x vs x * C (regalloc refuses to keep C in a register across the loop if it occurs on the left-hand side; see the sketch after this list). Should be fixed by https://dart-review.googlesource.com/c/sdk/+/385001
- img[i].r does multiple arithmetic instructions to compute the img[i] address and then does a load (on Intel it should be possible to fuse these arithmetic operations into the addressing mode).
- The loop induction variable could be changed from i to the img[i] address instead.
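A minimal Dart sketch of the C * x vs x * C pattern from the first item (hypothetical code, not taken from the benchmark): both loops compute the same reduction and differ only in which side of the multiplication the constant sits on.

// Hypothetical illustration: the same reduction with the constant on the
// left (C * x) and on the right (x * C) of the multiplication.
double sumLeftConstant(List<double> xs) {
  var sum = 0.0;
  for (var i = 0; i < xs.length; i++) {
    sum += 0.59 * xs[i]; // constant on the left-hand side
  }
  return sum;
}

double sumRightConstant(List<double> xs) {
  var sum = 0.0;
  for (var i = 0; i < xs.length; i++) {
    sum += xs[i] * 0.59; // constant on the right-hand side
  }
  return sum;
}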
dart --version
Dart SDK version: 3.6.0-edge.0ccb9e06adb79f25a396bd902fee149a289db35d (main) (Wed Sep 11 11:14:20 2024 +0000) on "linux_x64"
Microbenchmark:
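A reconstruction of the benchmark, assembled from the snippets quoted in this thread (the RGBA field names r/g/b/a, the calloc allocator from package:ffi, and the imageRampGreen helper are assumptions, so details may differ from the original program):

import 'dart:ffi';

import 'package:ffi/ffi.dart';

final class RGBA extends Struct {
  @Uint8()
  external int r;
  @Uint8()
  external int g;
  @Uint8()
  external int b;
  @Uint8()
  external int a;
}

const N = 1920 * 1080;

// Fills the image with a green ramp, mirroring the LuaJIT/Go versions above.
void imageRampGreen(Pointer<RGBA> img) {
  final f = 255 / (N - 1);
  for (var i = 0; i < N; i++) {
    final pix = img[i];
    pix.g = (i * f).toInt();
    pix.a = 255;
  }
}

void imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var pix = img[i]; // caches the struct view for the three field writes
    var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt();
    pix.r = y;
    pix.g = y;
    pix.b = y;
  }
}

void main() {
  final img = calloc<RGBA>(N);
  final sw = Stopwatch()..start();
  imageRampGreen(img);
  for (var i = 0; i < 1000; i++) {
    imageToGrey(img);
  }
  print('${sw.elapsedMilliseconds} ms');
  calloc.free(img);
}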
It takes 13.6 seconds to run, while other languages like LuaJIT, Nim, and Go take no more than 3 seconds.
If only img[i] is used, without the pix variable, it takes 69 seconds; each indexing adds its own cost:
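A minimal sketch of that variant, reusing the RGBA struct and N from the reconstruction above (the NoCache name is made up for illustration):

// Variant without the pix cache: img[i] is re-evaluated for every field
// access, which is what the report ties to the ~69 second run time.
void imageToGreyNoCache(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    final y = (0.3 * img[i].r + 0.59 * img[i].g + 0.11 * img[i].b).toInt();
    img[i].r = y;
    img[i].g = y;
    img[i].b = y;
  }
}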