dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.28k stars 1.58k forks source link

Abnormally slow performance of Dart FFI code. Microbenchmark. #56705

Open karbiv opened 2 months ago

karbiv commented 2 months ago

dart --version

Dart SDK version: 3.6.0-edge.0ccb9e06adb79f25a396bd902fee149a289db35d (main) (Wed Sep 11 11:14:20 2024 +0000) on "linux_x64"

Microbenchmark:

import 'dart:ffi';
import 'package:ffi/ffi.dart';

final class RGBA extends Struct {
  @Uint8()
  external int r, g, b, a;
}

const N = 1920 * 1080;

Pointer<RGBA> imageRampGreen() {
  var f = 255 / (N - 1);
  var img = calloc.allocate<RGBA>(N*4); // arg is num of bytes
  for (var i = 0; i < N; i++) {
    img[i].g = (i * f).toInt();
    img[i].a = 255;
  }
  return img;
}

imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var pix = img[i]; // this index access is the slow part
    var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt();
    pix.r = y;
    pix.g = y;
    pix.b = y;
  }  
}

main() {
  var now = DateTime.now();

  var img = imageRampGreen();
  // benchmark
  for (var i = 0; i < 1000; i++) {
    imageToGrey(img); 
  }
  var time = DateTime.now().difference(now);
  print('time: ${time.inMilliseconds} milliseconds');
}

It takes 13.6 seconds to run, when other languages like Luajit/Nim/Go take not more than 3 seconds.

If only img[i] is used without pix variable then it takes 69 seconds - each indexing takes its time:

imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var y = (0.3 * img[i].r + 0.59 * img[i].g + 0.11 * img[i].b).toInt();
    img[i].r = y;
    img[i].g = y;
    img[i].b = y;
  }  
}
dart-github-bot commented 2 months ago

Summary: The user reports slow performance of Dart FFI code in a microbenchmark that converts an image to grayscale. The benchmark takes 13.6 seconds in Dart, significantly slower than other languages like Luajit, Nim, and Go. The issue appears to be related to the indexing of Pointer<RGBA> objects, as removing the intermediate pix variable and directly accessing img[i] results in even slower performance (69 seconds).

karbiv commented 2 months ago

Similar issue closed 4 years ago, https://github.com/dart-lang/sdk/issues/43142

Recent commit with optimizations seems to be not directly related, doesn't solve it: https://github.com/dart-lang/sdk/commit/af82cadcec3d2b5a487b6b09244c6afe481f8519

mraleph commented 2 months ago

@karbiv can you provide Go or LuaJIT program for comparison?

Most of the time is actually spent in the toInt (which for some legacy reasons produces boxed integer?) and not in the FFI:

        ;; v88 <- DoubleToInteger:10(v22 T{_Double}) [-9223372036854775808, 9223372036854775807] T{int}
0x7f0d7610a1fa    f2480f2cd9             cvttsd2siq rbx,xmm1
0x7f0d7610a1ff    4889df                 movq rdi,rbx
0x7f0d7610a202    48d1e7                 shlq rdi,1
0x7f0d7610a205    0f803c000000           jo 0x00007f0d7610a247
0x7f0d7610a20b    4803db                 addq rbx,rbx
        ;; ParallelMove rbx <- rbx
        ;; v232 <- UnboxInt64([non-speculative], v88 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x7f0d7610a20e    48d1fb                 sarq rbx,1
0x7f0d7610a211    7305                   jnc 0x00007f0d7610a218
0x7f0d7610a213    488b5c1b08             movq rbx,[rbx+rbx*1+0x8]
  2.16 │      cvtsi2sd  %rbx,%xmm0                                                                                      ▒
  2.26 │      movsd     0x5507(%r15),%xmm2                                                                              ▒
  2.49 │      mulsd     %xmm0,%xmm2                                                                                     ▒
  0.08 │      addsd     %xmm2,%xmm1                                                                                     ▒
 10.67 │      cvttsd2si %xmm1,%rbx                                                                                      ◆
  0.36 │      mov       %rbx,%rdi                                                                                       ▒
 10.29 │      shl       $1,%rdi                                                                                         ▒
 10.49 │    ↓ jo        c7                                                                                              ▒
  0.54 │      add       %rbx,%rbx                                                                                       ▒
  1.47 │8e:   sar       $1,%rbx                                                                                         ▒
  2.97 │    ↓ jae       98                                                                                              ▒
       │      mov       0x8(%rbx,%rbx,1),%rbx                                                                           ▒
  6.26 │98:   mov       %rbx,%rax 
karbiv commented 2 months ago

can you provide Go or LuaJIT program for comparison?

Luajit:

local ffi = require("ffi")
ffi.cdef [[
typedef struct { uint8_t red, green, blue, alpha; } rgba_pixel;
]]

local function image_ramp_green(n)
   local img = ffi.new("rgba_pixel[?]", n)
   local f = 255/(n - 1)
   for i = 0, n - 1 do
      img[i].green = i * f
      img[i].alpha = 255
   end
   return img
end

local function image_to_grey(img, n)
   for i = 0, n - 1 do
      local pix = img[i]
      local y = 0.3 * pix.red + 0.59 * pix.green + 0.11 * pix.blue
      pix.red = y
      pix.green = y
      pix.blue = y
   end
end

local function main()
   local start_time = os.clock()

   local N = 1920 * 1080
   local img = image_ramp_green(N)
   for _ = 1, 1000 do
      image_to_grey(img, N)
   end

   local elapsed_time = os.clock() - start_time
   print(elapsed_time)
end

main()

Go doesn't use FFI because it has uint8 type.

type rgba_pixel struct {
    red, green, blue, alpha uint8
}

func image_to_grey(img []rgba_pixel, N int) {
    for i := range N {
        y := uint8(0.3*float32(img[i].red) + 0.59*float32(img[i].green) + 0.11*float32(img[i].blue))
        img[i].red = y
        img[i].green = y
        img[i].blue = y
    }
}

func image_ramp_green(N int) []rgba_pixel {
    img := make([]rgba_pixel, N)
    f := float32(255)/float32(N-1)
    for i := range N {
        img[i].green = uint8(float32(i)*f)
        img[i].alpha = 255
    }

    return img
}

func main () {
    start := time.Now()

    N := 1920*1080
    img := image_ramp_green(N)
    for range 1000 {
        image_to_grey(img, N)
    }

    took := time.Since(start)
    fmt.Printf("%d ms.\n", took.Nanoseconds()/1e6)
}
mraleph commented 2 months ago

@lrhn When I look at VM's double.toInt I see that it does clamping into [kMinInt64, kMaxInt64] range. However I don't see this behavior guaranteed by API doc for toInt or truncate. Is this intentional? Can we break it?

IntegerPtr DoubleToInteger(Zone* zone, double val) {
  if (isinf(val) || isnan(val)) {
    const Array& args = Array::Handle(zone, Array::New(1));
    args.SetAt(0, String::Handle(zone, String::New("Infinity or NaN toInt")));
    Exceptions::ThrowByType(Exceptions::kUnsupported, args);
  }
  int64_t ival = 0;
  if (val <= static_cast<double>(kMinInt64)) {
    ival = kMinInt64;
  } else if (val >= static_cast<double>(kMaxInt64)) {
    ival = kMaxInt64;
  } else {  // Representable in int64_t.
    ival = static_cast<int64_t>(val);
  }
  return Integer::New(ival);
}
lrhn commented 2 months ago

The behavior is documented in num.truncate, which num.toInt refers to.

If the value is greater than the highest representable positive integer, the result is that highest positive integer. If the value is smaller than the highest representable negative integer, the result is that highest negative integer.

Whether we can change it depends on what breaks if we do. I don't know what may depend on the current behavior. I know that I have written code that did, using the max-int value as a sentinel, rather than checking for large values before doing .toInt().

If we change it, what would we change it to? The most efficient change on intel platforms would be to "Whetever cvttsd2si does." (That's -2^63 if the value is not in the -2^63..(2^63)-1 range, which differs from this in allowing non-finite values and giving a different value for large negative numbers.)

I think that'd be resonable change. It's still using a specific marker value for overflow, it's just the same one for positive and negative overflow, and works for non-finites too, so fewer cases. (It's just "is unbiased exponent >= 63, then special value").

(I'm more wary about trying to "truncate to low 64 bits" since doubles are not 2's complement to begin with. I'd want the actual value if it's in the representable range, but if it's not, it's not clear what the "lower 64 bits" is of a large negative double. Or maybe it is, if we just imagine infinite-bits two's complement numbers, like Dart's integers used to be.)

Is the performance problem here the clamping, or is it calling into C code to begin with? If I change the function to:

IntegerPtr DoubleToInteger(Zone* zone, double val) {
  return Integer::New(static_cast<int64_t>(val));
}

it doesn't change the timing at all. It's not the computation that costs.

Could we have an intrinsic for the code? Or is that complicated by the complexity. (If we did a simple cvttsd2si on the value, and only called into C++ code if the result was min-int, then that should handle all the common cases. And even the overflow cases are not that bad:

I guess min-int/max-int don't fit in Smis, so we may want to have pre-allocated values for them.)

karbiv commented 2 months ago

Most of the time is actually spent in the toInt

I thought at first that toInt is used internally somehow.

Here var pix = img[i]; line is an optimization. Without pix at all this benchmark takes 69 seconds. Each added img[i] increases time linearly.

imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var pix = img[i]; // this index access is the slow part
    var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt();
    pix.r = y;
    pix.g = y;
    pix.b = y;
  }  
}

The line with var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt(); takes only ~0.5 seconds of all time. If it doesn't contain img[i] of course.

mraleph commented 2 months ago

@karbiv Ah, this seem to only happens in JIT mode, which I did not even bother to measure originally. I measured only in AOT (and on HEAD of the sdk). In AOT:

In JIT we hit an interesting issue: a field guard on _Compound._typedDataBase inhibits allocation sinking of the pointer. You can try running with --no-use-field-guards and the problem will go away. I will take a look and fix this.

karbiv commented 2 months ago

Indeed. I was of the impression that exe is just a wrapper for JIT generated code. Or supposed erroneously that dart run uses AOT target.

But still, Luajit is significantly faster even than AOT. So toInt() it is.

karbiv commented 2 months ago

Abnormal part in JIT is fixed so the issue can be deemed solved.

This microbenchmark works as a replacement for C loop and it's not an intended purpose of Dart. Difference in time for GC languages is max 35%. Acceptable. Ergonomics of FFI tools is more important, ffigen.

mraleph commented 2 months ago

@karbiv I would like to keep this open, because it allowed me to find and fix a number of code quality issues in the backend (see commits above). Some of these I have encountered before but never had time to fix.

I have few more in the pipeline,