karbiv opened this issue 2 months ago
Summary: The user reports slow performance of Dart FFI code in a microbenchmark that converts an image to grayscale. The benchmark takes 13.6 seconds in Dart, significantly slower than other languages such as LuaJIT, Nim, and Go. The issue appears to be related to indexing of Pointer<RGBA> objects: removing the intermediate pix variable and accessing img[i] directly results in even slower performance (69 seconds).
A similar issue was closed 4 years ago: https://github.com/dart-lang/sdk/issues/43142
A recent commit with optimizations seems not directly related and doesn't solve it: https://github.com/dart-lang/sdk/commit/af82cadcec3d2b5a487b6b09244c6afe481f8519
@karbiv can you provide a Go or LuaJIT program for comparison?
Most of the time is actually spent in the toInt (which for some legacy reasons produces a boxed integer?) and not in the FFI:
;; v88 <- DoubleToInteger:10(v22 T{_Double}) [-9223372036854775808, 9223372036854775807] T{int}
0x7f0d7610a1fa f2480f2cd9 cvttsd2siq rbx,xmm1
0x7f0d7610a1ff 4889df movq rdi,rbx
0x7f0d7610a202 48d1e7 shlq rdi,1
0x7f0d7610a205 0f803c000000 jo 0x00007f0d7610a247
0x7f0d7610a20b 4803db addq rbx,rbx
;; ParallelMove rbx <- rbx
;; v232 <- UnboxInt64([non-speculative], v88 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x7f0d7610a20e 48d1fb sarq rbx,1
0x7f0d7610a211 7305 jnc 0x00007f0d7610a218
0x7f0d7610a213 488b5c1b08 movq rbx,[rbx+rbx*1+0x8]
2.16 │ cvtsi2sd %rbx,%xmm0
2.26 │ movsd 0x5507(%r15),%xmm2
2.49 │ mulsd %xmm0,%xmm2
0.08 │ addsd %xmm2,%xmm1
10.67 │ cvttsd2si %xmm1,%rbx
0.36 │ mov %rbx,%rdi
10.29 │ shl $1,%rdi
10.49 │ ↓ jo c7
0.54 │ add %rbx,%rbx
1.47 │8e: sar $1,%rbx
2.97 │ ↓ jae 98
│ mov 0x8(%rbx,%rbx,1),%rbx
6.26 │98: mov %rbx,%rax
can you provide a Go or LuaJIT program for comparison?
LuaJIT:
local ffi = require("ffi")

ffi.cdef [[
typedef struct { uint8_t red, green, blue, alpha; } rgba_pixel;
]]

local function image_ramp_green(n)
  local img = ffi.new("rgba_pixel[?]", n)
  local f = 255/(n - 1)
  for i = 0, n - 1 do
    img[i].green = i * f
    img[i].alpha = 255
  end
  return img
end

local function image_to_grey(img, n)
  for i = 0, n - 1 do
    local pix = img[i]
    local y = 0.3 * pix.red + 0.59 * pix.green + 0.11 * pix.blue
    pix.red = y
    pix.green = y
    pix.blue = y
  end
end

local function main()
  local start_time = os.clock()
  local N = 1920 * 1080
  local img = image_ramp_green(N)
  for _ = 1, 1000 do
    image_to_grey(img, N)
  end
  local elapsed_time = os.clock() - start_time
  print(elapsed_time)
end

main()
The Go version doesn't use FFI because Go has a native uint8 type.
package main

import (
    "fmt"
    "time"
)

type rgba_pixel struct {
    red, green, blue, alpha uint8
}

func image_to_grey(img []rgba_pixel, N int) {
    for i := range N {
        y := uint8(0.3*float32(img[i].red) + 0.59*float32(img[i].green) + 0.11*float32(img[i].blue))
        img[i].red = y
        img[i].green = y
        img[i].blue = y
    }
}

func image_ramp_green(N int) []rgba_pixel {
    img := make([]rgba_pixel, N)
    f := float32(255) / float32(N-1)
    for i := range N {
        img[i].green = uint8(float32(i) * f)
        img[i].alpha = 255
    }
    return img
}

func main() {
    start := time.Now()
    N := 1920 * 1080
    img := image_ramp_green(N)
    for range 1000 {
        image_to_grey(img, N)
    }
    took := time.Since(start)
    fmt.Printf("%d ms.\n", took.Nanoseconds()/1e6)
}
@lrhn When I look at the VM's double.toInt, I see that it clamps into the [kMinInt64, kMaxInt64] range. However, I don't see this behavior guaranteed by the API docs for toInt or truncate. Is this intentional? Can we break it?
IntegerPtr DoubleToInteger(Zone* zone, double val) {
  if (isinf(val) || isnan(val)) {
    const Array& args = Array::Handle(zone, Array::New(1));
    args.SetAt(0, String::Handle(zone, String::New("Infinity or NaN toInt")));
    Exceptions::ThrowByType(Exceptions::kUnsupported, args);
  }
  int64_t ival = 0;
  if (val <= static_cast<double>(kMinInt64)) {
    ival = kMinInt64;
  } else if (val >= static_cast<double>(kMaxInt64)) {
    ival = kMaxInt64;
  } else {  // Representable in int64_t.
    ival = static_cast<int64_t>(val);
  }
  return Integer::New(ival);
}
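For reference, this clamping is observable from Dart itself on the native VM. A minimal illustration, with the expected values taken from the kMinInt64/kMaxInt64 branches above:

void main() {
  const big = 1e300;
  print(big.toInt());    // 9223372036854775807, i.e. kMaxInt64
  print((-big).toInt()); // -9223372036854775808, i.e. kMinInt64
  // double.infinity.toInt() and double.nan.toInt() throw UnsupportedError,
  // matching the isinf/isnan branch above.
}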
The behavior is documented in num.truncate, which num.toInt refers to:

If the value is greater than the highest representable positive integer, the result is that highest positive integer. If the value is smaller than the highest representable negative integer, the result is that highest negative integer.

Whether we can change it depends on what breaks if we do. I don't know what may depend on the current behavior. I know that I have written code that did, using the max-int value as a sentinel rather than checking for large values before doing .toInt().
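A hypothetical sketch of that kind of dependency (not the actual code being referred to), where the clamped maximum acts as an overflow sentinel after the conversion instead of range-checking the double first:

// Hypothetical: relies on the documented clamping of toInt().
const int overflowSentinel = 9223372036854775807; // the clamp value, 2^63 - 1

int? toIntOrNull(double value) {
  final result = value.toInt();
  // Large positive doubles clamp to overflowSentinel, so this comparison
  // stands in for an explicit range check on the double beforehand.
  return result == overflowSentinel ? null : result;
}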
If we change it, what would we change it to?
The most efficient change on Intel platforms would be "whatever cvttsd2si does."
(That's -2^63 if the value is not in the -2^63..(2^63)-1 range, which differs from the current behavior in allowing non-finite values and in giving a different value for large positive numbers.)
I think that would be a reasonable change. It's still using a specific marker value for overflow, it's just the same one for positive and negative overflow, and it works for non-finites too, so there are fewer cases. (It's just "if the unbiased exponent is >= 63, then the special value".)
(I'm more wary about trying to "truncate to the low 64 bits", since doubles are not two's complement to begin with. I'd want the actual value if it's in the representable range, but if it's not, it's not clear what the "lower 64 bits" of a large negative double would be. Or maybe it is clear, if we just imagine infinite-bit two's complement numbers, like Dart's integers used to be.)
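A rough Dart sketch of the proposed semantics, purely as an illustration of the idea (the names are made up, and this is not existing or planned behavior): every non-finite or out-of-range input maps to the single marker value -2^63, which is what cvttsd2si produces.

// Sketch of the proposed behavior; names here are made up for illustration.
const int overflowMarker = -9223372036854775808; // -2^63

int proposedToInt(double value) {
  if (value.isNaN ||
      value >= 9223372036854775808.0 || // >= 2^63
      value < -9223372036854775808.0) { // < -2^63
    return overflowMarker;
  }
  return value.toInt(); // in range: plain truncation, same as today
}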
Is the performance problem here the clamping, or is it calling into C code to begin with? If I change the function to:
IntegerPtr DoubleToInteger(Zone* zone, double val) {
  return Integer::New(static_cast<int64_t>(val));
}
it doesn't change the timing at all, so it's not the computation that is costly.
Could we have an intrinsic for this code, or is that impractical given the complexity?
(If we did a simple cvttsd2si on the value, and only called into C++ code if the result was min-int, then that should handle all the common cases. And even the overflow cases are not that bad: I guess min-int/max-int don't fit in Smis, so we may want to have pre-allocated values for them.)
Most of the time is actually spent in the toInt

I thought at first that toInt is used internally somehow.
Here the var pix = img[i]; line is an optimization. Without pix at all, this benchmark takes 69 seconds; each added img[i] increases the time linearly.
imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var pix = img[i]; // this index access is the slow part
    var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt();
    pix.r = y;
    pix.g = y;
    pix.b = y;
  }
}
The line var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt(); takes only ~0.5 seconds of the total time, provided it doesn't contain img[i], of course.
@karbiv Ah, this seems to only happen in JIT mode, which I did not even bother to measure originally. I measured only in AOT (and on HEAD of the SDK). In AOT, pix is used to cache img[i], and the remaining time is dominated by toInt.
In JIT we hit an interesting issue: a field guard on _Compound._typedDataBase inhibits allocation sinking of the pointer. You can try running with --no-use-field-guards and the problem will go away. I will take a look and fix this.
Indeed. I was under the impression that the exe is just a wrapper for JIT-generated code, or erroneously assumed that dart run uses an AOT target.
But still, LuaJIT is significantly faster even than AOT. So toInt() it is.
The abnormal part in JIT is fixed, so the issue can be deemed solved.
This microbenchmark uses Dart as a replacement for a C loop, which is not Dart's intended purpose. The time difference compared with other GC languages is at most 35%, which is acceptable. The ergonomics of the FFI tooling (ffigen) matter more.
@karbiv I would like to keep this open, because it allowed me to find and fix a number of code quality issues in the backend (see commits above). Some of these I have encountered before but never had time to fix. I have a few more in the pipeline:
- C * x vs x * C (regalloc refuses to keep C in a register across the loop if it occurs on the left-hand side; see the sketch after this list). Should be fixed by https://dart-review.googlesource.com/c/sdk/+/385001
- img[i].r does multiple arithmetic instructions to compute the img[i] address and then does a load (on Intel it should be possible to fuse these arithmetic operations into the addressing mode).
- The loop induction variable could be changed from i to the img[i] address instead.
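A minimal Dart sketch of the C * x vs x * C pattern from the first item (hypothetical code, not taken from the benchmark): both loops compute the same reduction and differ only in which side of the multiplication the constant sits on.

// Hypothetical illustration: the same reduction with the constant on the
// left (C * x) and on the right (x * C) of the multiplication.
double sumLeftConstant(List<double> xs) {
  var sum = 0.0;
  for (var i = 0; i < xs.length; i++) {
    sum += 0.59 * xs[i]; // constant on the left-hand side
  }
  return sum;
}

double sumRightConstant(List<double> xs) {
  var sum = 0.0;
  for (var i = 0; i < xs.length; i++) {
    sum += xs[i] * 0.59; // constant on the right-hand side
  }
  return sum;
}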
dart --version
Dart SDK version: 3.6.0-edge.0ccb9e06adb79f25a396bd902fee149a289db35d (main) (Wed Sep 11 11:14:20 2024 +0000) on "linux_x64"
Microbenchmark:
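A reconstruction of the benchmark, assembled from the snippets quoted in this thread (the RGBA field names r/g/b/a, the calloc allocator from package:ffi, and the imageRampGreen helper are assumptions, so details may differ from the original program):

import 'dart:ffi';

import 'package:ffi/ffi.dart';

final class RGBA extends Struct {
  @Uint8()
  external int r;
  @Uint8()
  external int g;
  @Uint8()
  external int b;
  @Uint8()
  external int a;
}

const N = 1920 * 1080;

// Fills the image with a green ramp, mirroring the LuaJIT/Go versions above.
void imageRampGreen(Pointer<RGBA> img) {
  final f = 255 / (N - 1);
  for (var i = 0; i < N; i++) {
    final pix = img[i];
    pix.g = (i * f).toInt();
    pix.a = 255;
  }
}

void imageToGrey(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    var pix = img[i]; // caches the struct view for the three field writes
    var y = (0.3 * pix.r + 0.59 * pix.g + 0.11 * pix.b).toInt();
    pix.r = y;
    pix.g = y;
    pix.b = y;
  }
}

void main() {
  final img = calloc<RGBA>(N);
  final sw = Stopwatch()..start();
  imageRampGreen(img);
  for (var i = 0; i < 1000; i++) {
    imageToGrey(img);
  }
  print('${sw.elapsedMilliseconds} ms');
  calloc.free(img);
}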
It takes 13.6 seconds to run, while other languages like LuaJIT, Nim, and Go take no more than 3 seconds.
If only img[i] is used, without the pix variable, it takes 69 seconds; each indexing adds its own cost:
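A minimal sketch of that variant, reusing the RGBA struct and N from the reconstruction above (the NoCache name is made up for illustration):

// Variant without the pix cache: img[i] is re-evaluated for every field
// access, which is what the report ties to the ~69 second run time.
void imageToGreyNoCache(Pointer<RGBA> img) {
  for (var i = 0; i < N; i++) {
    final y = (0.3 * img[i].r + 0.59 * img[i].g + 0.11 * img[i].b).toInt();
    img[i].r = y;
    img[i].g = y;
    img[i].b = y;
  }
}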