apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Performance of numeric casts #40874

Open jorisvandenbossche opened 6 months ago

jorisvandenbossche commented 6 months ago

A simple numeric cast in pyarrow / Arrow C++ seems to be quite a bit slower compared to numpy. Often Arrow does more work (safe casting, handling nulls, ..), but I also see this for a plain unsafe int64 -> float64 cast of a 1D array without nulls, where Arrow is almost 4-5x slower than numpy. For such a simple operation, this seems like an unexpectedly big difference.

The python code I used to compare both:

import numpy as np
import pyarrow as pa

nrows = 10**5
arr_np = np.random.randn(nrows*20).astype("int64")
arr = pa.array(arr_np)

%timeit arr_np.astype("float64")
# 1.1 ms ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit arr.cast(pa.float64(), safe=False)
# 4.7 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I see this both with my local development version and with the released conda packages.

jorisvandenbossche commented 6 months ago

Looking at the inner loop in both Arrow and numpy, they seem quite similar. In numpy, almost all time is spent in _aligned_contig_cast_long_to_double, which essentially boils down to:

    npy_intp N = dimensions[0];
    char *src = args[0], *dst = args[1];

    while (N--) {
        *(npy_double *)dst = ((npy_double)(*(npy_long *)src));
        dst += sizeof(npy_double);
        src += sizeof(npy_long);
    }

and in Arrow, almost all time is spent in CastPrimitive, which essentially does:

https://github.com/apache/arrow/blob/cf832b8b5dd91ca1b70519fa544f0a44ebdb3bce/cpp/src/arrow/compute/kernels/scalar_cast_internal.cc#L40-L46

Does anybody have any insight into why our templated C++ code is so much slower than numpy's C code? Logically it looks very similar, or is there something in our code that prevents optimizations?

Running perf on both examples, I have two quick observations (without much experience with this tool and without looking into detail) that might be relevant:

pitrou commented 6 months ago

Well, this really seems to be a problem with jemalloc actually:

>>> arr = pa.array([42]*1_000_000, type=pa.int64())

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.17 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
8.86 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
545 µs ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
8.16 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
802 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
8.81 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> np_arr = arr.to_numpy()
>>> %timeit np_arr.astype('float64')
514 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
pitrou commented 6 months ago

Trying to prefer mimalloc in https://github.com/apache/arrow/pull/40875
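
(As a side note for anyone hitting this before that PR lands: the allocator can also be chosen explicitly from Python. A minimal sketch, assuming a pyarrow build that ships both allocators:)

import pyarrow as pa

# Pass the pool per call, as in the timings above ...
pool = pa.mimalloc_memory_pool()
arr = pa.array([42] * 1_000_000, type=pa.int64())
arr.cast(pa.float64(), safe=False, memory_pool=pool)

# ... or make it the process-wide default for subsequent allocations.
pa.set_memory_pool(pool)
print(pa.default_memory_pool().backend_name)  # "mimalloc"

The ARROW_DEFAULT_MEMORY_POOL environment variable (read when the library is loaded) offers the same choice without code changes.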

pitrou commented 6 months ago

That said, even with mimalloc we're much slower than Numpy on larger arrays, so there's perhaps another issue (are we allocating a null bitmap? a quick way to check is sketched after the timings below):

>>> arr = pa.array([42]*100_000_000, type=pa.int64())
>>> np_arr = arr.to_numpy()

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
277 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit arr[:10_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
27.6 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
557 µs ± 2.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

>>> %timeit np_arr.astype('float64')
126 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np_arr[:10_000_000].astype('float64')
12.6 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np_arr[:1_000_000].astype('float64')
466 µs ± 4.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
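
(On the null-bitmap question above: one way to check from Python is to inspect the validity buffer of the cast result. A quick sketch, not from the original discussion:)

import pyarrow as pa

arr = pa.array([42] * 1_000_000, type=pa.int64())
out = arr.cast(pa.float64(), safe=False)

# buffers()[0] is the validity bitmap; None means no bitmap was allocated.
print(out.buffers()[0])
print(out.null_count)  # 0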
pitrou commented 6 months ago

By the way, it seems the mimalloc/jemalloc difference really only occurs for a certain range of sizes, above and below which they seem to have similar perf:

>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
77 µs ± 955 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
77.1 µs ± 696 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
149 µs ± 744 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
146 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
550 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.21 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
1.89 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
4.92 ms ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
3.76 ms ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
10.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
21.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
22 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
41.6 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
40.4 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
pitrou commented 6 months ago

By the way, one can also take a look at the C++ results (with mimalloc):

CastInt64ToDoubleUnsafe/524288/1000     148707 ns       148677 ns         4705 items_per_second=3.52635G/s null_percent=0.1 size=524.288k
CastInt64ToDoubleUnsafe/524288/10       148746 ns       148718 ns         4601 items_per_second=3.5254G/s null_percent=10 size=524.288k
CastInt64ToDoubleUnsafe/524288/2        148716 ns       148681 ns         4709 items_per_second=3.52625G/s null_percent=50 size=524.288k
CastInt64ToDoubleUnsafe/524288/1        149002 ns       148976 ns         4699 items_per_second=3.51927G/s null_percent=100 size=524.288k
CastInt64ToDoubleUnsafe/524288/0        146288 ns       146262 ns         4783 items_per_second=3.58459G/s null_percent=0 size=524.288k

and compare them with a corresponding Python benchmark:

>>> %timeit arr[:524288].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
153 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

and then to Numpy:

>>> np_arr = arr.to_numpy()
>>> %timeit np_arr[:524288].astype('float64')
153 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

They are all very similar (~150 µs per iteration, despite different levels of internal boilerplate).

jorisvandenbossche commented 6 months ago
import numpy as np
import pandas as pd
import pyarrow as pa

results = []
for exp in range(2, 10):  # array sizes 10**2 up to 10**9
    N = 10**exp
    arr = np.random.randint(0, 100, size=N)
    res = %timeit -o arr.astype("float64")
    results.append((N, "numpy", res.average))
    arr = pa.array(arr)
    for pool in [pa.mimalloc_memory_pool(), pa.jemalloc_memory_pool(), pa.system_memory_pool()]:
        res = %timeit -o arr.cast(pa.float64(), safe=False, memory_pool=pool)
        results.append((N, f"pyarrow ({pool.backend_name})", res.average))

df = pd.DataFrame(results, columns=["N", "backend", "time"])

import seaborn.objects as so
so.Plot(df, x="N", y="time", color="backend").add(so.Line()).scale(y="log", x="log").show()

[Figure: cast time vs. array size (log-log) for numpy and for pyarrow with the mimalloc, jemalloc and system memory pools]

It's indeed a specific range of data sizes where jemalloc performs worse, and that happened to be exactly the size I was testing in the OP.

For larger data, the different allocators give the same result, but it's still around 2x slower than numpy (e.g. for the 10^9 size, 2.3 s vs 1.3 s).

WillAyd commented 4 months ago

Tried running smaller examples through godbolt with gcc 14.1 to see if they produced any instruction differences. I think this replicates the NumPy example:

void foo() {
    char *src;
    char *dst;
    while (1) {
        *(double *)dst = ((double)(*(long *)src));
        dst += sizeof(double);
        src += sizeof(long);
    }
}

producing the following asm:

foo:
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        movsd   QWORD PTR [rax], xmm0
        add     QWORD PTR [rbp-16], 8
        add     QWORD PTR [rbp-8], 8
        jmp     .L2

and this typifies the C++ example:

void foo() {
    using OutT = double; 
    using InT = long; 
    const InT* in_values; 
    OutT* out_values; 
    for (;;) { 
        *out_values++ = static_cast<OutT>(*in_values++); 
    } 
 }

producing the following asm:

foo():
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-8], rdx
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-16], rdx
        movsd   QWORD PTR [rax], xmm0
        jmp     .L2

Maybe the processor just handles the add instructions more effectively than the lea / mov combination in the C++ example?

FWIW, if you change the C++ code to do the pointer increments in separate statements rather than inline, the generated asm matches the C example:

void foo() {
    using OutT = double; 
    using InT = long; 
    const InT* in_values; 
    OutT* out_values; 
    for (;;) { 
        *out_values = static_cast<OutT>(*in_values); 
        out_values++;
        in_values++;
    } 
 }
WillAyd commented 4 months ago

~Maybe there is also the possibility to use memcpy in the case where sizeof(OutT) == sizeof(InT) to help boost performance?~

pitrou commented 4 months ago

Maybe there is also the possibility to use memcpy in the case where sizeof(OutT) == sizeof(InT) to help boost performance?

When casting int64 to float64?
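
(In other words, int64 and float64 have the same width but different bit representations, so a raw byte copy would not preserve the values. A quick numpy illustration, not from the original discussion:)

import numpy as np

a = np.array([42], dtype="int64")
print(a.astype("float64"))  # value-preserving cast -> [42.]
print(a.view("float64"))    # bit reinterpretation of the same 8 bytes -> a tiny denormal, not 42.0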

WillAyd commented 4 months ago

Ah sorry - ignore that comment