jorisvandenbossche opened 6 months ago
Looking into the inner loop in both Arrow and numpy, they seem to be quite similar. In numpy, almost all time is spent in `_aligned_contig_cast_long_to_double`, which essentially boils down to:
```c
npy_intp N = dimensions[0];
char *src = args[0], *dst = args[1];
while (N--) {
    *(npy_double *)dst = ((npy_double)(*(npy_long *)src));
    dst += sizeof(npy_double);
    src += sizeof(npy_long);
}
```
and in Arrow, almost all time is spent in `CastPrimitive`, which essentially does:
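Reduced to its essentials, it is a templated loop roughly like this (a sketch with illustrative names, not the exact Arrow source):

```cpp
#include <cstddef>

// Sketch of the hot loop in the Arrow cast kernel: a templated
// element-by-element cast over contiguous buffers.
template <typename InT, typename OutT>
void DoStaticCast(const InT* in_values, OutT* out_values, std::size_t n) {
  while (n-- > 0) {
    *out_values++ = static_cast<OutT>(*in_values++);
  }
}
```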
Does anybody have any insight into why our templated C++ code is so much slower than numpy's C code? Logically it looks very similar, or would there be something in our code that prevents optimizations?
Running `perf` on the examples, two quick observations (without much experience with this tool and without looking into detail) that might be relevant:

- `exc_page_fault` shows up in `CastPrimitive` (a bit more than half of the time spent in `CastPrimitive` is attributed to it)
- the store (`movsd` in the assembly) is taking relatively a lot more time in the Arrow version compared to numpy

Well, this really seems to be a problem with jemalloc actually:
```python
>>> arr = pa.array([42]*1_000_000, type=pa.int64())
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.17 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
8.86 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
545 µs ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
8.16 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
802 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
8.81 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> np_arr = arr.to_numpy()
>>> %timeit np_arr.astype('float64')
514 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
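The jemalloc numbers are consistent with the `exc_page_fault` observation above: if the allocator hands back pages that are no longer resident (e.g. after returning freed memory to the OS), the cast loop pays a minor page fault per page on first touch. A standalone sketch of that first-touch cost, not using Arrow (timings are machine-dependent; the helper name is illustrative):

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Returns the time in microseconds for one full write pass over an
// n-byte buffer. The first pass over a freshly mapped allocation pays
// a minor page fault per page; later passes hit resident memory.
long TouchUs(char* buf, std::size_t n) {
  auto t0 = std::chrono::steady_clock::now();
  std::memset(buf, 1, n);
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Calling `TouchUs` twice on the same freshly allocated ~8 MB buffer (the size of 1M float64 values) typically shows the first pass noticeably slower than the second. One plausible explanation (not verified here) is that jemalloc returns freed pages to the OS more eagerly in this size range, so each cast output buffer keeps paying the first-touch cost.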
Trying to prefer mimalloc in https://github.com/apache/arrow/pull/40875
That said, even with mimalloc we're much slower than NumPy on larger arrays, so there's perhaps another issue (are we allocating a null bitmap?):
```python
>>> arr = pa.array([42]*100_000_000, type=pa.int64())
>>> np_arr = arr.to_numpy()
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
277 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit arr[:10_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
27.6 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
557 µs ± 2.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit np_arr.astype('float64')
126 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np_arr[:10_000_000].astype('float64')
12.6 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np_arr[:1_000_000].astype('float64')
466 µs ± 4.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
By the way, it seems the mimalloc/jemalloc difference really occurs only for a certain range of sizes, above and below which they seem to have similar perf:
```python
>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
77 µs ± 955 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
77.1 µs ± 696 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
149 µs ± 744 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
146 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
550 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.21 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
1.89 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
4.92 ms ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
3.76 ms ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
10.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
21.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
22 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
41.6 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
40.4 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
By the way, one can also take a look at the C++ results (with mimalloc):
```
CastInt64ToDoubleUnsafe/524288/1000   148707 ns   148677 ns   4705   items_per_second=3.52635G/s null_percent=0.1 size=524.288k
CastInt64ToDoubleUnsafe/524288/10     148746 ns   148718 ns   4601   items_per_second=3.5254G/s null_percent=10 size=524.288k
CastInt64ToDoubleUnsafe/524288/2      148716 ns   148681 ns   4709   items_per_second=3.52625G/s null_percent=50 size=524.288k
CastInt64ToDoubleUnsafe/524288/1      149002 ns   148976 ns   4699   items_per_second=3.51927G/s null_percent=100 size=524.288k
CastInt64ToDoubleUnsafe/524288/0      146288 ns   146262 ns   4783   items_per_second=3.58459G/s null_percent=0 size=524.288k
```
and compare them with a corresponding Python benchmark:
```python
>>> %timeit arr[:524288].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
153 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
and then to Numpy:
```python
>>> np_arr = arr.to_numpy()
>>> %timeit np_arr[:524288].astype('float64')
153 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
They are all very similar (~150 µs per iteration, despite different levels of internal boilerplate).
```python
import numpy as np
import pandas as pd
import pyarrow as pa

results = []
for exp in range(2, 10):
    N = 10**exp
    arr = np.random.randint(0, 100, size=N)
    res = %timeit -o arr.astype("float64")
    results.append((N, "numpy", res.average))
    arr = pa.array(arr)
    for pool in [pa.mimalloc_memory_pool(), pa.jemalloc_memory_pool(), pa.system_memory_pool()]:
        res = %timeit -o arr.cast(pa.float64(), safe=False, memory_pool=pool)
        results.append((N, f"pyarrow ({pool.backend_name})", res.average))

df = pd.DataFrame(results, columns=["N", "backend", "time"])

import seaborn.objects as so
so.Plot(df, x="N", y="time", color="backend").add(so.Line()).scale(y="log", x="log").show()
```
It's indeed a specific range of data sizes where jemalloc performs worse, and it happened to be exactly that size I was testing in the OP.
For larger data, the different allocators give the same result, but it's still around 2x slower than numpy (e.g. for the 10^9 size, 2.3 s vs 1.3 s).
Tried running smaller examples through godbolt with gcc 14.1 to see if they produced any instruction differences. I think this replicates the NumPy example:
```c
void foo() {
    char *src;
    char *dst;
    while (1) {
        *(double *)dst = ((double)(*(long *)src));
        dst += sizeof(double);
        src += sizeof(long);
    }
}
```
producing the following asm:
```asm
foo:
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        movsd   QWORD PTR [rax], xmm0
        add     QWORD PTR [rbp-16], 8
        add     QWORD PTR [rbp-8], 8
        jmp     .L2
```
and this typifies the C++ example:
```cpp
void foo() {
    using OutT = double;
    using InT = long;
    const InT* in_values;
    OutT* out_values;
    for (;;) {
        *out_values++ = static_cast<OutT>(*in_values++);
    }
}
```
producing the following asm:
```asm
foo():
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-8], rdx
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-16], rdx
        movsd   QWORD PTR [rax], xmm0
        jmp     .L2
```
Maybe the processor just handles the add instructions more effectively than the lea / mov combination in the C++ example?
FWIW, if you change the C++ code to do the pointer increments in separate statements rather than inline, the generated asm matches the C example:
```cpp
void foo() {
    using OutT = double;
    using InT = long;
    const InT* in_values;
    OutT* out_values;
    for (;;) {
        *out_values = static_cast<OutT>(*in_values);
        out_values++;
        in_values++;
    }
}
```
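For what it's worth, both godbolt listings above look like unoptimized (-O0) output, with every pointer spilled to the stack each iteration. With a bounded loop and optimizations enabled, the pointers stay in registers and the increment style should no longer matter; a sketch (illustrative name, not the Arrow kernel):

```cpp
#include <cstddef>

// Bounded version of the cast loop. At -O2 the pointers stay in
// registers (no per-iteration stack traffic); with AVX-512 available,
// gcc can vectorize the int64 -> double conversion via vcvtqq2pd.
void CastLongToDouble(const long* src, double* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<double>(src[i]);
  }
}
```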
~~Maybe there is also the possibility to use memcpy in the case where `sizeof(OutT) == sizeof(InT)` to help boost performance?~~
When casting int64 to float64?
Ah sorry - ignore that comment
A simple numeric cast in pyarrow / Arrow C++ seems to be quite a bit slower compared to numpy. Often Arrow does more work (safe casting, handling nulls, ...), but I also see this for a plain unsafe int64 -> float64 cast of a 1d array without nulls, where Arrow is almost 4-5x slower than numpy. For a simple operation like that, this seems an unexpectedly big difference.
The python code I used to compare both:
I see this both with my local development version and with released conda packages.