I can't immediately see where the slowness might come from. How many elements are you reading at once? Is the slowness dependent on the chunk size?
If you could run a flamegraph or share the dataset, that would be a great help in debugging!
The chunk size does not appear to have an impact; the times for both approaches scale linearly with it. I have reworked my code to use only the pyo3 version, but I can run some benchmarks in the coming days. I will post here again once I have a flamegraph.
So here is the flamegraph. I read the first 100k rows (~150 MB) of f16 from the file and returned them as an f32 ndarray. The pyo3 call ran first and the hdf5 call second, so if caching has any effect, hdf5 should benefit from it, not pyo3. In case you have questions regarding the benchmark code, it is included below. pyo3 took around 0.5 to 1 s on different runs (maybe 1 s on a cold start, 0.5 s when warmed up from previous runs?) and hdf5 around 5 s. As presumed, the type conversion takes up almost all of the hdf5 time.
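For context, this is roughly what the hdf5 side of the benchmark boils down to (a minimal sketch; the file path, dataset name, and row count are placeholders, and I'm assuming the slice-reading API `read_slice_2d` from hdf5 0.8):

```rust
use hdf5::File;
use ndarray::{s, Array2};

// Read the first `rows` rows of an f16 dataset as f32, letting hdf5-c
// perform the f16 -> f32 conversion internally (the suspected slow path).
fn read_rows_as_f32(path: &str, name: &str, rows: usize) -> hdf5::Result<Array2<f32>> {
    let file = File::open(path)?;
    let ds = file.dataset(name)?;
    ds.read_slice_2d::<f32, _>(s![..rows, ..])
}
```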
Are you sure the struct you request matches exactly what is stored in the h5 file with regard to endianness/order/bitwidth? There is a lot of internal conversion going on, and I'd be surprised if hdf5-c could not elide it where possible.
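In case it helps, something like this sketch should print what is actually stored (assuming the `Datatype::to_descriptor()` and `Datatype::size()` accessors; path and dataset name are placeholders):

```rust
use hdf5::File;

fn print_stored_type(path: &str, name: &str) -> hdf5::Result<()> {
    let ds = File::open(path)?.dataset(name)?;
    let dtype = ds.dtype()?;
    // An f16 dataset stores 2 bytes per element, so any read into f32
    // (4 bytes) necessarily goes through hdf5-c's type conversion.
    println!("stored size: {} bytes per element", dtype.size());
    println!("descriptor:  {:?}", dtype.to_descriptor());
    Ok(())
}
```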
h5py reports the file to be little-endian, which is also the native endianness of the system I ran the benchmark on and hence what Rust uses, so the endianness should not need to be converted. And even if an endianness conversion were necessary, wouldn't the Python code have the same issue, given that the result is returned in the same primitive type anyway?
For order/bitwidth I would not immediately know how to test those things; either way, the resulting type for the user is `ndarray::Array2<f32>`. If there are some magic buttons to press on the backend side of things, h5py seems to hit the right ones. Are there optional flags or other functions I might have missed for reading the data faster with hdf5?
So the issue seems to be the conversion from f16 to f32, which is a slow path in hdf5-c. The numpy conversion is a lot faster here since f16 acts more like a native type there. It would be great to file an issue upstream for this, as this should be a common conversion.
For reading such a format with this crate I do not have any quick suggestions. It might be useful to add a user-defined filter for this conversion where you plug in the quicker conversion algorithm, but that means fiddling with some C code. Alternatively, we could add an opaque type descriptor which reads bytes and leaves the interpretation to the user, as you have suggested.
It seems like you're comparing apples and oranges 🙂
So basically, you're comparing the conversion capabilities of numpy and raw hdf5 (the former being much faster, unsurprisingly); the Rust crate doesn't have much to do with it.
There's `half::f16`, which is probably the most sensible option should we want to support f16 floats. That's what `rust-numpy` apparently uses for binding f16 as well.
Out of curiosity, I've just tested the performance of `half::f16::to_f32()` versus numpy's `.astype(np.float32)` for f16 arrays.
For 100M elements (~190 MB):
Rust version:
```rust
use criterion::{black_box as bb, criterion_group, criterion_main, Criterion};
use half::f16;

// Fill a Vec with random f16 bit patterns.
fn build_f16_vec(len: usize) -> Vec<f16> {
    (0..len)
        .map(|_| f16::from_bits(fastrand::u16(0..=u16::MAX)))
        .collect()
}

fn criterion_benchmark(c: &mut Criterion) {
    let len = 100_000_000;
    let v = bb(build_f16_vec(len));
    c.bench_function(&format!("convert_f32_f64_{len}"), |b| {
        b.iter(|| {
            let v = v.iter().map(|&x| x.to_f32()).collect::<Vec<_>>();
            bb(v)
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```
```
convert_f32_f64_100000000
time: [167.00 ms 167.54 ms 168.12 ms]
```
And the numpy version:
```python
>>> import numpy as np
>>> arr = (np.arange(0, 100_000_000) % (1 << 16)).astype(np.uint16).view(np.float16)
>>> %timeit arr.astype(np.float32)
164 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
So, performance is almost identical.
Actually, the data was not the same :) To make the tests completely identical data-wise:
```rust
fn build_f16_vec(len: usize) -> Vec<f16> {
    (0..=u16::MAX)
        .cycle()
        .take(len)
        .map(f16::from_bits)
        .collect()
}
```
which yields
```
convert_f32_f64_100000000
time: [129.84 ms 130.33 ms 130.87 ms]
```
which is about 20% faster than numpy, if I'm reading it right.
So I think it's just a matter of us supporting `half::f16` as an optional feature and adding float16 <-> `half::f16` bindings, as long as `ndarray::Array` works okay with that.
Then you could just do `arr.mapv(|x| x.to_f32())`, which should be a bit faster than the numpy path (but will consume 2x RAM at peak, unlike hdf5's conversion path, which is slower but does not use extra memory).
As far as I can see, `ndarray::Array` works perfectly fine with `half::f16`.
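For instance, this small sketch (just `half` and `ndarray` as dependencies) builds an `Array2<f16>` and converts it element-wise:

```rust
use half::f16;
use ndarray::{arr2, Array2};

fn main() {
    // A tiny f16 matrix, converted element-wise to f32 via `mapv`.
    let a: Array2<f16> = arr2(&[
        [f16::from_f32(1.0), f16::from_f32(2.5)],
        [f16::from_f32(-3.0), f16::from_f32(0.125)],
    ]);
    let b: Array2<f32> = a.mapv(f16::to_f32);
    assert_eq!(b[[1, 1]], 0.125);
}
```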
Sorry for not getting back to this, the last two weeks were busy. I guess opening an upstream issue is no longer relevant if bypassing the hdf5 conversion becomes an option?
I'll take a quick look at whether we can hack in f16 on our side.
@eth42 float16 support is now in and can be enabled via the "f16" feature (#246).
So, in your use case above, you can now just do
```rust
use half::f16;

let arr_f32 = ds.read_1d::<f16>()?.mapv(f16::to_f32);
```
and it should be no slower (probably a bit faster) than the h5py path.
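For the 2-D slice case from the original post, the same idea would presumably look like this (a sketch; path, dataset name, and row count are placeholders):

```rust
use half::f16;
use hdf5::File;
use ndarray::{s, Array2};

// Read the stored f16 values directly (bypassing hdf5-c's conversion)
// and convert to f32 on the Rust side with `half`.
fn load_rows_as_f32(path: &str, name: &str, rows: usize) -> hdf5::Result<Array2<f32>> {
    let ds = File::open(path)?.dataset(name)?;
    Ok(ds.read_slice_2d::<f16, _>(s![..rows, ..])?.mapv(f16::to_f32))
}
```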
For my use case (a large 2-D table in an h5 file, stored as f16 and cast to f32 for use in Rust), the `hdf5` package is significantly slower than taking a detour through Python via `pyo3`. In Python I use the `h5py` package, cast to f32 using numpy functions, and transfer the resulting array back into Rust. Using only Rust code (`hdf5`) takes up to 4.5 times as long as taking this detour. I assume the issue is somewhere in the code that automatically casts values from one type to the other, since there are quite a few safety checks that could slow down the process, but I have not yet tested whether the issue appears with native Rust types (e.g. reading raw f32 data as f32). I could imagine one of the following solutions:

- unchecked access functions akin to `get_unchecked` for `std::vec::Vec` or `uget` for `ndarray::ArrayBase`
- avoiding a per-element `match` statement when resolving the type conversion

All of that is based on the premise that casting truly is the issue, which I did not verify, but the rest should be "just load the data from disk", so there is not much potential for slow-downs.
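To make the "casting is the bottleneck" hypothesis concrete, here is a purely illustrative sketch (not the crate's actual code) of the kind of tight conversion loop I would hope the f16 read could boil down to, using the `half` crate:

```rust
use half::f16;

// Illustrative only: convert raw f16 bit patterns to f32 without any
// per-element type dispatch or bounds checks in the hot loop.
fn convert_raw_f16(bits: &[u16]) -> Vec<f32> {
    bits.iter().map(|&b| f16::from_bits(b).to_f32()).collect()
}
```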
Below are two code snippets for loading a slice of a large table via `pyo3` + `h5py` and `hdf5`, respectively.

**pyo3 + h5py**
```rust
// Trait to access the numpy name of rust primitives
pub trait NumpyEquivalent: numpy::Element {
    fn numpy_name() -> &'static str;
}

macro_rules! make_numpy_equivalent {
    ($(($rust_types: ty, $numpy_names: literal)),*) => {
        $(make_numpy_equivalent!($rust_types, $numpy_names);)*
    };
    ($rust_type: ty, $numpy_name: literal) => {
        impl NumpyEquivalent for $rust_type {
            fn numpy_name() -> &'static str {
                $numpy_name
            }
        }
    };
}

make_numpy_equivalent!((f32, "float32"), (f64, "float64"), (bool, "bool_"), (u8, "uint8"), (u16, "uint16"), (u32, "uint32"), (u64, "uint64"));

// Function to access a slice of a h5 file in a desired primitive type
fn get_rows_slice_pyo3
```

**hdf5**
```rust
fn get_rows_slice_hdf5
```

**Necessary Cargo.toml lines**
```toml
[dependencies]
ndarray = "0.15.6"
hdf5 = "0.8.1"
pyo3 = {version="0.18.3", features=["auto-initialize"]}
```

**Software versions**
- hdf5: 0.8.1
- libhdf5-dev: 1.10.8
- pyo3: 0.18.3
- ndarray: 0.15.6
- python: 3.11.2
- numpy: 1.24.3
- h5py: 3.8.0