aldanor / hdf5-rust

HDF5 for Rust
https://docs.rs/hdf5
Apache License 2.0

Slow performance compared to h5py #236

Closed eth42 closed 1 year ago

eth42 commented 1 year ago

For my use case (a large 2-D table in an h5 file, stored as f16 and cast to f32 for use in Rust), the hdf5 crate is significantly slower than taking a detour through Python via pyo3. In Python I use the h5py package, cast to f32 using numpy functions, and transfer the resulting array back into Rust. Using only Rust code (hdf5) takes up to 4.5 times as long as taking this detour. I assume that the issue is somewhere in the code that automatically casts values from one type to another, since there are quite a few safety checks that could slow down the process, but I have not yet tested whether the issue appears with native Rust types (e.g. reading raw f32 data as f32). I could imagine a few possible solutions, e.g. exposing an opaque type that leaves the byte interpretation to the user.

All of that is based on the premise that casting truly is the issue, which I have not verified, but the rest should just be "load the data from disk", so there is not much potential for slow-downs.

Below are two code snippets for loading a slice of a large table via pyo3+h5py and hdf5, respectively.

pyo3+h5py

```rust
// Trait to access the numpy name of rust primitives
pub trait NumpyEquivalent: numpy::Element {
    fn numpy_name() -> &'static str;
}

macro_rules! make_numpy_equivalent {
    ($(($rust_types: ty, $numpy_names: literal)),*) => {
        $(make_numpy_equivalent!($rust_types, $numpy_names);)*
    };
    ($rust_type: ty, $numpy_name: literal) => {
        impl NumpyEquivalent for $rust_type {
            fn numpy_name() -> &'static str { $numpy_name }
        }
    };
}

make_numpy_equivalent!((f32, "float32"), (f64, "float64"), (bool, "bool_"),
    (u8, "uint8"), (u16, "uint16"), (u32, "uint32"), (u64, "uint64"));

// Function to access a slice of a h5 file in a desired primitive type
fn get_rows_slice_pyo3<T: NumpyEquivalent>(
    file: &str,
    dataset: &str,
    i_row_from: usize,
    i_row_to: usize,
) -> Result<ndarray::Array2<T>, pyo3::PyErr> {
    pyo3::Python::with_gil(|py| {
        let locals = pyo3::types::PyDict::new(py);
        locals.set_item("h5py", py.import("h5py")?)?;
        locals.set_item("np", py.import("numpy")?)?;
        locals.set_item("data", py.eval(
            format!("h5py.File(\"{:}\")[\"{:}\"]", file, dataset).as_str(),
            None, Some(&locals),
        )?)?;
        locals.set_item("start", i_row_from)?;
        locals.set_item("end", i_row_to)?;
        let row_obj = py.eval(
            format!("data[start:end].astype(np.{:})", T::numpy_name()).as_str(),
            None, Some(&locals),
        )?;
        let row: &numpy::PyArray2<T> = row_obj.downcast()?;
        Ok(row.to_owned_array())
    })
}
```
hdf5

```rust
fn get_rows_slice_hdf5<T: hdf5::H5Type>(
    file: &str,
    dataset: &str,
    i_row_from: usize,
    i_row_to: usize,
) -> Result<ndarray::Array2<T>, hdf5::Error> {
    let file = hdf5::File::open(file)?;
    let data = file.dataset(dataset)?;
    data.read_slice_2d(ndarray::s![i_row_from..i_row_to, ..])
}
```
Necessary Cargo.toml lines

```toml
[dependencies]
ndarray = "0.15.6"
hdf5 = "0.8.1"
pyo3 = { version = "0.18.3", features = ["auto-initialize"] }
```
Software versions

- hdf5: 0.8.1
- libhdf5-dev: 1.10.8
- pyo3: 0.18.3
- ndarray: 0.15.6
- python: 3.11.2
- numpy: 1.24.3
- h5py: 3.8.0
mulimoen commented 1 year ago

I can't immediately see where the slowness might come from. How many elements are you reading at once? Is the slowness dependent on the chunk size?

If you could run a flamegraph or share the dataset, that would be a great help in debugging!
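For reference, checking the chunk layout from Rust could look something like this (a minimal sketch, assuming the hdf5 crate's `Dataset::chunk` accessor, which returns the chunk shape for chunked datasets and `None` for contiguous ones):

```rust
// Sketch: print whether a dataset is chunked, and its chunk shape if so.
fn print_chunk_info(path: &str, name: &str) -> Result<(), hdf5::Error> {
    let file = hdf5::File::open(path)?;
    let ds = file.dataset(name)?;
    match ds.chunk() {
        Some(shape) => println!("chunked, chunk shape = {:?}", shape),
        None => println!("contiguous (not chunked)"),
    }
    Ok(())
}
```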

eth42 commented 1 year ago

The chunk size does not appear to have an impact; the times for both approaches scale linearly with the chunk size. I reworked my code to solely use the pyo3 version, but I can run some benchmarks in the coming days. Will post here again once I have a flamegraph.

eth42 commented 1 year ago

So here is a flamegraph. I read the first 100k rows (~150 MB) of f16 from the file and returned them as an f32 ndarray. The call to pyo3 ran first and the call to hdf5 second, so if caching has any effect, hdf5 should benefit from it, not pyo3. In case you have questions regarding the benchmark code, it is included below. pyo3 took around 0.5 to 1 s on different runs (maybe 1 s on a cold start, 0.5 s when warmed up from previous runs?) and hdf5 around 5 s. As presumed, the type conversion takes up almost all of the hdf5 time.

Benchmark code

```rust
fn main() {
    let file = "..."; /* Placeholder for real path */
    let dataset = "..."; /* Placeholder for real name */

    let start = std::time::Instant::now();
    let _ = get_rows_slice_pyo3::<f32>(file, dataset, 0, 100_000);
    let end = std::time::Instant::now();
    let pyo3_time = end.duration_since(start).as_millis();

    let start = std::time::Instant::now();
    let _ = get_rows_slice_hdf5::<f32>(file, dataset, 0, 100_000);
    let end = std::time::Instant::now();
    let hdf5_time = end.duration_since(start).as_millis();

    println!("pyo3: {:>6}ms", pyo3_time);
    println!("hdf5: {:>6}ms", hdf5_time);
}
```
Flamegraph ![benchmark](https://github.com/aldanor/hdf5-rust/assets/9088734/ea049184-ea37-4677-9231-094555a199a6)
mulimoen commented 1 year ago

Are you sure the type you request matches exactly what is stored in the h5 file with regard to endianness/order/bitwidth? There is a lot of internal conversion going on, and I'd be surprised if hdf5-c could not elide those conversions where possible.

eth42 commented 1 year ago

h5py reports the file to be little-endian, which matches the native endianness of the system I ran the benchmark on (and which Rust uses), so no endianness conversion should be needed. And even if an endianness conversion were necessary, wouldn't the Python code have the same issue, since the result is given in the same primitive type anyway? For order/bitwidth I would not immediately know how to test those things, but again, the resulting type for the user is ndarray::Array2<f32>. If there are magic buttons to press on the backend of things, h5py seems to hit the right ones. Are there optional flags or other functions I might have missed that would read the data faster with hdf5?
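One way to inspect what is actually stored, from Rust (a sketch, assuming the hdf5 crate's `Dataset::dtype` and `Datatype::to_descriptor`, which report the on-disk datatype):

```rust
// Sketch: print the dataset's stored datatype, e.g. Float(U4) for f32.
fn print_stored_type(path: &str, name: &str) -> Result<(), hdf5::Error> {
    let file = hdf5::File::open(path)?;
    let ds = file.dataset(name)?;
    let desc = ds.dtype()?.to_descriptor()?;
    println!("stored type: {:?} ({} bytes per element)", desc, desc.size());
    Ok(())
}
```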

mulimoen commented 1 year ago

So the issue seems to be the conversion from f16 to f32, which is a slow path in hdf5-c. The numpy conversion is a lot faster here, as f16 acts more like a native type there. It would be great to file an issue upstream for this, as this should be a common conversion.

For reading such a format in this crate I do not have any quick suggestions. It might be useful to add a user-defined filter for this conversion, where you supply the quicker conversion algorithm, but this means fiddling with some C code. Alternatively, we could add an opaque type descriptor which reads raw bytes and leaves the interpretation to the user, as you have suggested.
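If such an opaque read existed, the user-side conversion itself would be straightforward with the half crate (a sketch; `raw` is a hypothetical buffer of the dataset's bytes, read without any hdf5-side conversion):

```rust
use half::f16;

// Sketch: reinterpret raw little-endian float16 bytes and widen to f32.
fn raw_f16_bytes_to_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|b| f16::from_le_bytes([b[0], b[1]]).to_f32())
        .collect()
}
```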

aldanor commented 1 year ago

It seems like you're comparing apples and oranges 🙂

So basically, you're comparing the conversion capabilities of numpy and raw hdf5 (the former being much faster, unsurprisingly); the Rust crate doesn't have much to do with it.

There's half::f16, which is probably the most sensible option should we want to support f16 floats. That's the type rust-numpy apparently uses for binding f16 as well.
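For context, half::f16 stores the raw 16-bit pattern and converts explicitly (a minimal sketch of its API):

```rust
use half::f16;

fn main() {
    let x = f16::from_f32(1.5);
    assert_eq!(x.to_f32(), 1.5);
    // The underlying representation is just the IEEE 754 binary16 bits.
    assert_eq!(x.to_bits(), 0x3E00);
}
```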

aldanor commented 1 year ago

Out of curiosity, I've just tested the performance of half::f16::to_f32() versus numpy's .astype(np.float32) for f16 arrays.

For 100M elements (~190 MB):

Rust version:

```rust
use criterion::{black_box as bb, criterion_group, criterion_main, Criterion};
use half::f16;

fn build_f16_vec(len: usize) -> Vec<f16> {
    (0..len)
        .map(|_| f16::from_bits(fastrand::u16(0..=u16::MAX)))
        .collect()
}

fn criterion_benchmark(c: &mut Criterion) {
    let len = 100_000_000;
    let v = bb(build_f16_vec(len));
    c.bench_function(&format!("convert_f32_f64_{len}"), |b| {
        b.iter(|| {
            let v = v.iter().map(|&x| x.to_f32()).collect::<Vec<_>>();
            bb(v)
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

```
convert_f32_f64_100000000
                        time:   [167.00 ms 167.54 ms 168.12 ms]
```

And the numpy version:

```python
>>> import numpy as np
>>> arr = (np.arange(0, 100_000_000) % (1 << 16)).astype(np.uint16).view(np.float16)
>>> %timeit arr.astype(np.float32)
164 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

So, performance is almost identical.

aldanor commented 1 year ago

Actually, the data was not the same :) To make tests completely identical data-wise:

```rust
fn build_f16_vec(len: usize) -> Vec<f16> {
    (0..=u16::MAX)
        .cycle()
        .take(len)
        .map(f16::from_bits)
        .collect()
}
```

which yields

```
convert_f32_f64_100000000
                        time:   [129.84 ms 130.33 ms 130.87 ms]
```

which is about 20% faster than numpy, if I'm reading it right.

aldanor commented 1 year ago

So, I think it's just a matter of us supporting half::f16 as an optional feature and adding float16 <-> half::f16 bindings, as long as ndarray::Array works OK with that.

And then you could just call arr.mapv(|x| x.to_f32()), which should be a bit faster than the numpy path (but will consume about 2x the RAM at peak, unlike hdf5's conversion path, which is slower but does not consume extra memory).
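For illustration, the two read paths would differ like this (a sketch; `ds` stands for an open hdf5::Dataset holding float16 data, and the second path assumes the proposed f16 support):

```rust
use half::f16;
use ndarray::Array1;

fn read_as_f32(ds: &hdf5::Dataset) -> Result<(Array1<f32>, Array1<f32>), hdf5::Error> {
    // Path 1: let hdf5-c convert f16 -> f32 during the read
    // (slow path, but no second array is materialized).
    let direct: Array1<f32> = ds.read_1d::<f32>()?;

    // Path 2: read natively as f16, then widen in Rust
    // (fast, but the f16 and f32 arrays briefly coexist in memory).
    let via_half: Array1<f32> = ds.read_1d::<f16>()?.mapv(|x| x.to_f32());

    Ok((direct, via_half))
}
```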

eth42 commented 1 year ago

As far as I can see, ndarray::Array works perfectly fine with half::f16. Sorry for not getting back to this, I had a busy two weeks. I guess opening an upstream issue is no longer relevant if bypassing the hdf5 conversion becomes an option?

aldanor commented 1 year ago

I'll take a quick look at whether we can hack in f16 on our side.

aldanor commented 1 year ago

@eth42 float16 support is now in and can be enabled via the "f16" feature (#246).

So, in your use case above, you can now just do

```rust
use half::f16;

let arr_f32 = ds.read_1d::<f16>()?.mapv(f16::to_f32);
```

and it should be no slower (probably a bit faster) than the h5py path.
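Adapted to the 2-D slice helper from the top of the thread, that would look something like this (a sketch; the function name is illustrative, and the "f16" feature must be enabled):

```rust
use half::f16;

// Sketch: read a row slice natively as f16, then widen to f32 in Rust,
// bypassing hdf5-c's slow f16 -> f32 conversion path.
fn get_rows_slice_f16_as_f32(
    file: &str,
    dataset: &str,
    i_row_from: usize,
    i_row_to: usize,
) -> Result<ndarray::Array2<f32>, hdf5::Error> {
    let file = hdf5::File::open(file)?;
    let data = file.dataset(dataset)?;
    let rows: ndarray::Array2<f16> =
        data.read_slice_2d(ndarray::s![i_row_from..i_row_to, ..])?;
    Ok(rows.mapv(f16::to_f32))
}
```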