aldanor / hdf5-rust

HDF5 for Rust
https://docs.rs/hdf5
Apache License 2.0
308 stars · 82 forks

Iterating over all groups and datasets is slow #258

Closed kdheepak closed 1 year ago

kdheepak commented 1 year ago

I have an HDF5 file that has 4492 datasets.

If I iterate over all the groups and then all the datasets in Julia, it takes around half a second:

julia> using HDF5

julia> @time HDF5.h5open("./data/database.hdf5") do f
           String[lstrip(HDF5.name(ds), '/') for g in f for ds in f[HDF5.name(g)]]
       end |> length
  0.575411 seconds (186.63 k allocations: 10.299 MiB, 23.19% compilation time)
4492

julia> @time HDF5.h5open("./data/database.hdf5") do f
           String[lstrip(HDF5.name(ds), '/') for g in f for ds in f[HDF5.name(g)]]
       end |> length
  0.573753 seconds (186.62 k allocations: 10.299 MiB, 22.96% compilation time)
4492

julia> @time HDF5.h5open("./data/database.hdf5") do f
           String[lstrip(HDF5.name(ds), '/') for g in f for ds in f[HDF5.name(g)]]
       end |> length
  0.625914 seconds (186.63 k allocations: 10.311 MiB, 1.68% gc time, 21.28% compilation time)
4492

However in Rust it takes almost a full minute. Here's the code that I'm using:

#[cfg(test)]
mod tests {
  #[test]
  fn test_read_names() -> color_eyre::eyre::Result<()> {
    let f = hdf5::File::open("./../Model/data/database.hdf5")?;
    let mut names = vec![];
    for group in f.groups()? {
      for ds in group.datasets()? {
        names.push(ds.name())
      }
    }
    dbg!(names.len());
    Ok(())
  }
}

And here's how I'm testing it after adding the above code to ./src/names.rs:

$ cargo test -- names::tests::test_read_names --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 0.45s
     Running unittests src/main.rs (target\debug\deps\hdf5_data_viewer-10574144f7a82b29.exe)

running 1 test
[src\names.rs:378] names.len() = 4492
test names::tests::test_read_names ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 56.24s

Do you know what I'm doing incorrectly? I'm using the exact same HDF5 file in both Rust and Julia.

kdheepak commented 1 year ago

Just iterating over the groups and datasets is fast; the slow part turns out to be getting ds.name(). Any suggestions for the fastest way to get the fully qualified dataset names for all the datasets in a file?

mulimoen commented 1 year ago

Could you try the iter_visit method instead of getting the name after iterating? We don't store the name of the object, so ds.name() has to fetch it using the id, which might be slow.
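Sketching the idea in pseudocode (the exact iter_visit callback signature below is an assumption, so check the crate docs before using it): the visitor sees each link name during a single traversal, so names can be collected directly instead of being resolved from each object's id afterwards.

```
// Illustrative pseudocode only; verify the real `iter_visit`
// signature against the hdf5 crate documentation.
let names = group.iter_visit(
    |names: &mut Vec<String>, link_name, _link_info| {
        names.push(link_name.to_owned()); // name comes from the link itself
        true                              // keep iterating
    },
    Vec::new(),
)?;
```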

kdheepak commented 1 year ago

Thanks for your comment @mulimoen and all your work on this crate!

After looking at alternative methods, I ended up doing this instead:

let f = hdf5::File::open(&file).unwrap();
for group in f.member_names().unwrap() {
  for dataset in f.group(&group).unwrap().member_names().unwrap() {
    names.push(format!("{}/{}", group, dataset));
  }
}

which worked out to be quite a bit faster. It now takes less than 5 seconds to get all the names, open each dataset by name, and read some metadata from each dataset. I didn't time the exact difference against Julia, but this is now usable for me, so I can close this issue.
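For completeness, the path assembly above (plus the leading-slash trim the Julia version does with lstrip) can be factored into a small plain-Rust helper. The surrounding hdf5 calls are exactly the member_names() snippet above, so only the string handling is shown here; the group/dataset names in main are hypothetical stand-ins for what member_names() would return.

```rust
/// Join a group name and a dataset name into a fully qualified
/// HDF5 path, trimming any leading '/' from both components
/// (mirroring Julia's `lstrip(HDF5.name(ds), '/')`).
fn qualified_name(group: &str, dataset: &str) -> String {
    format!(
        "{}/{}",
        group.trim_start_matches('/'),
        dataset.trim_start_matches('/')
    )
}

fn main() {
    // Hypothetical names; in the real code these come from
    // `f.member_names()` and `f.group(&group).member_names()`.
    let names: Vec<String> = [("group1", "ds1"), ("/group2", "/ds2")]
        .into_iter()
        .map(|(g, d)| qualified_name(g, d))
        .collect();
    assert_eq!(names, vec!["group1/ds1", "group2/ds2"]);
    println!("{:?}", names); // ["group1/ds1", "group2/ds2"]
}
```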