Inject dataset reading via `Hdf5Reader`.

1uc commented 10 months ago

The idea is that one injects call-backs for reading selections from datasets. These callbacks implement:

template<class T>
std::vector<T> readSelection(const HighFive::DataSet& dset, Selection& selection);
HighFive::File open_file(const std::string& filename);

The default calls _readSelection. Separate readers can implement variants suitable for their purpose, e.g. MPI-IO. The advantage is that the reader has control over both the required collective semantics; and setting HDF5 properties. This allows some readers to have non-collective behaviour, such as short-circuiting for empty selections. While others implement strictly collective reading of datasets.

Additionally, we can create a suite of readers for different usecases, e.g. MPI-IO for neurodamus, aggregating reader for serial workloads with GPFS (and a separate one for LUSTER if need be), the default implementation because it minimizes number of bytes read.

ferdonline commented 10 months ago

Interesting design. As I understand you are implementing a collective reader already, so in Neurodamus we would need basically to call make_collective_reader and use it as argument in calls to the config which return storage. I think this is fine.

1uc commented 10 months ago

Nice! If I get this correct, we can just pull out the collective read interface into it's own library?

Yes, almost certainly, because mpi4py is a build dependency; and one can't have optional build dependencies (nicely).

1uc commented 10 months ago

This PR has been heavily reworked since the previous round of discussions. Eventually it will only include the Hdf5Reader API. The two commits that refactor libsonata have been split into their own PRs. The implementation of an MPI-IO reader has been moved to a separate PR (until it can be moved to its own repository).

1uc commented 9 months ago

The questions for review are:

Are the rules regarding collectiveness of libsonata acceptable?
Do we need the ability to pass options to the reader on getAttribute and similar? Currently, I can only think of collective which would be used to specify if the call is collective. Things like block size, I fell would go into the reader.
Naming of Hdf5Reader.

The next review would include question like:

Can we have an optional dependency libsonata[mpi]? The advantage is that libsonata can control the version of libsonata_mpi it's compatible with, which is useful if the API of Hdf5Reader needs to change.
Can we have wrappers for libsonata_mpi.make_collective_reader in libsonata? The advantage is that one could then always do libsonata.make_collective_reader which would always succeed. If libsonata_mpi is available it will return a collective reader, otherwise it'll return the default reader.

With the above, code written to use collective IO, should automatically run even if the user can't install mpi4py or libsonata_mpi. Therefore, users can avoid the pain of installing the difficult dependencies; and devs don't need to worry about making it easy for their users.

mgeplf commented 9 months ago

Last chance for comments @matz-e / @ferdonline; otherwise I'll approve this at the end of the day.

1uc commented 9 months ago

@mgeplf we're in a reasonably good place to hold off on merging this. With all the preliminary work out of the way, we'll not suffer too much from merge conflicts and we can work on multiple other things (like cleaning up a bit more after #319).

We only need this in once we have it fully working and tested in neurodamus. Until then retaining the freedom to change details in might be nice.

BlueBrain / libsonata

Inject dataset reading via `Hdf5Reader`. #307