StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

hdf5: parallel file io without pre-splitting files into chunks #1137

Open rohany opened 3 years ago

rohany commented 3 years ago

It seems like HDF5 has enough information within a file to support parallel reads of a single large file without having to pre-split the file into multiple chunks -- such an API seems pretty useful for reading large data sets. I discussed this on Slack with @manopapad, and the takeaway was that an external library could be written to do this.

I'm not very knowledgeable about HDF5 or file I/O within Legion, but it seems like such a feature could be part of Legion, or at least a common library for people to use. I bet some discussion has come up around this, so I'd like to hear people's thoughts (cc @lightsighter, @streichler).

A straw man API for an operation like this:

class Launcher {
  Launcher(std::string filename);
  // The user specifies the partition they want to load into and which HDF5
  // dataset each field should be read from. Requires that the partition is
  // disjoint.
  void add_region_requirement(RegionRequirement req,
                              std::map<FieldID, const char*> field_to_dataset);
};
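Usage of the straw man might look something like the sketch below. Everything in it is hypothetical and just illustrates the intent of mapping HDF5 datasets onto the subregions of a disjoint partition: the file name, the dataset paths, the field IDs FID_X/FID_Y, and the final dispatch call are all placeholders, not an existing Legion API.

// Hypothetical usage of the straw-man Launcher above: read two HDF5 datasets
// into two fields of a disjoint partition. All names are placeholders.
void load_from_hdf5(Runtime *runtime, Context ctx,
                    LogicalPartition lp, LogicalRegion parent,
                    FieldID FID_X, FieldID FID_Y)
{
  Launcher launcher("particles.h5");  // straw-man launcher from above
  RegionRequirement req(lp, 0 /*projection*/, WRITE_DISCARD, EXCLUSIVE, parent);
  req.add_field(FID_X);
  req.add_field(FID_Y);
  launcher.add_region_requirement(req, {{FID_X, "/particles/x"},
                                        {FID_Y, "/particles/y"}});
  // Some execute/dispatch call on the launcher would then perform the
  // parallel read, filling each subregion from its slice of the datasets.
}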

There are several considerations before using something like this:

lightsighter commented 3 years ago

I would benchmark this first without Legion and make sure you're not going to hammer the file system by having all the nodes read the HDF5 file at the same time (even just the metadata). Distributed file systems may enable replication of files, but if it's a big file they're unlikely to do that, and if you don't get replication you'll effectively end up DDoS-ing the node where the file lives.
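A minimal standalone benchmark along these lines could have every rank independently open the same file and time reading its own hyperslab, which is exactly the pattern that stresses the file system at scale. This is only a sketch: the file name, dataset name, 1-D double-precision layout, and even divisibility by the rank count are all assumptions, and a real benchmark would sweep node counts and transfer sizes.

// Sketch: every MPI rank independently opens the same HDF5 file read-only
// and reads its slice of a 1-D dataset. File/dataset names are placeholders.
#include <hdf5.h>
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Independent (non-collective) opens: every rank touches the file's
  // metadata and storage servers on its own.
  hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
  hid_t dset = H5Dopen2(file, "field", H5P_DEFAULT);
  hid_t fspace = H5Dget_space(dset);

  hsize_t total;
  H5Sget_simple_extent_dims(fspace, &total, NULL);
  hsize_t count = total / nranks;            // assume evenly divisible
  hsize_t offset = count * rank;             // this rank's slice
  H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
  hid_t mspace = H5Screate_simple(1, &count, NULL);

  std::vector<double> buf(count);
  double start = MPI_Wtime();
  H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf.data());
  double elapsed = MPI_Wtime() - start;
  std::printf("rank %d read %llu doubles in %f s\n",
              rank, (unsigned long long)count, elapsed);

  H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
  MPI_Finalize();
  return 0;
}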

I suspect the better answer for Legion is to take advantage of the natural hierarchy in HDF5 files and try to do an index attach operation over the sub-files in the HDF5 file. The top-level file will be small and easily replicated by the distributed file system. The sub-files will be big, but naturally sharded across the nodes. At a minimum it will be better than using index space tasks to explicitly read the data, since Legion can be lazy and then instruct Realm's DMA system to move data directly to where it is needed when it is ultimately consumed. There's no point in guessing where to put the data when you don't know what is downstream in the program.
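For reference, Legion's existing per-file attach path looks roughly like the sketch below; an index attach over sub-files would presumably issue something like one of these per subregion, but the shape of that API is exactly the open question here. The helper function, file name, and dataset path are placeholders.

// Sketch: attach one HDF5 file to one (sub)region with Legion's existing
// AttachLauncher. Names of the helper and its arguments are placeholders.
#include "legion.h"
#include <map>
using namespace Legion;

PhysicalRegion attach_one(Runtime *runtime, Context ctx,
                          LogicalRegion lr, FieldID fid,
                          const char *filename, const char *dataset)
{
  AttachLauncher launcher(EXTERNAL_HDF5_FILE, lr, lr);
  std::map<FieldID, const char*> field_map;
  field_map[fid] = dataset;  // map Legion field -> HDF5 dataset path
  launcher.attach_hdf5(filename, field_map, LEGION_FILE_READ_ONLY);
  // Legion/Realm can then move the data lazily when it is actually consumed.
  return runtime->attach_external_resource(ctx, launcher);
}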

elliottslaughter commented 3 years ago

I have experience trying to maximize bandwidth out of a distributed file system like Lustre. I got to about 10% of global bandwidth on Lustre using (if I recall) only a handful of large files.

At least in the case of Lustre, it does shard large files across multiple file servers. The chunk size (Lustre's stripe size) is configurable, but is generally on the order of 1 MB. While it's better to line up the shards so that there's a 1-to-1 mapping between the ranks requesting file contents and the servers, that's not strictly necessary, and I hit my 10% number without doing that.

Having said that, one thing to note about my use cases is that the file format was not HDF5. Instead, we had our own streaming format, which meant we could read all of the relevant data in a single pread call from a given node (because all of the data for a given node was contiguous). I'd hope that HDF5 would be smart about minimizing read calls, but since it's an additional layer of abstraction, that's just something to watch out for.
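For concreteness, the contiguous-layout case reduces to a single large read per node, roughly as sketched below; the file path, offset, and size are placeholders and would come from whatever index the streaming format keeps.

// Sketch: when a node's data is one contiguous byte range, it can be pulled
// in with a single pread. Path, offset, and size below are placeholders.
#include <fcntl.h>
#include <unistd.h>
#include <vector>
#include <cstdio>

int main() {
  const off_t  my_offset = 0;         // this node's byte offset (placeholder)
  const size_t my_size   = 64 << 20;  // this node's chunk size (placeholder)

  int fd = open("/lustre/dataset.bin", O_RDONLY);
  if (fd < 0) { std::perror("open"); return 1; }

  std::vector<char> buf(my_size);
  ssize_t got = pread(fd, buf.data(), my_size, my_offset);  // one large read
  if (got < 0) { std::perror("pread"); close(fd); return 1; }

  std::printf("read %zd bytes\n", got);
  close(fd);
  return 0;
}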

As long as the data is read-only, you shouldn't get too much contention in the metadata server. Of course this pattern will be awful for writes; if you want to write, I think you really are better off writing distinct files.

If you want to push past 10% of global Lustre bandwidth, you may need to tune things more carefully, but I think for basic use cases where you're not too worried about performance, the easy thing should be fine.