Too much opening and closing files - Githubissues

SWIFTSIM / swiftsimio

Python library for reading SWIFT data. Uses unyt and h5py.

GNU Lesser General Public License v3.0

18 stars 13 forks source link

Too much opening and closing files #153

Open bwvdnbro opened 1 year ago

bwvdnbro commented 1 year ago

Last week I got told off by Peter Draper for overloading the meta-data server of the cosma7 file system. The reason: I was running 192 parallel instances of a script that uses swiftsimio to extract galaxies from a snapshot for analysis. Apparently the meta-data server is queried whenever a file is opened, even if you just closed that same file after another reading operation. In swiftsimio, the snapshot file is opened (and almost immediately closed again):

when reading header information
when reading the list of datasets
when reading units meta-data
when reading basically every bit of meta-data for a dataset (the datasets are done in one read, but a separate read is used for different bits of meta-data).
when reading the cosmology
when reading the named columns
when reading individual datasets That is a lot. I understand the reason for reading datasets separately, but I don't see a good reason to open and close the file for meta-data; this can and should all be read in one go. That might already help a bit.

Even if the number of reads per load() can be reduced, the more fundamental issue is that, right now, we need to do a separate load() whenever the mask is changed (because you want to read a different galaxy for example). That is great for parallelisation, since it essentially means you can spawn lots of processes for different galaxies without needing to worry about parallel HDF5 (each process uses its own h5py instance). But it means a huge overhead (in file accesses and also in amount of data being read) when you are trying to read a significant fraction of the snapshot. We need a better mechanism to deal with that. Ideally, we would have a mechanism that reads large chunks of datasets in few operations (Lustre file systems like that), but hides this from the user.

@JBorrow any thoughts on this? We really want to use swiftsimio for the morphology pipeline we plan to use for COLIBRE, but without some way to improve performance for loops over galaxies, we will never be able to run this on a large box.

JBorrow commented 1 year ago

We could keep around a persistent handle for the snapshot and re-use it. I've avoided doing that so far as it seemed cleaner to me this way, and you could ask the question of how on earth we would ever give back the handle. I guess we would have that in the destructor of the swift dataset, but you would still need to query the metadata server once per galaxy (though that doesn't seem too bad to me).

Specifically, you could add a handle as a property of the SWIFTDataset.

JBorrow commented 1 year ago

From slack: Is this not a massive concern for distributed snapshots? There is surely no other way of doing this than swapping out file handles, otherwise we'll have thousands of handles hanging around. Maybe @jchelly has some insight here? Does HDF5 do some magic under the covers?

jchelly commented 1 year ago

Maybe swiftsimio needs an API that allows reading multiple arrays in one operation? That's how I usually deal with multi-file snapshots: make a list of everything I need from each file then open the files in sequence. It should be possible to get the number of file opens down to one opening of file 0 to read all of the metadata and then each time a region is requested each file with particles in that region is opened once. You could also use something like functools.lru_cache to keep the last few files around if it's likely the same files will be needed again.

I'm not sure if HDF5 does anything clever with the virtual file. It seems to delay opening sub files until you request data from them but I don't know how it decides to close them or what happens if there are more files than the maximum number of file handles.

JBorrow commented 1 year ago

So what yt does is hold onto a single file handle. Indeed John, that may be a nice way to solve this, but it would ruin our lovely attribute-based API.

I guess as long as we hold onto the file handle, maybe hdf5 will under the covers? I've always hated the idea of the distributed snapshots, but I guess people have to use them...

JBorrow commented 1 year ago

The best way to do this would be to add an @property to SWIFTMetadata that either presents the currently open file handle or opens the file. This would require a slight re-write of the way we handle the SWIFTUnits object.

JBorrow commented 1 year ago

Then, when loading multiple regions, you should be able to re-use the same SWIFTMetadata (the same file handle) even if you destroy the SWIFTDataset?

MatthieuSchaller commented 1 year ago

Can we load all the meta-data once and for all? That would already reduce the number of reads and is light-weight.

JBorrow commented 1 year ago

I've proposed a fix in #155.