We could keep a persistent handle for the snapshot and re-use it. I've avoided doing that so far because the current approach seemed cleaner to me, and it raises the question of how on earth we would ever give the handle back. I guess we would do that in the destructor of the `SWIFTDataset`, but you would still need to query the metadata server once per galaxy (though that doesn't seem too bad to me).
Specifically, you could add a handle as a property of the `SWIFTDataset`.
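A minimal sketch of what that could look like (the `_handle` attribute, the `filename` argument, and the destructor behaviour here are assumptions, not the actual swiftsimio internals):

```python
import h5py


class SWIFTDataset:
    # Sketch only: the real SWIFTDataset carries much more than this.
    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    @property
    def handle(self):
        # Lazily open the snapshot once and keep the handle for re-use;
        # bool(h5py.File) is False once the file has been closed.
        if self._handle is None or not self._handle:
            self._handle = h5py.File(self.filename, "r")
        return self._handle

    def __del__(self):
        # "Give back" the handle when the dataset is destroyed.
        if self._handle:
            self._handle.close()
```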
From Slack: Is this not a massive concern for distributed snapshots? There is surely no other way of doing this than swapping out file handles, otherwise we'll have thousands of handles hanging around. Maybe @jchelly has some insight here? Does HDF5 do some magic under the covers?
Maybe swiftsimio needs an API that allows reading multiple arrays in one operation? That's how I usually deal with multi-file snapshots: make a list of everything I need from each file, then open the files in sequence. It should be possible to get the number of file opens down to one open of file 0 to read all of the metadata, and then, each time a region is requested, one open of each file with particles in that region. You could also use something like `functools.lru_cache` to keep the last few files around if it's likely the same files will be needed again.
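As a rough illustration of the `functools.lru_cache` idea (the function names and cache size here are made up, not an existing swiftsimio API):

```python
from functools import lru_cache

import h5py


@lru_cache(maxsize=8)
def open_snapshot_file(filename):
    # Keep the most recently used sub-files open so that repeated region
    # requests do not re-open them (and re-query the metadata server).
    return h5py.File(filename, "r")


def read_arrays(filename, dataset_names):
    # Read everything needed from one file with a single open.
    handle = open_snapshot_file(filename)
    return {name: handle[name][...] for name in dataset_names}
```

One caveat with this approach: handles evicted from the cache are only closed once they are garbage collected, so the cache size has to stay comfortably below the file-handle limit.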
I'm not sure if HDF5 does anything clever with the virtual file. It seems to delay opening sub-files until you request data from them, but I don't know how it decides to close them, or what happens if there are more files than the maximum number of file handles.
So what yt does is hold onto a single file handle. Indeed John, that may be a nice way to solve this, but it would ruin our lovely attribute-based API.

I guess as long as we hold onto the file handle, maybe HDF5 will do its magic under the covers? I've always hated the idea of the distributed snapshots, but I guess people have to use them...
The best way to do this would be to add an `@property` to `SWIFTMetadata` that either presents the currently open file handle or opens the file. This would require a slight re-write of the way we handle the `SWIFTUnits` object.

Then, when loading multiple regions, you should be able to re-use the same `SWIFTMetadata` (the same file handle) even if you destroy the `SWIFTDataset`?
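Something along these lines, as a sketch (the attribute names and the `load_region` helper are assumptions, not existing swiftsimio code, and the real `SWIFTMetadata`/`SWIFTUnits` handling would need the re-write mentioned above):

```python
import h5py


class SWIFTMetadata:
    # Sketch only: holds the filename and a lazily opened handle.
    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    @property
    def handle(self):
        # Present the currently open file handle, or (re-)open the file.
        if self._handle is None or not self._handle:
            self._handle = h5py.File(self.filename, "r")
        return self._handle


def load_region(metadata, field, mask):
    # Every region read goes through metadata.handle, so the same open
    # file is shared across regions even if each dataset is thrown away.
    return metadata.handle[field][mask]
```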
Can we load all the meta-data once and for all? That would already reduce the number of reads and is light-weight.
I've proposed a fix in #155.
Last week I got told off by Peter Draper for overloading the meta-data server of the cosma7 file system. The reason: I was running 192 parallel instances of a script that uses swiftsimio to extract galaxies from a snapshot for analysis. Apparently the meta-data server is queried whenever a file is opened, even if that same file was only just closed after a previous read. In swiftsimio, the snapshot file is opened (and almost immediately closed again):
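Schematically, the pattern is the following (an illustrative sketch of the behaviour, not the actual swiftsimio source):

```python
import h5py


def read_field(filename, field, mask):
    # Open the snapshot, read one dataset, close the file again;
    # each open is another query to the metadata server.
    with h5py.File(filename, "r") as handle:
        return handle[field][mask]
```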
Even if the number of reads per `load()` can be reduced, the more fundamental issue is that, right now, we need to do a separate `load()` whenever the mask is changed (because you want to read a different galaxy, for example). That is great for parallelisation, since it essentially means you can spawn lots of processes for different galaxies without needing to worry about parallel HDF5 (each process uses its own `h5py` instance). But it means a huge overhead (in file accesses and also in the amount of data being read) when you are trying to read a significant fraction of the snapshot. We need a better mechanism to deal with that. Ideally, we would have a mechanism that reads large chunks of datasets in few operations (Lustre file systems like that), but hides this from the user.

@JBorrow any thoughts on this? We really want to use swiftsimio for the morphology pipeline we plan to use for COLIBRE, but without some way to improve performance for loops over galaxies, we will never be able to run this on a large box.