NCAS-CMS / pyfive

A pure Python HDF5 file reader
BSD 3-Clause "New" or "Revised" License
1 stars 1 forks source link

h5 chunk cache? #6

Open bnlawrence opened 2 months ago

bnlawrence commented 2 months ago

From the h5py docs:

Chunk cache Chunked storage allows datasets to be stored on disk in separate pieces. When a part of any one of these pieces is needed, the entire chunk is read into memory before the requested part is copied to the user’s buffer. To the extent possible those chunks are cached in memory, so that if the user requests a different part of a chunk that has already been read, the data can be copied directly from memory rather than reading the file again. The details of a given dataset’s chunks are controlled when creating the dataset, but it is possible to adjust the behavior of the chunk cache when opening the file.

The parameters controlling this behavior are prefixed by rdcc, for raw data chunk cache. They apply to all datasets unless specifically changed for each one.

rdcc_nbytes sets the total size (measured in bytes) of the raw data chunk cache for each dataset. The default size is 1 MiB. This should be set to the size of each chunk times the number of chunks that are likely to be needed in cache.

rdcc_w0 sets the policy for chunks to be removed from the cache when more space is needed. If the value is set to 0, then the library will always evict the least recently used chunk in cache. If the value is set to 1, the library will always evict the least recently used chunk which has been fully read or written, and if none have been fully read or written, it will evict the least recently used chunk. If the value is between 0 and 1, the behavior will be a blend of the two. Therefore, if the application will access the same data more than once, the value should be set closer to 0, and if the application does not, the value should be set closer to 1.

rdcc_nslots is the number of chunk slots in the cache for each dataset. In order to allow the chunks to be looked up quickly in cache, each chunk is assigned a unique hash value that is used to look up the chunk. The cache contains a simple array of pointers to chunks, which is called a hash table. A chunk’s hash value is simply the index into the hash table of the pointer to that chunk. While the pointer at this location might instead point to a different chunk or to nothing at all, no other locations in the hash table can contain a pointer to the chunk in question. Therefore, the library only has to check this one location in the hash table to tell if a chunk is in cache or not. This also means that if two or more chunks share the same hash value, then only one of those chunks can be in the cache at the same time. When a chunk is brought into cache and another chunk with the same hash value is already in cache, the second chunk must be evicted first. Therefore it is very important to make sure that the size of the hash table (which is determined by the rdcc_nslots parameter) is large enough to minimize the number of hash value collisions. Due to the hashing strategy, this value should ideally be a prime number. As a rule of thumb, this value should be at least 10 times the number of chunks that can fit in rdcc_nbytes bytes. For maximum performance, this value should be set approximately 100 times that number of chunks. The default value is 521.

Chunks and caching are described in greater detail in the HDF5 documentation.