Closed magland closed 6 months ago
Attention: Patch coverage is 76.63230%
with 68 lines
in your changes are missing coverage. Please review.
Project coverage is 82.34%. Comparing base (
162f19d
) to head (4cd9c1e
).
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Changes made:
Implemented a lindi.LocalCache(...)
that can be passed in to the LindiH5pyFile constructor. It uses a sqlite database on disk to cache chunks of remote files.
Remove dependency on remfile and instead use an internal LindiRemfile. This is necessary so we can use the same sqlite local cache method as the rest of the package (remfile uses a different system). Recall that when EXTERNAL_ARRAY_LINK is used, an h5py client is used to load dataset chunks (this happens when the number of chunks is too large to include in the .lindi.json file). In this case, the lindi.LindiRemfile() is used, and the optional local_cache object is passed in to lindi.LindiRemfile(). There are some other minor differences between Remfile and LindiRemfile as well.
Replace lindi.LindiH5pyFile.from_reference_file_system(...)
with lindi.LindiH5pyFile.from_lindi_file(...)
(more intuitive)
Add a convenience function f.write_lindi_file(fname)
where f is a lindi.LindiH5pyFile
How large can this cache grow? If it is unbounded, then I think it would be useful to set a configurable maximum size on the cache so that it does not accidentally swamp the filesystem, especially since it is stored in a hidden directory, and users could be working with very large datasets.
How large can this cache grow? If it is unbounded, then I think it would be useful to set a configurable maximum size on the cache so that it does not accidentally swamp the filesystem, especially since it is stored in a hidden directory, and users could be working with very large datasets.
@rly That's a good idea. How do you think we should manage the configuration? This shouldn't be a global setting since different scripts can utilize different cache directories.
It could be a parameter on LocalCache
but if might become unintuitive if you use the same cache directory for multiple applications. Let's punt the size limit as a nice-to-have feature.
Caching in fsspec has a lot of different options, such as expiry time, compression, cache chunks or whole files: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/cached.html But they do not yet support a cache size limit: https://github.com/fsspec/filesystem_spec/issues/510
Since the LocalCache
here is pretty simple, I'm fine with rolling our own, but would it make sense to use any of the fsspec CachingFileSystem
classes?
Since the
LocalCache
here is pretty simple, I'm fine with rolling our own, but would it make sense to use any of the fsspecCachingFileSystem
classes?
The lindi.LocalCache does something very specific... it caches specific chunks of data of remote files... and that's it. I imagine fsspec's system has a much grander scope, but I haven't looked at it.
fsspec's CachingFileSystem
has a similar purpose but has a lot of code that hooks into the rest of fsspec. The default approach is different, but interesting - it creates a memory-mapped sparse file that is filled in as blocks are accessed. I think the sqlite approach is more flexible because I think we can more easily manipulate it, but memory-mapped data will probably be faster to load and more amenable to large chunks than blobs in a sqlite db. I haven't carefully examined the details of their approach. In any case, I think this local cache will work great until we need to optimize further.
CachingFileSystem
This class implements chunk-wise local storage of remote files, for quick access after the initial download. The files are stored in a given directory with hashes of URLs for the filenames. If no directory is given, a temporary one is used, which should be cleaned up by the OS after the process ends. The files themselves are sparse (as implemented in :class:
~fsspec.caching.MMapCache
), so only the data which is accessed takes up space.
Not sure if this is the best way to do it, but this is a candidate.
@rly