NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

implement LocalCache (1) #54

Closed by magland 4 months ago

magland commented 5 months ago

Not sure if this is the best way to do it, but this is a candidate.

@rly

codecov-commenter commented 5 months ago

Codecov Report

Attention: Patch coverage is 76.63230% with 68 lines in your changes missing coverage. Please review.

Project coverage is 82.34%. Comparing base (162f19d) to head (4cd9c1e).

| Files | Patch % | Lines |
| --- | --- | --- |
| lindi/LindiRemfile/LindiRemfile.py | 73.33% | 48 Missing :warning: |
| lindi/LindiH5pyFile/LindiH5pyFile.py | 66.66% | 11 Missing :warning: |
| lindi/LindiH5ZarrStore/LindiH5ZarrStore.py | 53.33% | 7 Missing :warning: |
| lindi/LocalCache/LocalCache.py | 94.11% | 2 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #54      +/-   ##
==========================================
- Coverage   83.42%   82.34%   -1.09%
==========================================
  Files          28       28
  Lines        1907     2101     +194
==========================================
+ Hits         1591     1730     +139
- Misses        316      371      +55
```

:umbrella: View full report in Codecov by Sentry.

magland commented 4 months ago

Changes made:

Implemented a lindi.LocalCache(...) that can be passed to the LindiH5pyFile constructor. It uses a sqlite database on disk to cache chunks of remote files.
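The chunk-caching idea can be sketched generically. This is a minimal illustration, not lindi's actual schema or class names: cached byte ranges of a remote file are keyed by (url, start, stop) in a sqlite table.

```python
import sqlite3

class ChunkCache:
    """Toy sqlite-backed cache of byte ranges of remote files (hypothetical names)."""

    def __init__(self, db_path: str = ":memory:"):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "url TEXT, start INTEGER, stop INTEGER, data BLOB, "
            "PRIMARY KEY (url, start, stop))"
        )

    def get(self, url: str, start: int, stop: int):
        # Return the cached bytes for this exact range, or None on a miss
        row = self._conn.execute(
            "SELECT data FROM chunks WHERE url=? AND start=? AND stop=?",
            (url, start, stop),
        ).fetchone()
        return None if row is None else row[0]

    def put(self, url: str, start: int, stop: int, data: bytes):
        self._conn.execute(
            "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
            (url, start, stop, sqlite3.Binary(data)),
        )
        self._conn.commit()

# Quick demonstration with an in-memory database
cache = ChunkCache()
cache.put("https://example.com/data.nwb", 0, 4, b"\x00\x01\x02\x03")
assert cache.get("https://example.com/data.nwb", 0, 4) == b"\x00\x01\x02\x03"
assert cache.get("https://example.com/data.nwb", 4, 8) is None
```

A real implementation would also need to handle concurrent writers and the final (shorter) chunk of a file, which this sketch ignores.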

Removed the dependency on remfile in favor of an internal LindiRemfile. This is necessary so that the same sqlite local cache method can be used as in the rest of the package (remfile uses a different system). Recall that when EXTERNAL_ARRAY_LINK is used, an h5py client loads the dataset chunks (this happens when the number of chunks is too large to include in the .lindi.json file). In that case, lindi.LindiRemfile() is used, and the optional local_cache object is passed to it. There are some other minor differences between Remfile and LindiRemfile as well.

Replaced lindi.LindiH5pyFile.from_reference_file_system(...) with the more intuitive lindi.LindiH5pyFile.from_lindi_file(...).

Added a convenience function f.write_lindi_file(fname), where f is a lindi.LindiH5pyFile.

rly commented 4 months ago

How large can this cache grow? If it is unbounded, then I think it would be useful to set a configurable maximum size on the cache so that it does not accidentally swamp the filesystem, especially since it is stored in a hidden directory, and users could be working with very large datasets.
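One way such a configurable cap could work is to track the total cached size in the sqlite database and evict least-recently-used entries when the limit is exceeded. This is a hedged sketch with hypothetical names, not lindi's implementation (which does not yet have a size limit).

```python
import sqlite3
import time

class BoundedChunkCache:
    """Toy LRU-evicting cache with a configurable byte limit (hypothetical)."""

    def __init__(self, db_path=":memory:", max_bytes=10_000_000):
        self.max_bytes = max_bytes
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "key TEXT PRIMARY KEY, data BLOB, accessed REAL)"
        )

    def put(self, key, data):
        self._conn.execute(
            "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
            (key, sqlite3.Binary(data), time.time()),
        )
        self._evict()
        self._conn.commit()

    def get(self, key):
        row = self._conn.execute(
            "SELECT data FROM chunks WHERE key=?", (key,)
        ).fetchone()
        if row is None:
            return None
        # Touch the entry so recently read chunks survive eviction
        self._conn.execute(
            "UPDATE chunks SET accessed=? WHERE key=?", (time.time(), key)
        )
        self._conn.commit()
        return row[0]

    def _evict(self):
        # Delete least-recently-accessed entries until under the limit
        while True:
            total = self._conn.execute(
                "SELECT COALESCE(SUM(LENGTH(data)), 0) FROM chunks"
            ).fetchone()[0]
            if total <= self.max_bytes:
                break
            self._conn.execute(
                "DELETE FROM chunks WHERE key = "
                "(SELECT key FROM chunks ORDER BY accessed LIMIT 1)"
            )
```

Keeping the accounting inside the same sqlite file means multiple processes sharing a cache directory would at least agree on its size, though proper cross-process locking is a separate concern.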

magland commented 4 months ago

> How large can this cache grow? If it is unbounded, then I think it would be useful to set a configurable maximum size on the cache so that it does not accidentally swamp the filesystem, especially since it is stored in a hidden directory, and users could be working with very large datasets.

@rly That's a good idea. How do you think we should manage the configuration? This shouldn't be a global setting since different scripts can utilize different cache directories.

rly commented 4 months ago

It could be a parameter on LocalCache, but it might become unintuitive if the same cache directory is used by multiple applications. Let's punt on the size limit as a nice-to-have feature.

Caching in fsspec has a lot of different options, such as expiry time, compression, and caching chunks or whole files: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/cached.html But they do not yet support a cache size limit: https://github.com/fsspec/filesystem_spec/issues/510

Since the LocalCache here is pretty simple, I'm fine with rolling our own, but would it make sense to use any of the fsspec CachingFileSystem classes?

magland commented 4 months ago

> Since the LocalCache here is pretty simple, I'm fine with rolling our own, but would it make sense to use any of the fsspec CachingFileSystem classes?

The lindi.LocalCache does something very specific... it caches specific chunks of data of remote files... and that's it. I imagine fsspec's system has a much grander scope, but I haven't looked at it.

rly commented 4 months ago

fsspec's CachingFileSystem has a similar purpose but has a lot of code that hooks into the rest of fsspec. The default approach is different, but interesting - it creates a memory-mapped sparse file that is filled in as blocks are accessed. I think the sqlite approach is more flexible because I think we can more easily manipulate it, but memory-mapped data will probably be faster to load and more amenable to large chunks than blobs in a sqlite db. I haven't carefully examined the details of their approach. In any case, I think this local cache will work great until we need to optimize further.

> **CachingFileSystem**
>
> This class implements chunk-wise local storage of remote files, for quick access after the initial download. The files are stored in a given directory with hashes of URLs for the filenames. If no directory is given, a temporary one is used, which should be cleaned up by the OS after the process ends. The files themselves are sparse (as implemented in `fsspec.caching.MMapCache`), so only the data which is accessed takes up space.
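The sparse memory-mapped approach described in that docstring can be sketched in a few lines: allocate a file the size of the remote object without writing any data, map it into memory, and fill in only the block ranges that are actually fetched. This mirrors the concept of fsspec's MMapCache, not its actual code.

```python
import mmap
import os
import tempfile

remote_size = 1 << 20      # pretend the remote file is 1 MiB
block = 4096               # cache granularity

# truncate() extends the file without writing data; on most filesystems
# the file is sparse, so unfetched regions occupy no disk space
path = os.path.join(tempfile.mkdtemp(), "cachefile")
with open(path, "wb") as f:
    f.truncate(remote_size)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), remote_size)
    # Fill in just one "downloaded" block; everything else stays sparse
    mm[block:2 * block] = b"x" * block
    data = bytes(mm[block:block + 4])
    mm.close()

assert data == b"xxxx"
```

Compared with blobs in a sqlite db, reads here are plain memory accesses with no SQL round trip, which is part of why large chunks may load faster from a memory-mapped cache.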