flatironinstitute / kachery-p2p

Peer-to-peer content-addressable file sharing using kachery

proposal: garbage collection #19

Open magland opened 3 years ago

magland commented 3 years ago

I propose the following mechanism for managing garbage collection. Introduce the following new Python functions:

```python
kp.set_ref(key, obj, meta)
obj, meta = kp.get_ref(key)
kp.del_ref(key)
keys = kp.get_ref_keys(query)
```

- `key` is a string
- `obj` is a json-able dict, list, or string containing `sha1://...` references to files in kachery storage
- `meta` is a json-able object
- `query` is a query for retrieving keys

What it does:

```python
# This sends a request to the kachery-p2p daemon.
# The daemon sets a new record (or replaces an existing one) in the ref database:
# key -> (obj, meta)
kp.set_ref('key1', {'a': 'sha1://01...', 'b': 'sha1://02...'}, {'some-info': 'for-the-garbage-collector'})
# This will fetch the (obj, meta) pair from the database (request to kachery-p2p daemon)
obj, meta = kp.get_ref('key1')
# This will delete the 'key1' record from the database (request to kachery-p2p daemon)
kp.del_ref('key1')
# This will fetch a list of keys from the database based on the query (request to kachery-p2p daemon)
keys = kp.get_ref_keys({})
# keys will be ['key1', ...]
```

The (key, obj, meta) entries in the database may expire based on some rules in meta. For example, meta might include an expiration date.
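One possible expiry rule, as a sketch: assume `meta` may carry an ISO-8601 `expires` timestamp (this field name is an assumption for illustration, not part of the proposal), and the daemon treats any ref whose timestamp has passed as expired:

```python
from datetime import datetime, timezone

def is_expired(meta, now=None):
    # Hypothetical rule (an assumption for illustration): a ref expires
    # if meta carries an ISO-8601 'expires' timestamp in the past.
    now = now or datetime.now(timezone.utc)
    expires = (meta or {}).get('expires')
    if expires is None:
        return False  # no expiration rule: the ref never expires
    return datetime.fromisoformat(expires) < now
```

A ref with no `expires` field would then be kept indefinitely, which keeps the default behavior safe.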

The kachery-p2p daemon will be configured to perform periodic garbage collection. Files in kachery storage will be deleted based on the user-imposed limit on the total amount of space to use and on the priority of files. Priority is determined based on the following factors:

@jsoules

magland commented 3 years ago

Example usage. A hither job cache can be configured to create references to output files that are created by cached hither jobs. Those references can be set to expire after a certain amount of time. For example, 7 days after the last time a particular cached job was referenced. Thus old (unused) results will eventually get garbage collected.
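The sliding 7-day expiry described above could be implemented by rewriting `meta` each time a cached job is referenced. A minimal sketch (the `expires` field and `touch_expiry` helper are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def touch_expiry(meta, days=7, now=None):
    # Sliding expiry (an assumed convention, not part of the proposal):
    # each time a cached result is referenced, push 'expires' forward
    # by `days`, so unused results eventually become GC candidates.
    now = now or datetime.now(timezone.utc)
    out = dict(meta or {})
    out['expires'] = (now + timedelta(days=days)).isoformat()
    return out
```

The job cache would then call something like `kp.set_ref(key, obj, touch_expiry(meta))` on each cache hit.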

jsoules commented 3 years ago

You've raised a really important point: there may be dependencies encoded inside files themselves, e.g. in JSON format. So e.g. myresults.json includes a raw_data_sha1 field which refers to file 126457, but there is no explicit (in-kachery) reference to 126457; the kachery user will be required to know that the file is needed and keep it around.

(This isn't different from any other "make sure you don't delete your research" situation, but it's important to be clear on the scope of what the systems do.)

It makes sense to have time-based expiry of hither job cache output, for sure. But my instinct is that any job which writes a file as output is probably something that should be kept unless manually deleted?

magland commented 3 years ago

@jsoules. Do you have any comments/concerns about the Python API? I was thinking it may be necessary to have groups or folders of keys so that something like the hither cache doesn't overwhelm the key space, making it impossible to browse and selectively delete references. So maybe... kp.set_ref(group, key, obj, meta) or something.
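One lightweight way to get groups without changing the database schema would be to encode the group into the key itself, so listing a group becomes a prefix query. A sketch under that assumption (the `group/key` encoding and both helpers are hypothetical):

```python
def make_key(group, key):
    # Hypothetical encoding: store (group, key) as 'group/key' so that
    # listing a group is a prefix query over the existing key column.
    # Assumes '/' never appears in a group name.
    if '/' in group:
        raise ValueError("group must not contain '/'")
    return f'{group}/{key}'

def split_key(full_key):
    # Inverse of make_key: split on the first '/' only, so keys
    # themselves may still contain '/'.
    group, _, key = full_key.partition('/')
    return group, key
```

The alternative is a four-argument API with a separate group column, which makes group queries explicit at the cost of a schema change.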

jsoules commented 3 years ago

Thinking about this some more.

First--not sure what you mean about the hither cache overwhelming the key space & browsing. How would the database be realized--is it in filesystem, SQLite or something, ...? I'm not clear on how browsing would be accomplished or how a user would want to interact with the key store in that fashion.

To take a step back, the API is binding a key to an object and metadata. The object appears to be a collection of keys to SHA1 fingerprints. So is this essentially a way of collecting together several different kachery records and indicating that there is a reference to them, along with aliases to those kachery records? So like, I could have a key of AnimalSubject1, an obj of {'recording': 'sha1://1234...', 'sorting': 'sha1://5678....', 'curationFeed': 'sha1://90ab...', 'snippets': 'sha1://cdef....'}, and meta like {'expires': '2021-03-31', 'format': 'nwb', ...}?

Then when that 'expires' time has passed, this key gets deleted; and the system is intelligent enough to know that, say, there are no more references to sha1://1234... so the recording file may be deleted, but there are other references to sha1://cdef.... so the snippets file will be preserved?
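The "no more references" check described here amounts to sweeping every live ref object and collecting the set of sha1:// URIs that are still reachable; any stored file outside that set is a GC candidate. A sketch of that sweep, assuming obj values are dicts, lists, or strings as in the proposal:

```python
import re

SHA1_URI = re.compile(r'sha1://[0-9a-f]+')

def referenced_uris(objs):
    # Walk every live ref object (dict / list / str, per the proposal)
    # and collect all sha1:// URIs it mentions. Files whose URI is not
    # in the returned set are candidates for garbage collection.
    found = set()
    def walk(x):
        if isinstance(x, str):
            found.update(SHA1_URI.findall(x))
        elif isinstance(x, dict):
            for v in x.values():
                walk(v)
        elif isinstance(x, list):
            for v in x:
                walk(v)
    for obj in objs:
        walk(obj)
    return found
```

Note this only sees explicit references; it cannot see sha1 fields embedded in the *contents* of a JSON file, which is exactly the dependency gap raised earlier in the thread.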

How does the query functionality work? Are we envisioning querying for keys matching a certain pattern; or for any references which include a reference to file sha1://1a2b...; or by metadata keys, or...?

magland commented 3 years ago

> First--not sure what you mean about the hither cache overwhelming the key space & browsing. How would the database be realized--is it in filesystem, SQLite or something, ...? I'm not clear on how browsing would be accomplished or how a user would want to interact with the key store in that fashion.

I think 'browsing' wasn't the right term. What I mean is that the user needs some way to query/list/inspect the database in order to be able to selectively clean up (ie delete) items. I was imagining a file browser type thing in addition to Python and command-line tools. But there probably should be a hierarchy of keys (like with groups) because otherwise they will all be mixed together, and the hither job cache may have thousands of entries.

It is an SQLite database.
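Given SQLite, the ref store could be a single table with JSON-encoded obj and meta columns, and a group listing becomes a `LIKE` prefix query. A minimal sketch (the schema and column names are assumptions, not the actual kachery-p2p schema):

```python
import json
import sqlite3

# Hypothetical schema for the ref database discussed above.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE refs (
    key  TEXT PRIMARY KEY,
    obj  TEXT NOT NULL,  -- JSON-encoded dict / list / str
    meta TEXT NOT NULL   -- JSON-encoded metadata for the GC
)''')

# set_ref: insert or replace a record.
conn.execute(
    'INSERT OR REPLACE INTO refs VALUES (?, ?, ?)',
    ('hither-cache/job-01',
     json.dumps({'out': 'sha1://01...'}),
     json.dumps({'expires': '2021-03-31'})))

# get_ref_keys for a group: a prefix query over the key column.
keys = [row[0] for row in conn.execute(
    "SELECT key FROM refs WHERE key LIKE 'hither-cache/%' ORDER BY key")]
```

This keeps listing and selective deletion cheap, since both operate on the indexed primary-key column.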