Make hashing compatible with git repositories

rasmuse commented 5 years ago

If a git repository associated with a boyle project contains relevant files (e.g., scripts used in tasks), it seems unnecessary to store another copy of those files in the boyle storage. This leads to two related ideas:

1. Build a Storage implementation that uses a git repo where possible (typically for version controlled scripts) and otherwise stores files in a separate boyle cache directory (typically for data inputs/outputs).
2. To facilitate this, it would be nice to use git's notion of blob and tree objects, including the hashing algorithm for these objects. Note that git is now transitioning from SHA-1 hashes to SHA-256 and in the future possibly other hash algorithms. It seems wise to be compatible with this work, and possibly also to make use of their implementations where possible.
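For reference, here is a minimal sketch of how git hashes a blob: the digest is taken over a short header (`blob <size>\0`) followed by the raw file content. The function name `git_blob_hash` and the `algorithm` parameter are made up for illustration and are not part of boyle. As far as I understand, git's SHA-256 object format uses the same header, but treat that as an assumption.

```python
import hashlib


def git_blob_hash(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex object id git would assign to a blob with this content.

    Hypothetical helper for illustration. `algorithm` is any name accepted
    by hashlib.new; "sha1" matches git's classic object format.
    """
    # git prepends "blob <size in bytes>\0" before hashing the content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.new(algorithm, header + data).hexdigest()


if __name__ == "__main__":
    # Same value as: echo 'test content' | git hash-object --stdin
    print(git_blob_hash(b"test content\n"))
```

Because the hash depends only on the content, boyle could compute it for any file without a git repo being present at all.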

kallus commented 5 years ago

I think these are two good ideas. If I'm not misunderstanding, they are independent of each other, both conceptually (neither requires the other) and in terms of implementation?

Regarding the first idea:

In abstract terms, a Stored object could be either a reference to an object in the boyle cache or a reference to a git object (i.e. git blob?).
I think it will be typical that not all changes to scripts are committed to git whenever experiments are run, so the boyle cache directory would typically hold data as well as scripts that have changed since the last commit.

rasmuse commented 5 years ago

Yes, in principle I agree the two ideas are independent.

But in practice there is an appealing connection through the hash algorithm: if we adopt git's hash algorithm, then having computed only that hash, a stored object can be referenced by its git-compatible hash, regardless of whether it is stored in the git repo or in a separate directory. This is good because the external interface of the storage is then simply "give me the object (blob or tree) corresponding to this hash". It is then up to the storage implementation to look for the file in a git repo or other cache folder(s).
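To make the "give me the object corresponding to this hash" interface concrete, here is a rough sketch. All names (`ObjectSource`, `DictSource`, `LayeredStorage`) are hypothetical, and the dict-backed sources are only stand-ins for a real git repo and a real cache directory.

```python
from typing import Dict, Optional, Protocol, Sequence


class ObjectSource(Protocol):
    """Anything that can look up raw object content by hash (hypothetical)."""

    def read(self, digest: str) -> Optional[bytes]:
        ...


class DictSource:
    """In-memory stand-in for a git repo or a boyle cache directory."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self.objects = dict(objects or {})

    def read(self, digest: str) -> Optional[bytes]:
        return self.objects.get(digest)


class LayeredStorage:
    """Ask each source in order; the caller never learns which one answered."""

    def __init__(self, sources: Sequence[ObjectSource]) -> None:
        self.sources = list(sources)

    def get(self, digest: str) -> bytes:
        for source in self.sources:
            data = source.read(digest)
            if data is not None:
                return data
        raise KeyError(digest)


if __name__ == "__main__":
    git_repo = DictSource({"aaa": b"print('committed script')\n"})
    cache_dir = DictSource({"bbb": b"uncommitted data\n"})
    storage = LayeredStorage([git_repo, cache_dir])
    print(storage.get("aaa"))
    print(storage.get("bbb"))
```

The point of the design is that the lookup order is an implementation detail: the hash alone identifies the object.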

In fact, an object could even be stored in some combination of the two storage locations: a git tree (concretely a directory of files) could be partly restored from the git repo and partly from another cache directory.
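A tree hash can be computed from the hashes of its entries alone, which is what makes this kind of mixed restoration possible. Below is a simplified sketch of git's tree hashing, assuming SHA-1, regular non-executable files (mode `100644`) and subdirectories (mode `40000`) only. The names `git_object_id` and `tree_id` are made up for illustration, and symlinks, executable bits and submodules are ignored.

```python
import hashlib
from typing import Dict, Union

# A tree is described here as a nested dict: bytes for a file, dict for a subdirectory.
Tree = Dict[str, Union[bytes, "Tree"]]


def git_object_id(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 over git's "<kind> <size>\\0" header plus the body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_id(entries: Tree) -> bytes:
    """Raw object id of a tree built from the given entries (simplified)."""
    rows = []
    for name, value in entries.items():
        encoded = name.encode("utf-8")
        if isinstance(value, dict):
            # git sorts directory names as if they ended with "/".
            rows.append((encoded + b"/", b"40000", encoded, tree_id(value)))
        else:
            rows.append((encoded, b"100644", encoded, git_object_id(b"blob", value)))
    rows.sort(key=lambda row: row[0])
    # Each entry is "<mode> <name>\0" followed by the raw (not hex) object id.
    body = b"".join(mode + b" " + name + b"\0" + oid for _, mode, name, oid in rows)
    return git_object_id(b"tree", body)


if __name__ == "__main__":
    project = {"run.py": b"print('hi')\n", "data": {"input.csv": b"a,b\n1,2\n"}}
    print(tree_id(project).hex())
```

Since each entry only records the hash of its content, it does not matter whether `run.py` comes from the git repo and `data/input.csv` from a cache directory.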

Agree 100% that not all changes to scripts will be committed when running experiments. I think one of the possible strengths of boyle will be to help backtrack what-the-hell-did-I-do when an interesting file is the result of an uncommitted mess of scripts and input files.

rasmuse commented 5 years ago

libgit2 seems to be a preferred way to embed git in an application. It has Python bindings (pygit2) that may be useful for implementing a git-compatible Storage.

Here is also some general documentation on git objects:

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

boyleworkflow / boyle

Make hashing compatible with git repositories #2