boyleworkflow / boyle

A tool for provenance and caching in computational workflows.
GNU Lesser General Public License v3.0

Make hashing compatible with git repositories #2

Open rasmuse opened 5 years ago

rasmuse commented 5 years ago

In case a git repository associated with a boyle project contains relevant files (i.e., scripts used in tasks) it seems unnecessary to store another copy of those files in the boyle storage. This leads to two related ideas:

kallus commented 5 years ago

I think these are two good ideas. If I'm not misunderstanding, they are independent of each other, both conceptually (neither requires the other) and in terms of implementation?

Regarding the first idea:

rasmuse commented 5 years ago

Yes, in principle I agree the two ideas are independent.

But in practice there is an appealing connection through the hash algorithm: if we adopt git's hash algorithm, then that single hash is enough to reference a stored object, regardless of whether it is stored in the git repo or in a separate directory. This is good because the external interface of the storage is then simply "give me the object (blob or tree) corresponding to this hash". It is then up to the storage implementation to look for the file in a git repo or other cache folder(s).
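For reference, a minimal sketch of how git derives its SHA-1 object ids: the hash covers a short header (`<type> <size>` followed by a NUL byte) plus the object's payload. The function names here are made up for illustration and are not part of boyle.

```python
import hashlib


def git_object_hash(kind: str, payload: bytes) -> str:
    """Return the SHA-1 object id git assigns to an object of the given kind.

    git hashes the header b"<kind> <size>\\0" followed by the raw payload.
    """
    header = f"{kind} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def git_blob_hash(content: bytes) -> str:
    """Hash file content the same way `git hash-object` does for a blob."""
    return git_object_hash("blob", content)


if __name__ == "__main__":
    # Hypothetical script content, only used to show the call.
    print(git_blob_hash(b"print('hello')\n"))
```

A tree is hashed the same way with kind `tree`, where the payload is the serialized list of entries (mode, name, and the raw 20-byte id of each child), so a directory's id follows from the ids of its contents.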

In fact, an object could even be stored in some combination of the two storage locations: a git tree (concretely a directory of files) could be partly restored from the git repo and partly from another cache directory.
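A rough sketch of what such a storage interface could look like, assuming every backend answers the same "give me the object for this hash" question. All class and method names are hypothetical, and the directory backend stores raw file content rather than git's zlib-compressed object format; a git-backed backend would implement the same `get` method by reading from the repository.

```python
import hashlib
from pathlib import Path
from typing import Optional, Protocol, Sequence


def blob_hash(content: bytes) -> str:
    """git-compatible blob id: SHA-1 over b"blob <size>\\0" + content."""
    header = f"blob {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore(Protocol):
    """Anything that can look up content by its git-compatible hash."""

    def get(self, digest: str) -> Optional[bytes]: ...


class DirectoryStore:
    """Cache directory holding raw content in files named by hash."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, digest: str) -> Path:
        # Fan out on the first two hex digits, like git's loose objects.
        return self.root / digest[:2] / digest[2:]

    def put(self, content: bytes) -> str:
        digest = blob_hash(content)
        path = self._path(digest)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
        return digest

    def get(self, digest: str) -> Optional[bytes]:
        path = self._path(digest)
        return path.read_bytes() if path.is_file() else None


class ChainedStore:
    """Ask each backend in order; the caller never sees where a hit came from."""

    def __init__(self, backends: Sequence[ObjectStore]) -> None:
        self.backends = list(backends)

    def get(self, digest: str) -> bytes:
        for backend in self.backends:
            content = backend.get(digest)
            if content is not None:
                return content
        raise KeyError(digest)
```

Because lookups happen per object, restoring a directory would naturally pull some files from one backend and some from another, which is the mixed case described above.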


Agree 100% that not all changes to scripts will be committed when running experiments. I think one of the possible strengths of boyle will be to help backtrack what-the-hell-did-I-do when an interesting file is the result of an uncommitted mess of scripts and input files.

rasmuse commented 5 years ago

libgit2 seems to be a preferred way to embed git in an application. It has Python bindings (pygit2) that may be useful for implementing a git-compatible Storage.

Here is also some general documentation on git objects: