INCATools / ontology-development-kit

Bootstrap an OBO Library ontology
http://incatools.github.io/ontology-development-kit/
BSD 3-Clause "New" or "Revised" License

Facilitate or at least document how to share the OAK cache #1051

Open gouttegd opened 2 months ago

gouttegd commented 2 months ago

The Ontology Access Kit (OAK, aka oaklib from Python’s point of view, aka runoak from the command line’s point of view) is one of the tools/libraries provided by the ODK.

In fact, the ODK is supposedly one of the easiest ways for “non-technical” users to get access to OAK, because installing Python programs is still too difficult for many people.

When OAK is used to access online resources (for example with -i sqlite:obo:uberon, which accesses a pre-built SQLite version of Uberon), it attempts to cache a copy of those resources in the local filesystem, to avoid re-downloading them on every call. The default location for the cache is ~/.data, or the value of the PYSTOW_HOME environment variable if such a variable is set.

(As an aside, defaulting to such a generic name under the user’s home directory is a terrible move, but that’s a deliberate decision that’s unlikely to ever change.)

Now when OAK is used from the ODK, the ~/.data directory is within the Docker container. So any file that OAK is storing there will only exist for as long as the container itself exists. That means that when people are running several OAK commands like this:

sh run.sh runoak -i sqlite:obo:uberon command1 ...
sh run.sh runoak -i sqlite:obo:uberon command2 ...
sh run.sh runoak -i sqlite:obo:uberon command3 ...

none of these commands will benefit from the cache. They will all download a fresh copy of Uberon.

One workaround is of course to run a shell within a container, instead of running runoak directly, and to then invoke runoak from that shell:

sh run.sh bash
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command1 ...
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command2 ...
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command3 ...

But that is not really a satisfying solution, as it will still lead to Uberon being re-downloaded every time the user starts a new working session (i.e., a new container), even if they already downloaded it the day before.

It is possible to configure the ODK to make the local cache visible from the container by “binding” the ~/.data directory from the local filesystem to the /home/odkuser/.data directory within the container, by adding the following in the src/ontology/run.sh.conf file:

ODK_BINDS=~/.data:/home/odkuser/.data

(This will only work once #1050 has been fixed.)

Another solution would be to set the PYSTOW_HOME variable to a directory within the repository (most likely somewhere under src/ontology/tmp), which is already bound to a mount point within the container. That would at least allow sharing the cache between ODK/OAK invocations run from within the same repository.
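Concretely, from a shell inside the container, that could look something like this (the tmp/pystow subdirectory is an arbitrary illustrative choice, not an established ODK convention):

```shell
# Point pystow's cache at a directory inside the repository, which is
# already bound into the container. tmp/pystow is an arbitrary choice.
export PYSTOW_HOME="$(pwd)/tmp/pystow"
mkdir -p "$PYSTOW_HOME"
# Subsequent invocations in this session, e.g.
#   runoak -i sqlite:obo:uberon command1 ...
# will now reuse the cache under tmp/pystow.
```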

At the very least, the ODK should provide documentation on how to do that.

Should the ODK try to do that automatically? I am on the fence here. On one side, it would be nice for users if the OAK cache worked “out of the box”, without any extra configuration. On the other side, the ODK container is supposed to shield the local filesystem (except the actual repository) from any side effects – everything that happens in the container stays in the container – so it may not be a good idea to silently punch a hole through the container’s wall. What if an interrupted download corrupts the cache? Users could reasonably expect that to have no consequence, since the command was run inside a container – except that no, the cache is actually outside the container, so you’ve just corrupted your real cache, oops!
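For what it’s worth, the corruption scenario above could be mitigated on the downloading side by populating the cache atomically: download to a temporary file next to the destination and rename into place only on success. A hedged sketch (illustrative only; fetch_to_cache is a hypothetical helper, not something OAK or pystow actually provides):

```shell
# Hypothetical helper: download to a temp file in the same directory,
# then rename into place, so an interrupted or failed download never
# leaves a half-written file at the cached path.
fetch_to_cache() {
    url="$1"
    dest="$2"
    tmp="$(mktemp "$dest.XXXXXX")" || return 1
    if curl -fsSL "$url" -o "$tmp"; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}
```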

Thoughts?