allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Enhancement: Provide option to modify cache folder for entity linker knowledge base downloads #415

Open davidshumway opened 2 years ago

davidshumway commented 2 years ago

https://github.com/allenai/scispacy/blob/2290a80cfe0948e48d8ecfbd60064019d57a6874/scispacy/file_cache.py#L16

For Google Colab users, the Path.home() location is /root/, which is deleted when the runtime is cleared. As runtimes are cleared fairly often, this means re-downloading the KBs each time. Perhaps there is a way to alter Path.home from pathlib? Another option is to let the user specify a cache folder, which Colab users could set to a folder on their Google Drive (fwiw, Python within Colab sees a mounted Drive as a regular folder), thus making the download permanent.

dakinggg commented 2 years ago

I think you actually can do this, although admittedly I have not tried it. Can you try setting the SCISPACY_CACHE environment variable (used on this line https://github.com/allenai/scispacy/blob/3d153ddad1f11f000f961f7a92c0d862b93c0973/scispacy/file_cache.py#L16) to whatever folder you want to use, before importing the library?
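As a rough sketch of that suggestion (untested against scispacy itself): set the variable first, and only then import the library. The helper name set_scispacy_cache and the demo folder are made up for illustration; the only thing taken from the thread is the SCISPACY_CACHE variable name.

```python
import os
import tempfile
from pathlib import Path


def set_scispacy_cache(folder: str) -> Path:
    """Hypothetical helper: create `folder` if needed and point SCISPACY_CACHE at it."""
    cache_dir = Path(folder).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["SCISPACY_CACHE"] = str(cache_dir)
    return cache_dir


# Demo folder only; in Colab this could instead be a path on a mounted Google Drive.
cache_dir = set_scispacy_cache(os.path.join(tempfile.gettempdir(), "scispacy_cache_demo"))

# Assumption: the variable is read when scispacy is imported,
# so the import has to come after the call above.
# import scispacy
```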

davidshumway commented 2 years ago

Makes sense.

So it seems to be working, with a bit of a workaround.

The files are initially cached to /root/.scispacy/datasets/.

After caching, move the cache folder to a permanent folder on Google Drive:

!mv /root/.scispacy/ /content/gdrive/MyDrive/test/
!ls /content/gdrive/MyDrive/test/.scispacy/
>>> datasets

To update the environment variable, as described:

import os
os.environ['SCISPACY_CACHE'] = '/content/gdrive/MyDrive/test/.scispacy/'

However, this alone does not find the cached files; the files are downloaded again. For the new environment variable to be seen, it's necessary to restart the runtime (Runtime->Restart runtime) and set the variable before importing scispacy, presumably because the library reads it at import time.

Now when running the entity linker, it will see the permanently cached files.
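For reference, the workaround above could be wrapped up roughly as follows. This is only a sketch: persist_cache is a made-up helper, and the demo runs on temporary folders standing in for /root/.scispacy/ and the Google Drive folder. It still assumes the variable is set before scispacy is imported.

```python
import os
import shutil
import tempfile
from pathlib import Path


def persist_cache(current: Path, permanent: Path) -> Path:
    """Hypothetical helper: move an existing cache folder once, then point SCISPACY_CACHE at it."""
    if current.exists() and not permanent.exists():
        permanent.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(current), str(permanent))
    # Make sure the target exists even if there was nothing to move.
    permanent.mkdir(parents=True, exist_ok=True)
    os.environ["SCISPACY_CACHE"] = str(permanent)
    return permanent


# Demo with temporary folders standing in for the real locations.
base = Path(tempfile.mkdtemp())
old_cache = base / "root" / ".scispacy"
(old_cache / "datasets").mkdir(parents=True)
(old_cache / "datasets" / "kb.bin").write_text("placeholder")  # fake cached file

new_cache = persist_cache(old_cache, base / "drive" / "test" / ".scispacy")
print(sorted(p.name for p in new_cache.iterdir()))
```

Calling it again on a later run should leave the already-moved folder alone and just set the variable.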

So is an enhancement necessary? It'd definitely be easier and more foolproof to simply add a parameter such as cache_folder to the nlp.add_pipe() method. For example:

nlp.add_pipe(
  "scispacy_linker",
  config={
    "resolve_abbreviations": True,
    "linker_name": "umls",
    "cache_folder": "/content/gdrive/MyDrive/test/"})

which would then be used to look for a subfolder .scispacy, i.e. /content/gdrive/MyDrive/test/.scispacy/ in this case.
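To make the proposal concrete, here is a sketch of how such a lookup might work. Nothing here is existing scispacy code: resolve_cache_root and the cache_folder argument are hypothetical, and the fallback order (explicit folder, then SCISPACY_CACHE, then the home directory) is just one reasonable guess.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_cache_root(cache_folder: Optional[str] = None) -> Path:
    """Hypothetical lookup: an explicit folder wins, then SCISPACY_CACHE, then ~/.scispacy."""
    if cache_folder is not None:
        # The proposal: look for a .scispacy subfolder inside the given folder.
        return Path(cache_folder) / ".scispacy"
    env_value = os.environ.get("SCISPACY_CACHE")
    if env_value:
        return Path(env_value)
    return Path.home() / ".scispacy"


print(resolve_cache_root("/content/gdrive/MyDrive/test/"))
```

With this ordering, existing users who rely on the environment variable or the default location would see no change unless they pass the new option.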