
How to distribute the data #3

Open · goodmami opened this issue 4 years ago

goodmami commented 4 years ago

This issue is to track ideas for how to distribute the wordnet data for this library. Some concerns:

goodmami commented 4 years ago

I searched through the setuptools docs and didn't see any obvious way to have a package install trigger a script to download data from some external source, so the "extras" (pip install wn[pwn]) method looks unlikely.
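For context, a minimal sketch of what the ruled-out extras route would have looked like (the `wn-data-pwn` package name is hypothetical); the limitation is that extras can only declare extra dependencies, they can't trigger a download script:

```python
# Hypothetical setup.py fragment for the "extras" approach.
# `pip install wn[pwn]` would pull in the listed packages, but
# setuptools provides no hook here to run a download script.
from setuptools import setup

setup(
    name="wn",
    # ...
    extras_require={
        "pwn": ["wn-data-pwn"],  # would have to be a real PyPI package
    },
)
```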

The spaCy method didn't seem so bad. Models are distributed as Python packages (not on PyPI, just as .tar.gz files) as assets on GitHub releases (they're too large to version in Git, and attaching them to releases is kinda like version tags). This way they can either be installed with a special "download" command (which selects the version automatically) or pip-installed (where the user manually chooses the version).
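A rough sketch of how such a download command could resolve and fetch a release asset (the repo and naming scheme below are placeholders, not the actual spaCy or wn URLs):

```python
# Hypothetical spaCy-style downloader: map a project/version to a
# GitHub release asset and fetch it for later installation.
import urllib.request
from pathlib import Path

ASSET_URL = ("https://github.com/goodmami/wn-data"  # placeholder repo
             "/releases/download/{tag}/{asset}")

def download(project: str, version: str, dest: Path) -> Path:
    asset = f"{project}-{version}.tar.gz"
    url = ASSET_URL.format(tag=f"{project}-{version}", asset=asset)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / asset
    urllib.request.urlretrieve(url, target)  # then pip-install or unpack
    return target
```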

alvations commented 4 years ago

Data distribution is quite a rabbit hole. Many people have tried, and none of the solutions quite scratches the itch for NLP people... E.g. https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-data/

alvations commented 4 years ago

I do have something written up for DigitalOcean Spaces. It can automatically up- and download files, but it might not be ideal.

Proposed distribution flow:

Let me try some hacked-up code for the above and see how it works out.
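For reference, a rough sketch of what the Spaces up/download could look like via boto3 (Spaces speaks the S3 API); the region, endpoint, and bucket name are assumptions, and credentials come from the usual environment variables:

```python
# Up/download against DigitalOcean Spaces using boto3's S3 client.
import boto3

client = boto3.session.Session().client(
    "s3",
    region_name="nyc3",  # placeholder region
    endpoint_url="https://nyc3.digitaloceanspaces.com",
)

def upload(local_path: str, key: str) -> None:
    client.upload_file(local_path, "wn-data", key)  # "wn-data" bucket is hypothetical

def download(key: str, local_path: str) -> None:
    client.download_file("wn-data", key, local_path)
```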

goodmami commented 4 years ago

Dolt seems nice for maintainers of wordnet resources, but it's not immediately clear how we'd integrate it into our project (though I didn't go over the docs too closely). Forcing users to install a Dolt client also feels a bit heavier than regular Python dependencies.

Anyway, my criteria are (1) that we don't store data in the code repository and (2) that we have an easy way to get the data. You said you also want (3) pip-installability. If we allow both regular downloads and pip-installs for data, there are two locations for the data: some cache or download directory, and the Python environment's modules. An unfortunate side effect of the latter is that if we set up a virtual environment, then delete it and create a new one, the data would be deleted too, whereas a cache directory (similar to ~/nltk_data, ~/.allennlp, etc.) could persist. It would also be nice to use a platform-appropriate location (like ~/.local/share/ on Linux) via something like appdirs, but that's a topic for #2.
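For illustration, the appdirs part is nearly a one-liner (the "wn" app name is just an assumption):

```python
# Resolve a platform-appropriate persistent data directory.
from pathlib import Path
from appdirs import user_data_dir

DATA_DIR = Path(user_data_dir("wn"))
# e.g. ~/.local/share/wn on Linux, ~/Library/Application Support/wn on macOS
DATA_DIR.mkdir(parents=True, exist_ok=True)
```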

fcbond commented 4 years ago

nltk is likely to want to distribute the data, so can you pull it from there when needed?

Check to see if it has already been installed into ~/nltk_data or /usr/local/nltk or wherever, and if not, ask the user to download it (like in nltk)?
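A quick sketch of that check (the candidate paths follow NLTK's conventions, but the exact list here is illustrative):

```python
# Look for an existing NLTK wordnet installation before downloading.
from pathlib import Path
from typing import Optional

CANDIDATES = [
    Path.home() / "nltk_data",
    Path("/usr/local/share/nltk_data"),
    Path("/usr/local/lib/nltk_data"),
]

def find_nltk_wordnet() -> Optional[Path]:
    for base in CANDIDATES:
        wn_dir = base / "corpora" / "wordnet"
        if wn_dir.is_dir():
            return wn_dir
    return None  # not found: prompt the user to download, as NLTK does
```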

goodmami commented 4 years ago

I think it shouldn't be too hard to maintain a mapping from project codes to lists of releases for each project, assuming there's some LMF on the web to download. If someone tries to load one that hasn't been downloaded and installed, we can raise an informative error so they know how to fix the problem. I also like the idea of using versions in ~/nltk_data/ if they are there, but then we might need a WNDB reader.
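Something like the following could serve as that index (the project codes, versions, and URLs here are placeholders):

```python
# Hypothetical index mapping project codes to downloadable releases.
PROJECTS = {
    "pwn": {
        "3.0": "https://example.com/pwn-3.0.xml.gz",
        "3.1": "https://example.com/pwn-3.1.xml.gz",
    },
}

def load(code: str, version: str):
    try:
        url = PROJECTS[code][version]
    except KeyError:
        raise LookupError(
            f"no known release for {code}:{version}; "
            f"known projects: {', '.join(sorted(PROJECTS))}"
        ) from None
    # ... fetch `url` if not already downloaded, then parse the LMF file ...
```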

But if the NLTK uses this module and wants to use its own packaging, it might need to maintain those separately. I guess we won't know until we get something working.