aparrish / pycorpora

A simple Python interface for Darius Kazemi's Corpora Project.
MIT License

How to update Corpora data? #1

Closed hugovk closed 9 years ago

hugovk commented 9 years ago

Once installed, what's the best way to update data from the Corpora Project, 1) on the command line, 2) using the pycorpora library/Python?

aparrish commented 9 years ago

I'd thought about this but hadn't come up with a great solution. The way that works right now is to use pip to force a reinstall of the package:

pip install --upgrade --force-reinstall pycorpora
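For the Python half of the question, the same reinstall could be scripted. This is a minimal sketch, assuming pip is available to the running interpreter; the helper names `build_reinstall_command` and `reinstall` are hypothetical and not part of pycorpora.

```python
import subprocess
import sys


def build_reinstall_command(package="pycorpora"):
    """Build the pip command from this thread as an argument list."""
    # Using sys.executable targets the pip that belongs to the current
    # interpreter (and therefore the current virtualenv, if any).
    return [
        sys.executable, "-m", "pip", "install",
        "--upgrade", "--force-reinstall", package,
    ]


def reinstall(package="pycorpora"):
    """Run the reinstall; raises CalledProcessError if pip fails."""
    subprocess.check_call(build_reinstall_command(package))
```

Calling `reinstall()` runs pip in a child process. A program that has already imported the package may need to be restarted before it sees the refreshed data.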

Hopefully this is feasible for folks in the short term, but I'm open to other ideas, with a few caveats...

My main "philosophical" goal for this library was to make it as easy to install as possible—just pip install it and forget it. I don't like how (e.g.) nltk requires corpus downloads as a second step (i.e. nltk.download()), since in my experience it's confusing to newcomers and it makes it more difficult for the library installation process to be easily repeatable—a requirements.txt file isn't enough to specify all of your dependencies.

That's also the reason that the library installs the corpora project data in the package—Python manages the location of the data for you, so you don't have to worry about having permission to write to /usr/share (or whatever), or having to remember where in your home directory you stashed the data.

The drawback of this is that the data is stored in the package, so it needs to be managed with special care (I'm using pkg_resources for this). There isn't an easy/official way (as far as I can tell) to modify package data externally and on-the-fly, and it's especially complicated if the package is stored in a zipped egg. So an in-code solution (e.g., something like pycorpora.download()) is probably out of the question without non-trivial haxx and/or sploits to either (a) do magic to update package data or (b) abstract out the portions of the library that operate on package resources so that they can work with a plain directory in the filesystem as well.
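Option (b) might look roughly like the sketch below: check a plain directory first and fall back to the data bundled in the package. Everything here is an assumption made for illustration: the `load_corpus` name, the `data_dir` parameter, the `data/` prefix, and the example file layout are hypothetical, and the stdlib `pkgutil.get_data` stands in for `pkg_resources`. It is not how the library is actually implemented.

```python
import json
import os
import pkgutil


def load_corpus(relative_path, data_dir=None, package="pycorpora"):
    """Load a JSON corpus, preferring an external directory when given.

    relative_path: slash-separated path, e.g. "animals/dogs.json"
        (hypothetical layout).
    data_dir: optional plain directory holding a newer copy of the data.
    package: package whose bundled "data" directory is the fallback
        (the "data/" prefix is an assumption).
    """
    if data_dir is not None:
        candidate = os.path.join(data_dir, *relative_path.split("/"))
        if os.path.isfile(candidate):
            with open(candidate, encoding="utf-8") as handle:
                return json.load(handle)
    # Fall back to the copy shipped inside the installed package.
    # pkgutil.get_data goes through the package loader, so it can also
    # read from zipped packages; it returns None if it cannot find one.
    raw = pkgutil.get_data(package, "data/" + relative_path)
    if raw is None:
        raise FileNotFoundError(relative_path)
    return json.loads(raw.decode("utf-8"))
```

With something like this, an updated checkout of the Corpora Project could be pointed to via `data_dir` without touching the installed package, at the cost of the user having to manage that directory themselves.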

aparrish commented 9 years ago

Now that there's a note about this in the documentation, I'm going to close this issue.