aparrish / pycorpora

A simple Python interface for Darius Kazemi's Corpora Project.
MIT License

installation problems with pip 7+ #8

Open aparrish opened 8 years ago

aparrish commented 8 years ago

After upgrading pip to the newest version, installation of pycorpora fails. Or, more specifically: the library installs fine, but the data files are missing. After some investigation, it appears that in recent versions of pip, packages downloaded from PyPI are locally cached as wheels when first installed; subsequent installations of cached packages circumvent the build process entirely. Right now, setup.py downloads and installs the corpora project data as part of the build process; if the build process doesn't run, no data is downloaded, and so you'll have sessions that look like:

>>> import pycorpora
>>> pycorpora.words
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'words'

It looks like you can tell pip to not use pre-cached wheels by invoking the command like so:

pip install --no-cache-dir pycorpora

... so that's the short-term workaround. I'm not sure what a more permanent fix would be; it would probably take one of the following forms:

aparrish commented 8 years ago

There might also be some magic flag I can set somewhere in the tool chain that tells pip or pypi or someone to never try to make a wheel for this package, even locally for caching purposes. I can't find such a thing right now, but that's another avenue of investigation.

fitnr commented 8 years ago

I think that the first method is used in the moviepy package. When you import it, it checks if a certain utility is available. If not, it downloads it.

aparrish commented 8 years ago

@fitnr hmm, that is definitely another option. I'm not a huge fan of that approach because making a network request on import seems like unexpected behavior and means that your program could randomly fail or take a long time when it first launches, depending on network availability. Plus, if you wanted to make an application that could be used offline, you'd have to make sure to import the library first before you packaged it up. shrug

aparrish commented 8 years ago

Another option would be to simply include the corpora project data as part of the package source, then periodically release new versions of the package with updated data. (pro: simple; con: constantly answering the question, "why can't I use the data I just pull-requested into the corpora project with pycorpora?" etc)

fitnr commented 8 years ago

I agree about auto-downloading on import. Are github bots a thing? Changed corpora -> checked into pycorpora -> deploy

fitnr commented 8 years ago

The benefit of writing more software to deploy software is that it's potentially an infinite loop :)

leonardr commented 6 years ago

After talking with Allison about this problem, I think it would be useful to include a version of the corpora zip without worrying too much about keeping it updated. Even an old version will satisfy most people.

On top of that we can add a downloader for mirroring the most up-to-date version to a specific directory, and loading corpora from a directory. Each installation would need to decide which directory was right for it.

And on top of that we could adapt the nltk implementation of default_download_dir (http://www.nltk.org/_modules/nltk/downloader.html) to pick a good default download directory. (nltk is Apache licensed.)
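
As a rough illustration of the "pick a good default download directory" idea, here is a minimal sketch. It is not nltk's code and not anything pycorpora ships: the function body, the `PYCORPORA_DATA` environment variable, and the `pycorpora_data` directory name are all assumptions made up for this example.

```python
import os
import sys
from pathlib import Path


def default_download_dir(app_name="pycorpora_data"):
    """Pick a directory for downloaded corpora (hypothetical helper)."""
    # An explicit override always wins. The variable name is an assumption.
    override = os.environ.get("PYCORPORA_DATA")
    if override:
        return Path(override)

    # Prefer a shared location under the Python prefix when it is writable,
    # so every user of this environment sees the same data.
    if os.access(sys.prefix, os.W_OK):
        return Path(sys.prefix) / "share" / app_name

    # Otherwise fall back to a per-user directory.
    return Path.home() / app_name
```

A downloader could then mirror the corpora ZIP into whatever this returns, while still letting each installation pass its own directory explicitly.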

aparrish commented 6 years ago

Thinking about it a bit more after your pull request, I've been mentally leaning toward just taking the corpora data completely out of the hands of the module and having an init() function or similar that takes a path to a copy of corpora that has already been downloaded (either as the release ZIP file from the repo or as a cloned repo, using e.g. pyfilesystem to abstract across the two).

I'm wary of engineering a situation where what's in this module and what's in the official repo are different, since the easiest way to browse corpora is poking through the GitHub repo, and I'm anticipating a bunch of GitHub issues where people will be like "I'm getting a file not found error for ((file just added to corpora yesterday)), what gives?" Forcing the user to download the files has the benefit of being explicit, admittedly with the drawback of requiring an extra step.

In my head I'm optimizing for two scenarios: first, the afternoon workshop tutorial, and second, the situation in which I've just submitted a pull request to Darius and want to use what I submitted immediately after the PR is accepted. In the former case, the workflow from my proposed implementation is a bit more complicated than the ideal, but still seems pretty simple; you just need to do something like

!curl -L -O https://github.com/dariusk/corpora/archive/master.zip
import pycorpora
pycorpora.init("master.zip")

In the second scenario, you just need to merge upstream into your own local fork and then:

import pycorpora
pycorpora.init("path/to/your/repo")

Or even something radical, like

from pycorpora import Corpora
corpora = Corpora("path to repo or zip")

... moving the functionality of the module into a class, which would have the additional (speculative) benefit of being able to use two different copies of corpora at once. Unfortunately any of these scenarios (aside from just including a copy of the corpora data in this repo) basically mean rewriting the module from scratch. shrug
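
To make the class-based idea concrete, here is a minimal sketch of what such a `Corpora` class might look like. It is an assumption-laden illustration, not the module's actual code: the `get_file` method name, the suffix-matching strategy, and the use of the standard library's `zipfile` in place of pyfilesystem are all choices made for this example. It assumes the JSON files live under a `data/` directory, possibly nested inside a top-level folder such as the one GitHub adds to archive ZIPs.

```python
import json
import zipfile
from pathlib import Path


class Corpora:
    """Hypothetical loader: reads corpora JSON from a directory or a ZIP."""

    def __init__(self, source):
        self.source = Path(source)
        self.is_zip = self.source.is_file() and zipfile.is_zipfile(self.source)
        if not self.is_zip and not self.source.is_dir():
            raise FileNotFoundError(f"no corpora directory or ZIP at {self.source}")

    def _members(self):
        """Return every candidate file path as a POSIX-style string."""
        if self.is_zip:
            with zipfile.ZipFile(self.source) as zf:
                return zf.namelist()
        return [p.relative_to(self.source).as_posix()
                for p in self.source.rglob("*.json")]

    def _read(self, name):
        """Parse one member (as named by _members) as JSON."""
        if self.is_zip:
            with zipfile.ZipFile(self.source) as zf:
                return json.loads(zf.read(name).decode("utf-8"))
        return json.loads((self.source / name).read_text(encoding="utf-8"))

    def get_file(self, *parts):
        """Load data/<parts...>.json, e.g. get_file("words", "nouns")."""
        suffix = "data/" + "/".join(parts) + ".json"
        # Match on the path suffix so a leading top-level folder is tolerated.
        for name in self._members():
            if name == suffix or name.endswith("/" + suffix):
                return self._read(name)
        raise KeyError(f"{suffix} not found in {self.source}")
```

Because each instance holds its own source path, two copies of corpora (say, the release ZIP and a local fork) can be loaded side by side, which is the speculative benefit mentioned above.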

leonardr commented 6 years ago

I'm going to try including a copy of corpora in olipy. I'll change my corpus-loading API to be compatible with pycorpora, so that when we come up with a better solution I can switch over easily.