cmaimone closed 2 months ago
I did install spacy and downloaded the language model, but I'm still getting the error message - probably because I installed qc via pipx and qc is running in a different environment than my base python installation. I'm not sure actually how I'd add spacy to the environment that qc is running in
The spacy library is installed during qc's installation, but the language model is not--spacy's authors strongly discourage automatic installation of models, so I defer to them and require the user to install a language model manually when it is needed.
I had a similar issue: I installed the model and it still wasn't found... here's a debugging thread that includes the case that appears to be yours, running within conda:
conda install -c conda-forge spacy-model-en_core_web_sm
Any luck?
No - I think the issue is that qc is installed via pipx, and that creates a whole separate Python environment. So Python running in that environment doesn't know to look in the places that the spacy models would be when I install them outside of that environment, with pip or conda. For example, when I install the language model, it ends up in:
/opt/anaconda3/lib/python3.11/site-packages/en_core_web_sm/en_core_web_sm-3.7.1
but it should probably be in something like:
~/.local/pipx/venvs/qualitative-coding/lib/python3.11/site-packages/spacy/
So I think either you'd need to use an environment variable in qc to look for the library in a specific place (and also direct people to download it there), or I think you can make it a dependency in pyproject.toml, which would then get pipx to include it. See for example https://github.com/explosion/spaCy/discussions/12399 or https://stackoverflow.com/questions/76314229/how-to-download-spacy-models-in-a-poetry-managed-environment
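One way to confirm this diagnosis is to ask a given interpreter where it lives and whether it can see the model package. The snippet below is a minimal sketch using only the standard library; the function name describe_environment is made up for illustration and is not part of qc or spacy.

```python
import importlib.util
import sys


def describe_environment(package: str = "en_core_web_sm") -> dict:
    """Report which interpreter is running and whether `package` is importable from it.

    Hypothetical helper for illustration; not part of qc or spacy.
    """
    # find_spec returns None when a top-level package cannot be found
    # on this interpreter's import path.
    spec = importlib.util.find_spec(package)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "package": package,
        "importable": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once with the base Python and once with the interpreter inside the pipx venv (presumably under the venv's bin/ directory, going by the path above) should show two different prefixes, with the model importable from only one of them.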
Ooh, thanks! I will see if I can get this to work... I'd prefer not to have to prompt the user to download a model.
I think the idea would be for qc to lazily download the model when it's first needed. This would reduce the package size for users who don't use NLP features, and would also allow model specification in the settings file.
I should be able to provide a new release and follow up on open issues toward the end of the week... gotta get my fall courses launched!
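For anyone following along, the lazy-download idea could be sketched roughly as below. This is an illustration under stated assumptions, not the code that ended up in qc: the name load_model_lazily and its load/download parameters are hypothetical, added so the logic can be exercised without spacy installed. The defaults rely on spacy.load, which raises OSError when a model is missing, and on spacy.cli.download.

```python
from typing import Any, Callable, Optional


def load_model_lazily(
    name: str = "en_core_web_sm",
    load: Optional[Callable[[str], Any]] = None,
    download: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Load a spaCy pipeline, downloading it on first use if it is missing.

    Hypothetical helper; `load` and `download` are injectable for testing.
    """
    if load is None or download is None:
        # Import spaCy only when the NLP feature is actually used.
        import spacy
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(name)
    except OSError:
        # spacy.load raises OSError when the named model is not installed.
        download(name)
        return load(name)
```

As far as I know, spacy's downloader installs the model with pip using the currently running interpreter, which is what would make this approach land the model in the pipx venv rather than the base environment. In some setups a freshly installed model may not be importable until the process restarts, so treat the second load call as an assumption to verify.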
I have just released 1.5.2, which uses the following strategy:
[...] qc corpus anonymize is invoked. The download strategy uses the spacy package's built-in functionality, maximizing compatibility. I have tested this in both a project virtual environment and the pipx environment.
After trying a variety of options, this approach feels best to me. Here are my rationales for rejecting alternatives:
[...] qc in the future, and this will probably involve allowing the user to select the local or remote language model they want to use. The more accurate spacy model, en_core_web_trf, is almost 500 MB, so automatically installing models will become less and less attractive.
[...] pyproject.toml. I tried to do this repeatedly, and couldn't find a way to get PyPI to accept the package. In the end, it feels like the wrong approach anyway, as I would be duplicating functionality (downloading and installing models) which is already provided by spacy, and which is properly the responsibility of spacy. I'm reluctant to take on the maintenance and support burden of this redundant functionality.
Does this feel like a reasonable approach? If so, I believe this issue can be closed. Thank you again for the work you've put into this review.
I didn't see anywhere where the need for spacy was noted in the documentation - until I tried to use anonymize. The message points to what is needed, but if someone doesn't have spacy already installed, then the suggestion won't work:
re: https://github.com/openjournals/joss-reviews/issues/7031