cproctor / qualitative-coding

Qualitative coding for computer scientists

anonymize and spacy #62

Closed cmaimone closed 2 months ago

cmaimone commented 3 months ago

I didn't see the need for spacy noted anywhere in the documentation until I tried to use anonymize. The message points to what is needed, but if someone doesn't already have spacy installed, the suggestion won't work:

% qc corpus anonymize
A language model is required to run this task. Please run:
python -m spacy download en_core_web_sm

% python -m spacy download en_core_web_sm
/opt/anaconda3/bin/python: No module named spacy

re: https://github.com/openjournals/joss-reviews/issues/7031

cmaimone commented 3 months ago

I did install spacy and downloaded the language model, but I'm still getting the error message, probably because I installed qc via pipx and qc is running in a different environment than my base python installation. I'm not actually sure how I'd add spacy to the environment that qc is running in.

cproctor commented 3 months ago

The spacy library is installed during qc's installation, but the language model is not. spacy's authors strongly discourage automatic installation of models, so I defer to them and require the user to install a language model manually when it is needed.

I had a similar issue of installing the model and then having it not found... here's a debugging thread, including for the case that appears to be yours, of running within conda:

conda install -c conda-forge spacy-model-en_core_web_sm

Any luck?

cmaimone commented 3 months ago

No - I think the issue is that qc is installed via pipx, and that creates a whole separate Python environment. So Python running in that environment doesn't know to look in the places that the spacy models would be when I install them outside of that environment, with pip or conda. For example, when I install the language model, it ends up in:

/opt/anaconda3/lib/python3.11/site-packages/en_core_web_sm/en_core_web_sm-3.7.1

but it should probably be in something like:

~/.local/pipx/venvs/qualitative-coding/lib/python3.11/site-packages/spacy/

So I think either you'd need to use an environment variable in qc to look for the library in a specific place (and also direct people to download it there), or I think you can make it a dependency in pyproject.toml, which would then get pipx to include it. See for example https://github.com/explosion/spaCy/discussions/12399 or https://stackoverflow.com/questions/76314229/how-to-download-spacy-models-in-a-poetry-managed-environment
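
One way to check the diagnosis above is to ask an interpreter where it lives and whether it can see the model package. The following is a minimal sketch and not part of qc: the helper name `describe_model_visibility` is hypothetical, and it relies only on the standard library. Running it once with the base Python and once with the Python inside the pipx environment should show whether the two disagree.

```python
import importlib.util
import sys


def describe_model_visibility(package="en_core_web_sm"):
    """Report which interpreter is running and whether it can import `package`.

    `package` is a top-level package name; spaCy models installed with pip or
    conda are ordinary importable packages, so find_spec can locate them.
    """
    spec = importlib.util.find_spec(package)
    return {
        "executable": sys.executable,  # path of the running interpreter
        "prefix": sys.prefix,          # root of the active environment
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_model_visibility().items():
        print(f"{key}: {value}")
```

If `found` is False under the interpreter that runs qc but True under the base interpreter, the model was installed into an environment that qc cannot see.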

cproctor commented 3 months ago

Ooh, thanks! I will see if I can get this to work... I'd prefer not to have to prompt the user to download a model.

I think the idea would be for qc to lazily download the model when it's first needed. This would reduce the package size for users who don't use NLP features, and would also allow model specification in the settings file.
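
A rough sketch of what lazy downloading could look like, under stated assumptions: this is not qc's actual implementation, the function name `load_model_lazily` is hypothetical, and the `loader` and `downloader` parameters exist only so the logic can be exercised without spaCy installed. By default it falls back to `spacy.load` and `spacy.cli.download`, which install the model into whichever environment the running interpreter belongs to.

```python
import importlib


def load_model_lazily(name="en_core_web_sm", loader=None, downloader=None):
    """Load a spaCy model, downloading it on first use if it is missing.

    `loader` and `downloader` are injectable for testing (an assumption of
    this sketch); when omitted, spaCy's own functions are used.
    """
    if loader is None or downloader is None:
        # Import spaCy only when needed, so users who never touch
        # NLP features do not pay the import cost.
        import spacy
        from spacy.cli import download as spacy_download

        loader = loader or spacy.load
        downloader = downloader or spacy_download

    try:
        return loader(name)
    except OSError:
        # spacy.load raises OSError when the model cannot be found.
        downloader(name)
        # Make sure the freshly installed package is visible to the
        # import system before retrying.
        importlib.invalidate_caches()
        return loader(name)
```

Because the download runs inside the same interpreter as qc, the model should land in the pipx environment rather than the base installation. The model name could be read from the settings file instead of being hard-coded.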

I should be able to provide a new release and follow up on open issues toward the end of the week... gotta get my fall courses launched!

cproctor commented 2 months ago

I have just released 1.5.2, which uses the following strategy:

After trying a variety of options, this approach feels best to me. Here are my rationales for rejecting alternatives:

Does this feel like a reasonable approach? If so, I believe this issue can be closed. Thank you again for the work you've put into this review.