Closed — AmyOlex closed this issue 4 years ago
There were a few other NLTK downloads we needed, but I think we can get away with `python -m nltk.downloader stopwords punkt averaged_perceptron_tagger wordnet`, which is only ~100 MB instead of 2-3 GB for the full download.
@AmyOlex You can test this by running a fresh Docker container. Testing locally will give polluted results, because NLTK downloads its data to an AppData directory on your machine rather than storing it in the virtual environment.
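A minimal sketch of what that fresh-container check could look like, assuming a `python:3.8-slim` base image and that NLTK is installed via pip (the base image and install steps here are illustrative, not TopExApp's actual Dockerfile):

```dockerfile
FROM python:3.8-slim
RUN pip install nltk
# Targeted download (~100 MB) instead of the full 2-3 GB nltk_data collection
RUN python -m nltk.downloader stopwords punkt averaged_perceptron_tagger wordnet
```

Building this from scratch (no cache) confirms the targeted downloader pulls everything the app needs, with no AppData directory to mask a missing corpus.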
Switching to spaCy.
We had discussed that the NLTK installation takes a long time. I checked, and we really only need one small file from the stopwords corpus. So I think we can skip the downloader entirely and bundle the default NLTK stopwords file in the TopExApp or TopEx package. The stopwords corpus is #70 in this list: http://www.nltk.org/nltk_data/
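If we bundle the file, loading it is trivial: NLTK's stopwords files are plain text with one lowercase word per line. A small sketch, using a temporary file with a few sample words to stand in for the bundled `english` file (the `load_stopwords` helper and the sample words are illustrative, not part of TopEx):

```python
import os
import tempfile

def load_stopwords(path):
    """Load a stopwords file in NLTK's format: one lowercase word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# A few sample entries standing in for the bundled
# nltk_data/corpora/stopwords/english file.
sample = "i\nme\nthe\nand\nof\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "english")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    stopwords = load_stopwords(path)
    tokens = ["the", "topic", "and", "model"]
    content = [t for t in tokens if t not in stopwords]
    print(content)  # → ['topic', 'model']
```

Since the file is just a word list, shipping it in the package removes the downloader dependency entirely.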