NLTK downloader not needed

VCUWrightCenter / TopExApp

TopExApp is a graphical user interface for the TopEx Python package. TopEx allows the exploration of topics present in a group of text documents by clustering sentences together that relay common ideas or themes.

GNU General Public License v3.0


NLTK downloader not needed #24

Closed: AmyOlex closed this issue 4 years ago

AmyOlex commented 4 years ago

We had discussed that the NLTK installation is taking a long time. I checked, and we really only need one small file from the stopwords corpus. So I think we can skip the downloader entirely and just include the default NLTK stopwords file in the TopExApp or TopEx package. The Stopwords corpus is #70 in this list: http://www.nltk.org/nltk_data/
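A minimal sketch of what bundling the stopwords file could look like, assuming the file keeps the plain-text layout of the NLTK stopwords corpus (one word per line). The function names and the file path are hypothetical and are not taken from TopEx:

```python
from pathlib import Path


def load_stopwords(path):
    """Read a stopword list stored one word per line, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]
```

With a bundled file at a hypothetical location such as data/english_stopwords.txt, calling load_stopwords on that path once at startup would replace the corpus download for this one use.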

etfrenchvcu commented 4 years ago

There were a few other NLTK downloads we need, but I think we can get away with running `python -m nltk.downloader stopwords punkt averaged_perceptron_tagger wordnet`, which is only ~100MB instead of 2-3GB for the full download.
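The same command could be driven from a setup script. The sketch below only wraps the command quoted above in a subprocess call; the helper names are hypothetical, and it assumes NLTK is already installed in the interpreter being used:

```python
import subprocess
import sys

# The four packages named in the comment above.
REQUIRED_PACKAGES = ["stopwords", "punkt", "averaged_perceptron_tagger", "wordnet"]


def downloader_command(packages=REQUIRED_PACKAGES, python=sys.executable):
    """Build the argument list for `python -m nltk.downloader <packages...>`."""
    return [python, "-m", "nltk.downloader", *packages]


def download_required(packages=REQUIRED_PACKAGES):
    """Run the downloader in a subprocess.

    Raises subprocess.CalledProcessError if the command exits with a
    non-zero status.
    """
    subprocess.run(downloader_command(packages), check=True)
```

Using sys.executable keeps the download tied to whichever interpreter (and virtual environment) runs the script.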

etfrenchvcu commented 4 years ago

@AmyOlex You can test this by running a fresh docker container. A local run will give misleading results because NLTK downloads data to an AppData directory on your machine rather than storing it in the virtual environment, so data from an earlier download would still be found.
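One possible way to keep the data out of the shared per-user directory is to pass an explicit download_dir to nltk.download. This is a sketch, not what the project did: the helper names are hypothetical, and it assumes a folder named nltk_data under sys.prefix is an acceptable location, adding it to nltk.data.path in case it is not already searched:

```python
import sys
from pathlib import Path


def env_nltk_data_dir(prefix=None):
    """Return <environment prefix>/nltk_data; defaults to the active interpreter's prefix."""
    return Path(prefix if prefix is not None else sys.prefix) / "nltk_data"


def download_into_env(packages, prefix=None):
    """Download the given NLTK packages into the environment-local directory."""
    import nltk  # imported lazily so the path helper works without NLTK installed

    target = env_nltk_data_dir(prefix)
    target.mkdir(parents=True, exist_ok=True)
    for name in packages:
        nltk.download(name, download_dir=str(target), quiet=True)
    # Make sure lookups also search the environment-local directory.
    if str(target) not in nltk.data.path:
        nltk.data.path.append(str(target))
    return target
```

Deleting the virtual environment then removes the data with it, which makes a local run behave more like a fresh container.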

AmyOlex commented 4 years ago

Switching to spaCy.