Add dependencies for the NLTK library

QilongChan commented 5 years ago

Expected Behavior

Many NLTK API methods need additional data packages (listed below) to be downloaded before they work. They can all be installed with these commands:

import nltk
nltk.download('all')

The full list of dependencies:

averaged_perceptron_tagger
averaged_perceptron_tagger_ru
maxent_treebank_pos_tagger
universal_tagset
porter_test
rslp
vader_lexicon
bllip_wsj_no_aux
moses_sample
wmt15_eval
word2vec_sample
mwa_ppdb
perluniprops
tagsets
basque_grammars
book_grammars
large_grammars
sample_grammars
spanish_grammars
maxent_ne_chunker
abc
alpino
biocreative_ppi
brown
brown_tei
cess_cat
cess_esp
chat80
city_database
cmudict
comparative_sentences
conll2000
conll2002
crubadan
dependency_treebank
dolch
europarl_raw
floresta
framenet_v15
framenet_v17
gazetteers
genesis
gutenberg
ieer
inaugural
indian
kimmo
lin_thesaurus
mac_morpho
movie_reviews
mte_teip5
names
nonbreaking_prefixes
nps_chat
omw
opinion_lexicon
paradigms
pe08
pil
pl196x
ppattach
problem_reports
product_reviews_1
product_reviews_2
pros_cons
ptb
qc
rte
senseval
sentence_polarity
sentiwordnet
shakespeare
sinica_treebank
smultron
state_union
stopwords
subjectivity
swadesh
switchboard
timit
toolbox
treebank
twitter_samples
udhr
udhr2
unicode_samples
verbnet
verbnet3
webtext
wordnet
wordnet_ic
words
ycoe

Actual Behavior

Potential Solution

Reproducing the Problem

System Information

Checklist

[x] I have completely filled out this template
[x] I have confirmed that this issue exists on the current master branch
[x] I have confirmed that this is not a duplicate issue by searching issues
[x] I have provided detailed steps to reproduce the issue

AlexCatarino commented 5 years ago

If we add all models

nltk.download('all')

against only punkt

nltk.download('punkt')

used in the example in #3370, the Docker image is about a third bigger (12 GB vs 9 GB), so we should evaluate the need for each model before adding it.
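
For illustration, a minimal sketch of how the image build could download only a shortlist of packages instead of 'all'. The helper name download_packages and the REQUIRED_PACKAGES list are hypothetical, not something in the repo; the only NLTK call assumed is nltk.download, which returns True on success and False on failure.

```python
# Hypothetical shortlist: which packages are really needed is the open
# question in this thread, so treat this list as a placeholder.
REQUIRED_PACKAGES = ["punkt"]


def download_packages(packages, downloader=None):
    """Download each NLTK data package; return the ids that failed."""
    if downloader is None:
        # Imported lazily so the helper can be exercised with a stub
        # downloader on a machine without NLTK or network access.
        import nltk
        downloader = nltk.download

    failed = []
    for package in packages:
        # nltk.download returns a boolean indicating success.
        if not downloader(package, quiet=True):
            failed.append(package)
    return failed
```

Calling download_packages(REQUIRED_PACKAGES) during the build and failing the build when the returned list is non-empty would keep the image limited to the packages we decide on.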

jaredbroad commented 5 years ago

@AlexCatarino will punkt alone serve 90%+ of user needs? If not, what combination of dependencies will achieve 90%+? Otherwise, we'll be back here in two months adding another dependency =)

While we're here, we should also add "OpenNLP" for C# algorithms.

QilongChan commented 5 years ago

For the data I've downloaded locally, it is the corpora (text data sets) that take up almost 90% of the space (2.8 GB of 3.2 GB). Since that data is mostly unrelated to finance, I think punkt alone would be enough for someone who really wants to use this package to build an algorithm.
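
In case anyone wants to check the breakdown on their own machine, here is a rough sketch that sums file sizes per top-level folder of an NLTK data directory (corpora, tokenizers, taggers and so on). The function name size_by_category is made up for this example, and the directory location is an assumption: it is often ~/nltk_data, but nltk.data.path lists the directories NLTK actually searches.

```python
from pathlib import Path


def size_by_category(data_dir):
    """Return {folder name: total bytes} for each top-level folder of data_dir."""
    sizes = {}
    for child in Path(data_dir).iterdir():
        if child.is_dir():
            # Sum the size of every regular file below this folder.
            sizes[child.name] = sum(
                f.stat().st_size for f in child.rglob("*") if f.is_file()
            )
    return sizes
```

For example, size_by_category(Path.home() / "nltk_data") returns a dictionary that can be sorted by value to see which category dominates.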

QuantConnect / Lean