carlos19silva94 opened 7 years ago
You can install praw even if you don't use it. Alternatively, you can remove the following lines from train.py:

`import dataset_builder`

and

`dataset_builder.download_reddit_news_data()`
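Roughly, the edit is a sketch like this, assuming both lines sit at the top level of train.py (exact positions may differ in your checkout):

```python
# train.py -- dataset_builder only matters for re-downloading the Reddit
# data, so both references to it can be commented out (or deleted):

# import dataset_builder                        # this import pulls in praw

# dataset_builder.download_reddit_news_data()   # skip the Reddit download
```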
I still got some errors:
```
Traceback (most recent call last):
  File "train.py", line 23, in <module>
    with open('wordsEn.txt') as f:
IOError: [Errno 2] No such file or directory: 'wordsEn.txt'
```
I created that file, copied the content from clickbait.dat into it, and then got this:
```
Processing dataset
Train size: 12336
Validazion size: 1449
Test size: 723
Traceback (most recent call last):
  File "train.py", line 55, in <module>
    train_set = vectorizer.fit_transform(train_set, train_labels)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 1352, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 762, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 241, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "train.py", line 13, in tokenize
    tokens = nltk.word_tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 130, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 96, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 814, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 932, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 653, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/vagrant/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************
```
PS: do you have a contact method better than this issues panel? :)
As the error says, you first have to run `nltk.download()` from the Python console.
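For example:

```python
import nltk

# Fetch just the Punkt sentence-tokenizer models named in the LookupError;
# calling nltk.download() with no argument opens the interactive downloader.
nltk.download('punkt')
```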
I have downloaded that, but now:
```
$ python train.py
Processing dataset
Train size: 12336
Validazion size: 1449
Test size: 723
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:130: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  if np.rank(self.data) != 1 or np.rank(self.indices) != 1 or np.rank(self.indptr) != 1:
/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py:200: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  if np.rank(self.data) != 1 or np.rank(self.row) != 1 or np.rank(self.col) != 1:
Train matrix shape: (12336, 28950)
Validation matrix shape: (1449, 28950)
Test matrix shape: (723, 28950)
Fitting Model
Killed
```
and I can't find where to replace `rank` with `ndim`.
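For what it's worth, the deprecated `np.rank` calls in those warnings live inside the installed scipy package rather than in this project, so there is nothing to edit in train.py itself. A minimal sketch of silencing the warning instead, assuming upgrading scipy isn't an option:

```python
import warnings
import numpy as np

# The np.rank calls triggering the warning are inside scipy's own modules,
# so filter the warning at the process level rather than editing scipy.
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
```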
I've just cloned the project and ran this command. I don't want to configure Reddit; I just want to add new phrases to test, but that doesn't seem possible.
If I run a test...
Any solution?