LorenzoNorcini / Clickbait-Detector

Clickbait titles detection using SVM

Error while running train.py #1

Open carlos19silva94 opened 7 years ago

carlos19silva94 commented 7 years ago

I've just cloned the project and ran the command below. I don't want to configure Reddit access, I just want to add new phrases to test, but it's not possible:

$ python train.py 
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    import dataset_builder
  File "/home/vagrant/tese/svm/Clickbait-Detector/dataset_builder.py", line 1, in <module>
    import praw
ImportError: No module named praw

If I run a test...

$ python predict.py "this is a test headline"
Model not present, run train.py first
Traceback (most recent call last):
  File "predict.py", line 31, in <module>
    print "headline is " + str(int(predict([sys.argv[1]])*100)) + "% likely to be clickbait"
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Any solution?

LorenzoNorcini commented 7 years ago

You can install praw even if you don't use it. Alternatively, you can remove the following lines from train.py: `import dataset_builder` and `dataset_builder.download_reddit_news_data()`.
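If you'd rather keep the Reddit code around without requiring praw, a third option is to make the import optional. This is only a sketch, not the repo's actual code; the `HAS_PRAW` flag and `maybe_download_reddit_data` helper are hypothetical names:

```python
# Sketch: make the praw dependency optional so train.py can run
# without it installed.
try:
    import praw  # only needed to download fresh Reddit headlines
    HAS_PRAW = True
except ImportError:
    HAS_PRAW = False

def maybe_download_reddit_data():
    # Skip the Reddit download step entirely when praw is missing.
    if not HAS_PRAW:
        print("praw not installed; skipping Reddit download")
        return
    # the call to dataset_builder.download_reddit_news_data()
    # would go here when praw is available
```

With this guard, `python train.py` would proceed on the bundled dataset even on machines where praw was never installed.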

carlos19silva94 commented 7 years ago

I still got some errors:

```
Traceback (most recent call last):
  File "train.py", line 23, in <module>
    with open('wordsEn.txt') as f:
IOError: [Errno 2] No such file or directory: 'wordsEn.txt'
```

I created that file, copied the content from clickbait.dat, and then got this:

```
Processing dataset

Train size: 12336
Validazion size: 1449
Test size: 723
Traceback (most recent call last):
  File "train.py", line 55, in <module>
    train_set = vectorizer.fit_transform(train_set, train_labels)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 1352, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 762, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 241, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "train.py", line 13, in tokenize
    tokens = nltk.word_tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 130, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 96, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 814, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 932, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 653, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/vagrant/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
```

PS - do you have a contact method better than this issues panel? :)

LorenzoNorcini commented 7 years ago

As the error says, you first have to run `nltk.download()` from the Python console.
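For reference, the specific resource the traceback is looking for is the Punkt sentence tokenizer, so downloading just that model (rather than the full NLTK data collection) should be enough. A minimal sketch, assuming a standard NLTK install:

```python
import nltk

# Fetch only the Punkt sentence tokenizer models that
# nltk.word_tokenize() relies on (stored under ~/nltk_data by default).
nltk.download('punkt')
```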

carlos19silva94 commented 7 years ago

I have downloaded that, but now:

```
$ python train.py 

Processing dataset

Train size: 12336
Validazion size: 1449
Test size: 723
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:130: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  if np.rank(self.data) != 1 or np.rank(self.indices) != 1 or np.rank(self.indptr) != 1:
/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py:200: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  if np.rank(self.data) != 1 or np.rank(self.row) != 1 or np.rank(self.col) != 1:

Train matrix shape: (12336, 28950)
Validation matrix shape: (1449, 28950)
Test matrix shape: (723, 28950)

Fitting Model
Killed
```

and I can't find a place to replace `rank` with `ndim`.