0xyd / sec2vec

An embedding method for Cyber Threat Intelligence
MIT License

SentenceIterator is exhausted after one loop #27

Closed hannahxchen closed 5 years ago

hannahxchen commented 5 years ago

The SentenceIterator is exhausted after one loop. Once the model starts to train, the log reports "took 0.0s, 0 effective words/s" for every epoch.

Reference: Data streaming in Python: generators, iterators, iterables
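The general Python behaviour behind this can be reproduced without gensim. The following is a minimal sketch, not sec2vec code: `one_shot` is a hypothetical generator function used only for illustration. A generator yields its items once, so a second loop over it sees nothing, whereas a list can be looped over any number of times.

```python
def one_shot(sentences):
    """Hypothetical helper: a plain generator that yields each sentence once."""
    for sentence in sentences:
        yield sentence


sentences = [['hello', 'world', 'test'], ['another', 'sentence', 'test']]

stream = one_shot(sentences)
first_pass = [s for s in stream]   # consumes the generator
second_pass = [s for s in stream]  # the generator is already exhausted

print(len(first_pass))   # 2
print(len(second_pass))  # 0

# A list is a restartable iterable: every loop starts from the beginning.
list_first = [s for s in sentences]
list_second = [s for s in sentences]
print(len(list_first), len(list_second))  # 2 2
```

This is the situation the gensim warning in the logs refers to when it asks for "a corpus that offers restartable iteration = an iterable".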

Code:

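import logging

from embedding import SecWord2Vec

# Enable INFO-level logs (format assumed from the output below)
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO)

# Toy corpus: two tokenized sentences
sentences = [['hello', 'world', 'test'], ['another', 'sentence', 'test']]
keywords = ['hello', 'world']

model = SecWord2Vec(keywords, sentences, min_count=1, size=10, iter=5)
model.train_embed()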

Output logs (with logging enabled):

Epoch Logger is prepared.
2018-12-01 15:25:12,941 - gensim.models.base_any2vec - WARNING - consider setting layer size to a multiple of 4 for greater performance
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - collecting all words and their counts
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - collected 5 word types from a corpus of 6 raw words and 2 sentences
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - Loading a fresh vocabulary
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - effective_min_count=1 retains 5 unique words (100% of original 5, drops 0)
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - effective_min_count=1 leaves 6 word corpus (100% of original 6, drops 0)
2018-12-01 15:25:12,942 - gensim.models.word2vec - INFO - deleting the raw counts dictionary of 5 items
2018-12-01 15:25:12,944 - gensim.models.word2vec - INFO - sample=0.001 downsamples 5 most-common words
2018-12-01 15:25:12,944 - gensim.models.word2vec - INFO - downsampling leaves estimated 0 word corpus (7.5% of prior 6)
2018-12-01 15:25:12,944 - gensim.models.base_any2vec - INFO - estimated required memory for 5 words and 10 dimensions: 2900 bytes
2018-12-01 15:25:12,944 - gensim.models.word2vec - INFO - resetting layer weights
>>> model.train_embed()
2018-12-01 15:25:31,215 - gensim.models.base_any2vec - INFO - training model with 8 workers on 5 vocabulary and 10 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
Epoch #0 start
2018-12-01 15:25:31,218 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 7 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 6 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 5 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 4 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 3 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 2 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 1 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 0 more threads
2018-12-01 15:25:31,219 - gensim.models.base_any2vec - INFO - EPOCH - 1 : training on 6 raw words (0 effective words) took 0.0s, 0 effective words/s
Epoch #0 end - training loss: 0.0
Epoch #1 start
2018-12-01 15:25:31,220 - gensim.models.base_any2vec - WARNING - train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).
2018-12-01 15:25:31,221 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 7 more threads
2018-12-01 15:25:31,221 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 6 more threads
2018-12-01 15:25:31,221 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 5 more threads
2018-12-01 15:25:31,221 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 4 more threads
2018-12-01 15:25:31,221 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 3 more threads
2018-12-01 15:25:31,222 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 2 more threads
2018-12-01 15:25:31,222 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 1 more threads
2018-12-01 15:25:31,222 - gensim.models.base_any2vec - INFO - worker thread finished; awaiting finish of 0 more threads
2018-12-01 15:25:31,222 - gensim.models.base_any2vec - INFO - EPOCH - 2 : training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s
2018-12-01 15:25:31,222 - gensim.models.base_any2vec - WARNING - EPOCH - 2 : supplied example count (0) did not equal expected count (2)
Epoch #1 end - training loss: 0.0
hannahxchen commented 5 years ago

[Image: screenshot, 2018-12-01 at 3:39:47 PM]
hannahxchen commented 5 years ago

Solution:

[Image: screenshot, 2018-12-01 at 10:26:15 PM]

0xyd commented 5 years ago
  1. Passing the input sentences as a list object does not exhaust the iterable.
  2. If the input sentences are a generator, the exhaustion still occurs.
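
One common way to handle the generator case in point 2 is to wrap the generator function in an object whose `__iter__` creates a fresh generator on every loop. The sketch below is an illustration under assumptions, not the fix applied in this repository: `RestartableSentences` and `read_sentences` are hypothetical names, and the epoch loop only counts sentences instead of training a model.

```python
class RestartableSentences:
    """Hypothetical wrapper: restarts the underlying generator on each loop."""

    def __init__(self, make_generator):
        # Store the generator *function*, not an already-started generator.
        self.make_generator = make_generator

    def __iter__(self):
        # Each call builds a brand-new generator, so iteration can restart.
        return self.make_generator()


def read_sentences():
    """Hypothetical data source yielding tokenized sentences."""
    yield ['hello', 'world', 'test']
    yield ['another', 'sentence', 'test']


corpus = RestartableSentences(read_sentences)

# Stand-in for several training epochs: count the sentences seen per pass.
sentences_per_epoch = []
for epoch in range(3):
    sentences_per_epoch.append(sum(1 for _ in corpus))

print(sentences_per_epoch)  # [2, 2, 2]
```

The design choice is to keep the recipe for producing the data rather than the data stream itself, so memory use stays low while every epoch still sees the full corpus.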