cemoody / lda2vec


IndexError: Error calculating span: Can't find end #53

Open dbl001 opened 7 years ago

dbl001 commented 7 years ago

Running on OS X 10.11.6:

```
$ python --version
Python 2.7.11 :: Anaconda custom (x86_64)
```

```
$ python preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.macosx-10.5-x86_64/egg/lda2vec/preprocess.py", line 78, in tokenize
Chop timestamps into days
  File "spacy/tokens/span.pyx", line 65, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3955)
  File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5105)
IndexError: Error calculating span: Can't find end
```

Related to: https://github.com/cemoody/lda2vec/issues/38

dbl001 commented 7 years ago

```
IndexError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
     45 texts = features.pop('comment_text').values
     46 tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
---> 47                                     merge=True)
     48 del texts
     49

/Users/davidlaxer/anaconda/lib/python2.7/site-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in tokenize(texts, max_length, skip, attr, merge, nlp, **kwargs)
     76     for phrase in doc.noun_chunks:
     77         # Only keep adjectives and nouns, e.g. "good ideas"
---> 78         while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
     79             phrase = phrase[1:]
     80         if len(phrase) > 1:

/Users/davidlaxer/anaconda/lib/python2.7/site-packages/spacy-1.7.3-py2.7-macosx-10.5-x86_64.egg/spacy/tokens/span.pyx in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3955)()
     63
     64     def __len__(self):
---> 65         self._recalculate_indices()
     66         if self.end < self.start:
     67             return 0

/Users/davidlaxer/anaconda/lib/python2.7/site-packages/spacy-1.7.3-py2.7-macosx-10.5-x86_64.egg/spacy/tokens/span.pyx in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5105)()
    128         end = token_by_end(self.doc.c, self.doc.length, self.end_char)
    129         if end == -1:
--> 130             raise IndexError("Error calculating span: Can't find end")
    131
    132         self.start = start

IndexError: Error calculating span: Can't find end
```
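For context, the failure comes from the noun-chunk merge loop in lda2vec/preprocess.py shown above: merging one chunk mutates the Doc, which can leave a later chunk's span boundaries unresolvable. A minimal defensive sketch that skips such chunks instead of crashing (the try/except is an assumed workaround, not upstream code, and `bad_deps` mirrors what I believe lda2vec uses):

```python
# Defensive variant of the merge loop from lda2vec/preprocess.py
# (lines 76-80 in the traceback above). The try/except is an assumed
# workaround, not upstream code.
bad_deps = ('amod', 'compound')  # dependency labels lda2vec keeps, I believe

def merge_noun_chunks(doc):
    for phrase in doc.noun_chunks:
        try:
            # Only keep adjectives and nouns, e.g. "good ideas"
            while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                phrase = phrase[1:]
            if len(phrase) > 1:
                # spaCy 1.x span.merge() signature, as lda2vec calls it
                phrase.merge(phrase.root.tag_, phrase.text,
                             phrase.root.ent_type_)
        except IndexError:
            # Merging earlier chunks mutates the Doc and can leave later
            # span boundaries unresolvable; skip those chunks instead.
            continue
    return doc
```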
dbl001 commented 7 years ago

Seems to work with merge=False:

```python
tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
                                    merge=False)
```

(preprocess.py, line 46)

crawfordcomeaux commented 7 years ago

I've run into similar issues (or the same issue) where merge=False resolves things, but what impact does that have on the results besides squashing the error?

AdrianTudC commented 7 years ago

The merge option seems to merge noun phrases (a noun together with its modifiers) into single tokens. I don't think it affects the shape of the topics much, since LDA should be able to handle the individual words anyway.
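For illustration, a minimal sketch of what merging noun chunks does, using the spaCy 2+ retokenizer API (lda2vec itself calls the older spaCy 1.x span.merge()); the en_core_web_sm model is just an example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English model with a parser works
doc = nlp("Good ideas deserve better tools.")

# Collapse each noun chunk into a single token, roughly what merge=True does.
with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        retokenizer.merge(chunk)

print([t.text for t in doc])
# ['Good ideas', 'deserve', 'better tools', '.']
```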

fivejjs commented 7 years ago

I got the same issue. It can be worked around by setting the merge option to False:

```python
tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
                                    merge=False)  # changed merge to False
```
Aravinviju commented 6 years ago

Hi, I'm trying it now with merge=False. Roughly how long should the tokenize function take to run?

Cheers, Arav

Aravinviju commented 6 years ago

Hi all

After I changed to merge=False, I get the following error:

```
OverflowError                             Traceback (most recent call last)
<ipython-input-...> in <module>()
     45 texts = features.pop('comment_text').values
     46 tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
---> 47                                     merge=False)
     48 del texts
     49

/usr/local/lib/python2.7/dist-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in tokenize(texts, max_length, skip, attr, merge, nlp, **kwargs)
    104     data[row, :length] = dat[:length, 0].ravel()
    105     uniques = np.unique(data)
--> 106     vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    107     vocab[skip] = ''
    108     return data, vocab

/usr/local/lib/python2.7/dist-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in <dictcomp>((v,))
    104     data[row, :length] = dat[:length, 0].ravel()
    105     uniques = np.unique(data)
--> 106     vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    107     vocab[skip] = ''
    108     return data, vocab

vocab.pyx in spacy.vocab.Vocab.__getitem__()

OverflowError: can't convert negative value to uint64_t
```

Any pointers on this? Kindly help me out.

Cheers, Arav
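For anyone debugging this, the failing lookup is the dict comprehension at preprocess.py line 106: spaCy's Vocab.__getitem__ casts its key to uint64, so any negative token id blows up. A hypothetical guard, assuming data, nlp, and skip as in the traceback and that skip is the negative padding sentinel:

```python
import numpy as np

# Hypothetical patch around preprocess.py line 106: filter out negative ids
# (the skip/padding sentinel, or garbage from a 32-bit overflow) before the
# spaCy vocab lookup, since Vocab.__getitem__ requires a uint64-safe key.
uniques = np.unique(data)
vocab = {int(v): nlp.vocab[int(v)].lower_ for v in uniques if v >= 0}
vocab[skip] = ''
```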
AdrianTudC commented 6 years ago

You need to run 64-bit Python, and the libraries also need to be 64-bit builds.
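A quick way to check which build is running (standard library only):

```python
import platform
import struct

# A 64-bit interpreter reports a pointer size of 8 bytes (64 bits).
print(struct.calcsize("P") * 8)    # 64 on a 64-bit Python, 32 on a 32-bit build
print(platform.architecture()[0])  # e.g. '64bit'
```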

fathia-ghribi commented 3 years ago

> After I changed to merge=False, I get the following error: [...] OverflowError: can't convert negative value to uint64_t

I'm getting this error too when I try to run preprocess.py. How do I fix it?