chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'." #258

Open radkoff opened 5 years ago

radkoff commented 5 years ago

steps to reproduce

First create the following Corpus, save it to disk, and note that upon reloading you can still get word doc counts:

import textacy
corpus = textacy.Corpus('en', ['Pittsburgh', 'slated for. Stacey designated as moderator'])
corpus.save('foo.textacy')
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())

But then open a new Python shell, load the same corpus from disk, and get an error about a word ID missing from the vocab:

import textacy
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())
Traceback (most recent call last):
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-dff4867a4989>", line 3, in <module>
    print(corpus.word_doc_counts())
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/corpus.py", line 494, in word_doc_counts
    normalize=normalize, weighting="binary", as_strings=as_strings
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/spacier/doc_extensions.py", line 511, in to_bag_of_words
    lex = vocab[wid]
  File "vocab.pyx", line 237, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 44, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 152, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 138, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'."

context

The example above was narrowed down from larger texts. Strangely, removing any more words at this point makes the bug go away. E.g., the following all work:

['Pittsburgh', 'slated for. Stacey designated moderator']
['Pittsburgh', 'slated. Stacey designated as moderator']
['Pittsburgh', 'for. Stacey designated as moderator']
['slated for. Stacey designated as moderator']
['this is doc one', 'this is doc two']

I've run into this with several different corpora (I'm trying to build IDF models).

possible solution?

I'm guessing it has something to do with trying to access the lemmas of words? Maybe the Vocab needs to be serialized along with the docs themselves? https://github.com/explosion/spaCy/issues/2419
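To make the guess above concrete, here is a toy model of the suspected failure mode. It is only a sketch and does not use spaCy or textacy: `ToyStringStore`, `toy_hash`, and `load` are hypothetical names invented for illustration, and the assumption is that saved docs carry only hashes while the hash-to-string table lives in the process that created them.

```python
import hashlib
import json


def toy_hash(text: str) -> int:
    """Deterministic 64-bit hash of a string (a stand-in, not spaCy's hash)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical hash -> string table; it only knows strings it has seen."""

    def __init__(self, strings=()):
        self._by_hash = {}
        for s in strings:
            self.add(s)

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key: int) -> str:
        try:
            return self._by_hash[key]
        except KeyError:
            raise KeyError(f"Can't retrieve string for hash '{key}'.") from None

    def strings(self):
        return sorted(self._by_hash.values())


# "Shell 1": intern the words, then serialize the doc as hashes.
store_a = ToyStringStore()
doc_hashes = [store_a.add(w) for w in ["Stacey", "designated", "as", "moderator"]]
payload_without_strings = json.dumps({"hashes": doc_hashes})
payload_with_strings = json.dumps(
    {"hashes": doc_hashes, "strings": store_a.strings()}
)


def load(payload: str, builtin=("as",)):
    """"Shell 2": a fresh store that starts with only a few built-in words."""
    data = json.loads(payload)
    store = ToyStringStore(builtin)
    # Re-intern any strings that were shipped alongside the hashes.
    for s in data.get("strings", []):
        store.add(s)
    return [store[h] for h in data["hashes"]]
```

In this sketch, `load(payload_with_strings)` resolves every hash, while `load(payload_without_strings)` raises a `KeyError` on the first word the fresh store has never seen. If the real bug works the same way, it would also explain why reloading inside the original shell succeeds: the strings are still in memory there.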

environment

radkoff commented 5 years ago

After upgrading textacy and spacy, the error now seems to be intermittent (or maybe it was before?), so you may have to try loading it in a new shell a few times before it fails.