chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

textacy.io.spacy.read_spacy_docs() gives key error when iterated over #228

Closed judahrand closed 5 years ago

judahrand commented 5 years ago

Expected Behavior

spaCy Docs should be read back into memory correctly.

Current Behavior

from textacy.io.spacy import read_spacy_docs
docs = read_spacy_docs('corpus', format='binary', lang='en_core_web_sm')
next(docs)

Traceback (most recent call last):

  File "<ipython-input-17-76336de71ac8>", line 1, in <module>
    next(docs)

  File "/home/judah.rand@fospha.local/anaconda3/envs/clickz/lib/python3.6/site-packages/textacy/io/spacy.py", line 93, in read_spacy_docs
    text = msg["text"]
KeyError: 'text'

Possible Solution

I believe the issue occurs because msgpack reads the dictionary keys back in as bytes, while the code in textacy.io.spacy.read_spacy_docs() looks them up with ordinary str keys, so msg["text"] never matches the stored key b"text".
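To illustrate the suspected mismatch, here is a minimal sketch in plain Python. It does not call msgpack or textacy: the `msg` dictionary is a hand-built stand-in for what the unpacker appears to return, and `decode_keys` is a hypothetical helper, not part of either library. It only shows why a str lookup fails against bytes keys and one way the keys could be normalized before the lookup.

```python
def decode_keys(msg, encoding="utf-8"):
    """Return a copy of msg with any bytes keys decoded to str.

    Hypothetical helper for illustration; values are left untouched.
    """
    return {
        (key.decode(encoding) if isinstance(key, bytes) else key): value
        for key, value in msg.items()
    }


# Assumed shape of an unpacked message: keys are bytes, not str.
msg = {b"text": b"Hello world.", b"metadata": {}}

lookup_error = None
try:
    msg["text"]  # str key against bytes keys
except KeyError as err:
    lookup_error = err  # same kind of failure as in the traceback

# After normalizing the keys, the str lookup succeeds.
fixed = decode_keys(msg)
text = fixed["text"]
```

If the unpacker can be configured to return str keys directly (msgpack-python has a `raw` option for this, as far as I know), that would likely be cleaner than decoding keys after the fact.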

Steps to Reproduce (for bugs)

Context

This means I can't read my corpus back into memory without using an altered version of textacy.io.spacy.read_spacy_docs().

Your Environment