chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Unable to save corpus with custom extension attributes #254

Closed nyejon closed 4 years ago

nyejon commented 5 years ago

Hi Burton

Sorry, I wasn't able to comment on or update the other issue, #252.

The problem occurs when tokens have custom extension attributes. The following is a minimal example that reproduces the error:

import spacy
import textacy
from spacy.tokens import Token

text_en = "Since the so-called \"statistical revolution\" in the late 1980s and mid 1990s, "

# Register a custom extension attribute on tokens.
Token.set_extension('is_attribute', default=False)

def set_token_attribute(doc):
    # Pipeline component: set the custom attribute on every token.
    for token in doc:
        token._.set('is_attribute', True)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(set_token_attribute)

# Build a corpus from the text, save it, then load it back.
corpus = textacy.Corpus(nlp, data=text_en)
corpus.save('corpus_en.bin.gz')
new_corpus = textacy.Corpus.load(lang=nlp, filepath='corpus_en.bin.gz')

gustavengstrom commented 5 years ago

I think this is indeed a bug. In the load method in corpus.py (line 599), I changed

msg = srsly.msgpack_loads(f.read())

to

msg = srsly.msgpack_loads(f.read(), use_list=False)

This solved it!
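
For anyone wondering why that flag could matter, here is a rough sketch. spaCy stores custom extension values in doc.user_data under tuple keys, and msgpack decodes arrays as lists unless use_list=False is passed. My assumption, which I have not confirmed against the textacy source, is that the load failed because those keys came back as lists, which cannot be used as dict keys. To keep the example free of third-party packages, json stands in for msgpack below; it is not what textacy uses, but it turns tuples into lists in the same way.

```python
import json

# spaCy keeps custom extension values in doc.user_data under tuple keys
# of the form ("._.", attribute_name, start_char, end_char).
key = ("._.", "is_attribute", 0, None)

# json is a stand-in for msgpack here (an assumption made for illustration):
# like msgpack with its default use_list=True, it decodes arrays as lists.
decoded_key = json.loads(json.dumps(key))

user_data = {}
try:
    user_data[decoded_key] = True  # a list is unhashable, so this raises
    error = None
except TypeError as exc:
    error = exc

# Turning the key back into a tuple, which is what use_list=False does
# while decoding, makes it usable as a dict key again.
user_data[tuple(decoded_key)] = True
```

After this runs, error holds a TypeError and user_data contains the original tuple key.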

bdewilde commented 4 years ago

Heads-up, I have a PR open that will fix this issue: https://github.com/chartbeat-labs/textacy/pull/285

Thanks for your patience. I had to take a longer-than-expected break from textacy development to work on other projects. Glad to be back at it. :+1:

bdewilde commented 4 years ago

That update is now available in a release: https://github.com/chartbeat-labs/textacy/releases/tag/0.10.0