explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Doc won't serialize with custom attribute #13281

Closed by avivbrokman 8 months ago

avivbrokman commented 8 months ago

How to reproduce the behaviour

I am trying to call Doc.to_bytes() after extending Doc with a custom attribute. I can serialize the custom attribute's value on its own with a helper function, but Doc.to_bytes() fails once the attribute is set. Here's a minimal reproducible example:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank('en')

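def serialize_spans(obj, attr):
    # Convert each Span into a plain (start_char, end_char) tuple.
    return [(span.start_char, span.end_char) for span in getattr(obj._, attr)]

def deserialize_spans(obj, attr, value):
    # Rebuild Span objects from the stored character offsets.
    setattr(obj._, attr, [obj.char_span(start, end) for start, end in value])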

Doc.set_extension("special_spans", default = list(), to_bytes = serialize_spans, from_bytes = deserialize_spans)

doc = nlp('The quick brown fox jumped over the lazy dog.')
doc._.special_spans = [doc[0:2], doc[4:6]]

# Works well
serialize_spans(doc, 'special_spans')

# Doesn't work
doc.to_bytes()
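For comparison, here is a minimal sketch of one possible workaround. It is not a confirmed fix. It assumes that the failure comes from Span objects stored in doc.user_data, which the msgpack-based serialization cannot encode. It also relies on the fact that the documented arguments of Doc.set_extension are default, getter, setter, method and force, so to_bytes and from_bytes are not among them. The extension name special_span_offsets and the helper get_special_spans are hypothetical names introduced only for this illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hypothetical extension: store plain (start_char, end_char) pairs,
# which are serializable, instead of Span objects.
Doc.set_extension("special_span_offsets", default=[], force=True)


def get_special_spans(doc):
    # Rebuild Span objects on demand from the stored character offsets.
    return [doc.char_span(start, end) for start, end in doc._.special_span_offsets]


# Getter-only extension: nothing is stored for it in doc.user_data.
Doc.set_extension("special_spans", getter=get_special_spans, force=True)

doc = nlp("The quick brown fox jumped over the lazy dog.")
doc._.special_span_offsets = [
    (span.start_char, span.end_char) for span in (doc[0:2], doc[4:6])
]

# Round trip through bytes; the offsets travel inside doc.user_data.
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)

print([span.text for span in restored._.special_spans])
```

The design choice here is to keep only plain Python values in the stored attribute and derive the spans lazily, so the Doc never has to serialize a Span itself.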

Your Environment

svlandeg commented 8 months ago

Hi! Let me move this to the discussion forum and follow up with you there.