allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Add a note about make_serializable argument #484

Closed JohnGiorgi closed 11 months ago

JohnGiorgi commented 1 year ago

By default the abbreviation detector pipe is not serializable, so you run into issues when you try to serialize any docs processed with it:

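import spacy
from spacy.tokens import DocBin
# Importing AbbreviationDetector registers the "abbreviation_detector" factory
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("abbreviation_detector")

# store_user_data=True makes DocBin serialize doc.user_data, which holds the detected abbreviations
doc_bin = DocBin(store_user_data=True)
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily.")
doc_bin.add(doc)
# Throws an error: TypeError: can not serialize 'spacy.tokens.span.Span' object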

It took me a while to figure out that this is easily solved with the make_serializable parameter, but it's not documented anywhere, so I am proposing to add a short note about it to the README.
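
For reference, a minimal sketch of the workaround, assuming make_serializable is passed through the pipe's config dict like other spaCy factory arguments. The helper name serialize_with_abbreviations and the constant ABBREVIATION_CONFIG are made up for illustration, and the imports sit inside the function only so the snippet can be loaded without the model installed.

```python
# Hypothetical constant: the factory config that turns on serializable output.
ABBREVIATION_CONFIG = {"make_serializable": True}


def serialize_with_abbreviations(text, model_name="en_core_sci_sm"):
    """Run the abbreviation detector on `text` and return DocBin bytes.

    Assumes scispacy and the named model are installed.
    """
    import spacy
    from spacy.tokens import DocBin
    # Importing the class registers the "abbreviation_detector" factory.
    from scispacy.abbreviation import AbbreviationDetector  # noqa: F401

    nlp = spacy.load(model_name)
    nlp.add_pipe("abbreviation_detector", config=ABBREVIATION_CONFIG)

    # store_user_data=True keeps the custom extension data with the doc.
    doc_bin = DocBin(store_user_data=True)
    doc_bin.add(nlp(text))
    return doc_bin.to_bytes()
```

With this config, calling serialize_with_abbreviations on the sentence above should return bytes instead of raising the TypeError. As far as I can tell, the abbreviations are then stored as plain dicts rather than Span objects, so code that expects Spans in doc._.abbreviations would need adjusting.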

JohnGiorgi commented 1 year ago

Also worth asking, is there any reason for make_serializable not to default to True?

MichalMalyska commented 11 months ago

> Also worth asking, is there any reason for make_serializable not to default to True?

If I remember correctly, it was added while there was a lot of weirdness with multiprocessing in spaCy (https://github.com/allenai/scispacy/pull/368), so I set it to False by default just in case. This is a very funny place to meet again, @JohnGiorgi, btw.

JohnGiorgi commented 11 months ago

Haha good to hear from you again @MichalMalyska!