Error when trying to use `nlp.pipe` with `n_process` > 1

DayalStrub commented 3 years ago

Intro

I am getting TypeError: can not serialize 'BaseTextRank' object when trying to use spaCy's multiprocessing in nlp.pipe with a textrank pipeline component.

Sorry if this is a known/expected feature/limitation - I couldn't find anything by searching the repo. I generally find (spaCy's) multiprocessing a bit temperamental anyhow, but this seems to just not work.

PS. thanks for all the great work on the package!

Environment

Ubuntu 18.X (AWS DL AMI), Python 3.8 (via conda/mamba), pytextrank installed via pip, through conda - do let me know if you need more info.

Reproducible example - hopefully

import spacy
import pytextrank

import en_core_web_sm

nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);

txt = """
The Old Testament of the King James Bible
The First Book of Moses:  Called Genesis
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.
1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.
...
"""

data = []
for i in range(50):
    data.append((txt, {"doc_id": i}))

keys = []

for doc, context in nlp.pipe(data, as_tuples=True, n_process=-1):  # NOTE: throws an error, then hangs; works with n_process=1
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    keys.append(out)
# pd.DataFrame(keys).head()

keys

ceteri commented 3 years ago

Thank you @DayalStrub - This is good. I don't recall that we've had any cases using the multi-processor option in spaCy previously.

To confirm, when running Language.pipe() with n_process set to anything other than its default value of 1, using this example:

import pytextrank
import spacy
import en_core_web_sm

txt = """To return to my trees. This, as you know, is something that I do often. But sometimes, I even surprise myself with how powerful the pull of trees can be. Take this latest tree. I walked out onto this huge expanse of hard sand and then headed directly across to where there was this amazing old fir tree whose growth seems to have split the sandstone, its top is blown off, and its roots getting salted with every winter storm. I could not easily capture its grandness in one image so I pieced a few together and relied mostly on a short video for painting references. After all the little plein air paintings, this is my first studio painting from Hornby Island. Well, let’s see what we have shall we?"""

nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);
doc = nlp(txt)

data = [
    (txt, {"doc_id": i})
    for i in range(5)
    ]

## `n_process=-1` throws exception
## `n_process=1` works

for doc, context in nlp.pipe(data, as_tuples=True, n_process=1): 
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    print(out)

Then pytextrank causes an exception to be thrown:

Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/language.py", line 2007, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/language.py", line 2007, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "spacy/tokens/doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/util.py", line 1134, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'BaseTextRank' object

So we need to make the pytextrank base class, and its per-algorithm subclasses, serializable. This would also be needed if we ever wanted to run distributed, say on a Ray cluster.
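
As a rough illustration of what "serializable" means here, below is a minimal sketch using only the standard library. It assumes the failure comes from a plain-data serializer being handed a custom object, which is what the traceback above suggests. `json` stands in for msgpack purely for illustration, and `RankedPhrase`, `ToyRanker`, `to_plain` and `from_plain` are hypothetical names, not part of pytextrank or spaCy.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RankedPhrase:
    """Hypothetical stand-in for a ranked phrase (not pytextrank's own class)."""
    text: str
    rank: float


class ToyRanker:
    """Hypothetical stand-in for a custom object attached to a doc."""

    def __init__(self, phrases: List[RankedPhrase]) -> None:
        self.phrases = phrases

    def to_plain(self) -> dict:
        # Reduce the object to dicts, lists, strings and numbers only.
        return {"phrases": [asdict(p) for p in self.phrases]}

    @classmethod
    def from_plain(cls, data: dict) -> "ToyRanker":
        # Rebuild the object from its plain-data form.
        return cls([RankedPhrase(**p) for p in data["phrases"]])


ranker = ToyRanker([RankedPhrase("old fir tree", 0.12), RankedPhrase("hard sand", 0.08)])

# A plain-data serializer rejects the custom object itself
# (json is used here as a stand-in for msgpack).
try:
    json.dumps({"textrank": ranker})
    failed = False
except TypeError:
    failed = True

# The plain-data form round-trips without trouble.
payload = json.dumps({"textrank": ranker.to_plain()})
restored = ToyRanker.from_plain(json.loads(payload)["textrank"])
```

Whether pytextrank should expose explicit conversion methods like these, or hook into spaCy's serialization some other way, is an open design question; the sketch only shows the shape of the problem.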

ceteri commented 1 year ago

This appears to be happening in several cases in spaCy, and some of the GH issues point to using srsly https://github.com/explosion/srsly to resolve serialization issues.

elirannrich commented 1 year ago

Any update on this bug? Happy to help if needed.

DerwenAI / pytextrank