explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Use in Apache Spark / English() object cannot be pickled #125

Closed aeneaswiener closed 7 years ago

aeneaswiener commented 9 years ago

For spaCy to work out of the box with Apache Spark, the language models need to be picklable so that they can be initialised on the master node and then sent to the workers.

This currently doesn't work with plain pickle, failing as follows:

>>> from __future__ import unicode_literals, print_function
>>> from spacy.en import English
>>> import pickle
>>> nlp = English()
>>> nlpp = pickle.dumps(nlp)
Traceback (most recent call last):
[...]
TypeError: can't pickle Vocab objects

Apache Spark ships with a package called cloudpickle, which is meant to support a wider set of Python constructs. Serialisation with cloudpickle appears to succeed, but the deserialised object then crashes with a segmentation fault when used:

>>> from pyspark import cloudpickle
>>> pickled_nlp = cloudpickle.dumps(nlp)
>>> nlpp = pickle.loads(pickled_nlp)
>>> nlpp('test text')
Segmentation fault

By default Apache Spark serialises data with pickle, but it can be configured to use cloudpickle instead.
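For example, something along the following lines should switch the data serialiser (a sketch only; it assumes your PySpark version exposes CloudPickleSerializer in pyspark.serializers):

from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

# Tell Spark to serialise data with cloudpickle instead of plain pickle.
sc = SparkContext(master="local[*]", appName="spacy-pickle-test",
                  serializer=CloudPickleSerializer())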

A feasible workaround for now is to load the language models lazily on the worker nodes:

from spacy.en import English

nlp = None

def lazyloaded_nlp(s):
    # Initialise the model on first use, so that each Spark worker
    # loads English() locally instead of receiving it over the wire.
    global nlp
    if nlp is None:
        nlp = English()
    return nlp(s)

The above works. Nevertheless, I wonder whether it would be possible to make the English() object pickleable? If it isn't too difficult on your end, having pickleable language models would provide a better out-of-the-box experience for Apache Spark users.
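For illustration, here is a minimal sketch of how the workaround can be used from a Spark job (hypothetical example; sc is assumed to be the SparkContext created on the driver):

texts = sc.parallelize(["First document.", "Second document."])

# Only the function is shipped to the workers; each worker loads
# English() once, on its first call to lazyloaded_nlp.
token_counts = texts.map(lambda s: len(lazyloaded_nlp(s))).collect()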

ines commented 7 years ago

Closing this and making #1045 the master issue. Work in progress for spaCy v2.0!

easimadi commented 6 years ago

Dear all, is this issue resolved with the release of spaCy 2.0? How can I use spaCy in Spark?

Thanks for your help. ea.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.