aeneaswiener closed this issue 7 years ago
Closing this and making #1045 the master issue. Work in progress for spaCy v2.0!
Dears, is this issue resolved with the release of spaCy 2.0? How can I use spaCy in Spark?
Thanks for your help. ea.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
For spaCy to work out of the box with Apache Spark, the language models need to be picklable so that they can be initialised on the master node and then sent to the workers.
This currently doesn't work with plain pickle, failing as follows:
Apache Spark ships with a package called cloudpickle, which is meant to support a wider set of Python constructs, but serialisation with cloudpickle also fails, resulting in a segmentation fault:
By default Apache Spark uses pickle, but it can be told to use cloudpickle instead.
Currently a feasible workaround is lazy loading of the language models on the worker nodes:
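The lazy-loading workaround can be sketched as follows. This is a minimal sketch, not the exact code from the thread: the names `lazy_load` and `tokenize_partition` are hypothetical, the model name `"en"` is an assumption, and the actual spaCy/Spark calls are left in comments so the pattern itself stays self-contained.

```python
# Lazy loading on the workers: cache the model per process so each
# executor loads it once on first use, instead of pickling it on the
# driver and shipping it to the workers.

_MODEL_CACHE = {}

def lazy_load(name, loader):
    """Return a cached object, calling loader() only on first use."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = loader()
    return _MODEL_CACHE[name]

def tokenize_partition(lines):
    # On a worker this would be something like:
    #   nlp = lazy_load("en", lambda: spacy.load("en"))
    # Here a trivial loader stands in for spacy.load:
    nlp = lazy_load("split", lambda: str.split)
    for line in lines:
        yield nlp(line)

# Driver side (sketch): rdd.mapPartitions(tokenize_partition)
```

Because the deferred `loader` runs inside the worker process, only the small closure is pickled and shipped by Spark; the model itself never crosses the driver/worker boundary.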
The above works. Nevertheless, I wonder whether it would be possible to make the English() object picklable? If not too difficult on your end, having the language models picklable would give Apache Spark users a better out-of-the-box experience.