explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License
726 stars 60 forks

Multi-process doesn't work #34

Closed LasershowJack closed 1 year ago

LasershowJack commented 4 years ago

spaCy version: 2.2.4
spacy-stanza version: 0.2.1
stanza version: 1.0.1

It is not possible to use multiple processes in the pipeline while using the Russian model.

import spacy
import stanza
from spacy_stanza import StanzaLanguage

stanza.download("ru")

snlp = stanza.Pipeline(lang="ru")
ru_nlp = StanzaLanguage(snlp)

text = ["это какой-то русский текст"] * 100

for doc in ru_nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)

Running the example with n_process=1 works; with n_process greater than 1, however, nothing gets printed, no error is raised, and the script doesn't terminate.

adrianeboyd commented 4 years ago

I think this is related to https://github.com/explosion/spaCy/issues/4667, which turns out to be a problem with multiprocessing and pytorch. See my comment here: https://github.com/explosion/spaCy/issues/4667#issuecomment-557470711

The following works for me:

import spacy
import stanza
from spacy_stanza import StanzaLanguage
import torch

torch.set_num_threads(1)

snlp = stanza.Pipeline(lang="en")
nlp = StanzaLanguage(snlp)

text = ["This is a sentence."] * 100

for doc in nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)

BramVanroy commented 3 years ago

An update with respect to spaCy v3, using the following snippet.

import torch
import spacy_stanza

torch.set_num_threads(1)

nlp = spacy_stanza.load_pipeline("en")

text = ["This is a sentence."] * 100

for doc in nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)

This works on Linux, but produces 100 warnings:

[2021-06-29 13:33:19,533] [WARNING] [W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing.

On Windows this does not work at all due to pickling errors. That is perhaps to be expected, given the age-old spawn-vs-fork difference in how new processes are started.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1484, in pipe
    for doc in docs:
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1520, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CNNClassifier.__init__.<locals>.<lambda>'

I fear the Windows issue cannot be solved easily, but what about the warning on Linux? Can we do anything about that?
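As an illustrative sketch of the spawn-vs-fork difference mentioned above (not part of the original report): on Linux the default start method is fork, so child processes inherit the parent's memory and the loaded pipeline is never pickled; on Windows only spawn is available, so every object handed to a child must survive a pickle round-trip.

```python
import multiprocessing as mp

# "fork" (Linux default): the child inherits the parent's memory, so the
# already-loaded pipeline never has to be pickled.
# "spawn" (the only option on Windows, and the macOS default since Python
# 3.8): a fresh interpreter starts, and everything passed to the child must
# be picklable -- which is where stanza's internal lambda trips up.
print(mp.get_all_start_methods())
print(mp.get_start_method())
```

On spawn platforms the calling code must also be guarded with `if __name__ == "__main__":`, since the child re-imports the main module.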

adrianeboyd commented 3 years ago

Hi, it looks like the pickle issue is coming from stanza, not spacy-stanza, so there's nothing we can do directly here.

If you're not using the vector hooks, you can filter/ignore that warning. It's currently a logger warning in v3.0, but in v3.1 we're going back to 100% Python warnings, because it turned out that also using logger warnings made everything even harder to filter and configure. We had hoped that using the logger would make filtering easier, but it turned out to be a headache for warnings in particular. There will also be a new helper function to make it easier to filter warnings, plus some automatic "once" filtering of noisy warnings; see #7807, #8385.
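As a sketch of the filtering mentioned above (assuming spaCy v3.1+, where W109 is a plain Python warning; the message text is taken from the log earlier in this thread), the standard warnings machinery can silence it:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                         # record everything...
    warnings.filterwarnings("ignore", message=r"\[W109\]")  # ...except W109
    warnings.warn("[W109] Unable to save user hooks while serializing the doc.")
    warnings.warn("Some other warning")

# Only the non-W109 warning is recorded.
print(len(caught))
```

Outside a demo, the single `filterwarnings("ignore", message=r"\[W109\]")` call at startup is enough; the `catch_warnings` block here just makes the effect observable.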

BramVanroy commented 3 years ago

I think it's not even a stanza issue, but a torch-on-Windows issue (similar to how multiprocessing dataloaders are a problem on that platform). That said, I don't think this is a spacy-stanza problem any more than a stanza problem: it's spaCy's nlp.pipe that tries to spawn the new processes (not stanza), which then leads to the pickling error. But as I said: more a torch issue than anything else.

Okay, interesting to see that the warnings did not turn out as expected. Will the errors also be removed? I use those in my library, so it would be good to know beforehand if they will stop working.

adrianeboyd commented 3 years ago

The error makes it look like a stanza issue, at least the initial problem. I don't know if it would work if you fixed the unpickleable lambdas, though, or if you'd run into further issues. We've had to modify similar code in spacy for multiprocessing with spawn.

The errors aren't going to change; this is only about warnings.

BramVanroy commented 3 years ago

Yes, I understand that the unpicklable object is in stanza. But what I wanted to say is that I would understand if the people over at stanza said "well, in our library we never need to run multiprocessing this way, so not our problem", i.e. the issue is "caused" by us running multiprocessing on code that was never intended for it. From prior experience they're a lot nicer than that, though, so I'll open an issue over there and maybe they'll have time to look into it. If you have any more information about fixing the lambdas, that would be helpful - but I assume it's simply that lambda x: do(x) is not picklable.
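For illustration (a sketch, not from the thread; `scale` is a made-up function): pickle stores functions by reference (module plus qualified name), so a named module-level function, or a functools.partial over one, round-trips fine, while a lambda cannot be looked up again. This matches the CNNClassifier error in the traceback above; the usual fix is to replace the lambda with a named function or a partial.

```python
import pickle
from functools import partial

def scale(x, factor):
    # A named, module-level function: picklable by reference.
    return x * factor

# functools.partial over a named function also survives a round-trip.
restored = pickle.loads(pickle.dumps(partial(scale, factor=2)))

# A lambda does not: its qualified name is "<lambda>", which cannot be
# resolved again at unpickling time.
try:
    pickle.dumps(lambda x: x * 2)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_picklable = False

print("lambda picklable:", lambda_picklable)
```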

Okay, thanks for the information concerning errors.

Okay to leave this open until I get some info back from the Stanford people?

BramVanroy commented 3 years ago

The Windows-specific issue that I highlighted has been fixed in the current dev branch of stanza, and should be part of the next release (1.3.0).

For me this topic can be closed.

joakimwar commented 3 years ago

An update with respect to spaCy v3, using the following snippet.

import torch
import spacy_stanza

torch.set_num_threads(1)

nlp = spacy_stanza.load_pipeline("en")

text = ["This is a sentence."] * 100

for doc in nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)

This works on Linux, but produces 100 warnings

This does not work for me on macOS.

adrianeboyd commented 3 years ago

@joakimwar : Please open a new issue with more details.

adrianeboyd commented 1 year ago

Just going through some older issues, and it sounds like the original issue with multiprocessing was resolved.

Please feel free to reopen if you're still running into issues!