Closed LasershowJack closed 1 year ago
I think this is related to https://github.com/explosion/spaCy/issues/4667, which turns out to be a problem with multiprocessing and pytorch. See my comment here: https://github.com/explosion/spaCy/issues/4667#issuecomment-557470711
The following works for me:
```python
import spacy
import stanza
from spacy_stanza import StanzaLanguage
import torch

torch.set_num_threads(1)

snlp = stanza.Pipeline(lang="en")
nlp = StanzaLanguage(snlp)

text = ["This is a sentence."] * 100
for doc in nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)
```
An update with respect to spaCy v3, with the following snippet.
```python
import spacy_stanza
nlp = spacy_stanza.load_pipeline("en")

import torch
torch.set_num_threads(1)

text = ["This is a sentence."] * 100
for doc in nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)
```
This works on Linux, but it produces 100 warnings:
```
[2021-06-29 13:33:19,533] [WARNING] [W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing.
```
On Windows this does not work at all due to pickling errors. That is perhaps to be expected, given the age-old spawn-vs-fork difference in how new processes are started.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1484, in pipe
    for doc in docs:
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1520, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CNNClassifier.__init__.<locals>.<lambda>'
```
I fear the Windows issue cannot be solved easily, but what about the warning on Linux? Can we do anything about that?
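As an aside on the spawn-vs-fork point: on spawn-based platforms, child processes re-import the launching script and pickle everything they are handed, so multiprocessing code generally needs a main guard and module-level callables. A minimal generic sketch (plain `multiprocessing`, nothing spacy-stanza specific):

```python
import multiprocessing as mp


def double(x):
    # Must be a module-level function: with the "spawn" start method the
    # work callable is pickled, and lambdas / locally defined functions
    # can't be pickled (the same failure mode as in the traceback above).
    return x * 2


if __name__ == "__main__":
    # Without this guard, spawn platforms (Windows, macOS by default)
    # re-import the module in each child and try to start workers again.
    with mp.Pool(processes=2) as pool:
        print(pool.map(double, [1, 2, 3]))  # [2, 4, 6]
```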
Hi, it looks like the pickle issue is coming from stanza, not spacy-stanza, so there's nothing we can do directly here.
If you're not using the vector hooks, you can filter/ignore that warning. It's currently a logger warning in v3.0, but in v3.1 we're going back to 100% Python warnings, because it turned out that also using logger warnings made everything even harder to filter/configure. We were hoping that using the logger would make filtering easier, but it turned out to be a headache for warnings in particular. There will also be a new helper function to make it easier to filter warnings, plus some automatic "once" filtering of noisy warnings; see #7807 and #8385.
I think it's not even a stanza issue, but a torch-on-Windows issue (similar to how multiprocessing dataloaders are a problem on that platform). That being said, I don't think this is more a spacy-stanza problem than a stanza problem: it's spaCy's nlp.pipe that tries to spawn new processes (not stanza), which then leads to the pickling error. But as I said: more a torch issue than anything else.
Okay, interesting to see that the warnings did not turn out as expected. Will the errors also be removed? I use those in my library, so it would be good to know beforehand if they will stop working.
The error makes it look like a stanza issue, at least the initial problem. I don't know whether it would work if you fixed the unpicklable lambdas, though, or if you'd run into further issues. We've had to modify similar code in spaCy for multiprocessing with spawn.
The errors aren't going to change; this only concerns warnings.
Yes, I understand that the unpicklable object is in stanza. But what I wanted to say is that I would understand if the people over at stanza said "well, in our library we never need to run multiprocessing this way, so not our problem", i.e. the issue is "caused" by us trying to run MP on code that was not intended for that. But from prior experience they're a lot nicer than that, so I'll open an issue over there and maybe they have the time to look into it. If you have any more information about solving the lambdas, that'd be helpful, but I assume it's just that `lambda x: do(x)` is not picklable.
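To illustrate the pickling point with a toy example (nothing stanza-specific):

```python
import pickle

double_lambda = lambda x: x * 2


def double(x):
    return x * 2


# Functions are pickled by qualified name, which fails for a lambda
# because "<lambda>" can't be looked up on its module. Locally defined
# functions fail similarly (the AttributeError in the traceback above).
try:
    pickle.dumps(double_lambda)
    print("lambda pickled")
except (pickle.PicklingError, AttributeError):
    print("lambda is not picklable")

# A module-level def pickles fine, because it can be found by name.
pickle.dumps(double)
print("module-level function pickles fine")
```

This is why the usual fix is to replace lambdas with named module-level functions in anything that gets shipped to worker processes.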
Okay, thanks for the information concerning errors.
Okay to leave this open until I get some info back from the Stanford people?
The Windows specific issue that I highlighted has been fixed in current dev branch of stanza. Should be part of the next release (1.3.0).
For me this topic can be closed.
> An update with respect to spaCy v3, with the following snippet.
>
> ```python
> import spacy_stanza
> nlp = spacy_stanza.load_pipeline("en")
> import torch
> torch.set_num_threads(1)
> text = ["This is a sentence."] * 100
> for doc in nlp.pipe(text, batch_size=50, n_process=2):
>     print(doc.is_parsed)
> ```
>
> This works on Linux, but it produces 100 warnings
This does not work for me on macOS.
@joakimwar: Please open a new issue with more details.
Just going through some older issues, and it sounds like the original issue with multiprocessing was resolved.
Please feel free to reopen if you're still running into issues!
spaCy version: 2.2.4
spacy-stanza version: 0.2.1
stanza version: 1.0.1
It is not possible to use multiple processes in the pipeline while using the Russian model.
Running the example with `n_process=1` works; however, with `n_process` greater than 1, nothing gets printed, there are no errors, and the script doesn't terminate.