Closed by dimitarsh1 3 years ago.
Seems to be related to https://github.com/explosion/spaCy/issues/5220.

A quick workaround is to change only the first line:

`nlpD = spacy_udpipe.load(lang).tokenizer`

This should do the trick, as it will call `UDPipeTokenizer.pipe` (creating a `Doc` with the same attributes as `Language.pipe`, just bug-free). If you want to use custom pipes afterward, you can call them on the resulting `Doc` objects (once created, these are modified in-place anyway), for now.
I will look into a proper fix soon, hopefully.
Great,
Thanks a lot.
Kind regards, Dimitar
The issue happens with `StringStore.add`.

When `Language.__call__` is called, a `Doc` is created using `UDPipeTokenizer.__call__`, in which token attributes are added to the `StringStore` object (see these lines). The same code runs when `Language.pipe` is called, but via spaCy's custom multiprocessing code. This is where the havoc begins: the Python multiprocessing objects and code interact with the SWIG wrapper code for the UDPipe model (i.e. the UDPipe Python bindings that expose the underlying C++ NLP model), and somehow the `StringStore` object ends up without all the string values it should contain (missing lemmas, tags, dependencies, etc.).
Interestingly, when the same `UDPipeLanguage` object processes the same texts via `__call__` first and then via `pipe`, everything works fine, as the `StringStore` object is already prepopulated with all the string values.
Unfortunately, neither multithreading nor UDPipeTokenizer multiprocessing speeds up execution.
Thanks. Could that relate to the other issue "[E190] Token head out of range"?
@dimitarsh1 you are welcome. `[E190]` should not be happening when using `__call__` since version 0.2.1, or with `pipe` since version 0.3.1.
Unfortunately, without the exact input that causes this, it is very difficult to conclude anything with certainty.
Related: on the most recent version of spacy_udpipe, `pipe` does not work with `n_process > 1` on Windows because of `cannot pickle 'ufal.udpipe.Model' object`. It works fine on Linux by default. Evidently, the process start method is crucial here: Windows only has "spawn", which is more restrictive in terms of pickling than "fork". Linux supports both but defaults to "fork". If you call `multiprocessing.set_start_method("spawn")` on Linux, the code will also fail.
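The pickling constraint can be illustrated without UDPipe at all. This sketch uses a hypothetical `NativeModel` class holding a `threading.Lock` as a stand-in for the SWIG-wrapped `ufal.udpipe.Model`, since both wrap an unpicklable native handle:

```python
import pickle
import threading

class NativeModel:
    """Stand-in for a SWIG-wrapped object like ufal.udpipe.Model."""
    def __init__(self):
        # A lock, like a raw C++ pointer, cannot be pickled.
        self._handle = threading.Lock()

# "fork" copies the parent's memory, so no pickling is needed; "spawn"
# (the only option on Windows) must pickle everything sent to a worker.
try:
    pickle.dumps(NativeModel())
    picklable = True
except TypeError as err:
    picklable = False
    print(err)  # e.g. "cannot pickle '_thread.lock' object"

print("picklable:", picklable)
```

This is why the same `nlp.pipe(..., n_process=2)` call fails on Windows (and on Linux after `set_start_method("spawn")`) but works under the default "fork" start method.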
Should I make a separate issue for this? Might be difficult to solve this one, though, and perhaps impossible if you have no control over the UDPipe model directly.
```
Traceback (most recent call last):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\Scripts\parse-as-conll-script.py", line 33, in <module>
    sys.exit(load_entry_point('spacy-conll', 'console_scripts', 'parse-as-conll')())
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 179, in main
    parse(cargs)
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 35, in parse
    conll_str = parser.parse_file_as_conll(
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 81, in parse_file_as_conll
    return self.parse_text_as_conll(text, **kwargs)
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 135, in parse_text_as_conll
    for doc_idx, doc in enumerate(self.nlp.pipe(text, n_process=n_process)):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1484, in pipe
    for doc in docs:
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1520, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'ufal.udpipe.Model' object

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
```
Hi @BramVanroy, thanks for reporting. This should be fixed by #39 soon.
Awesome! Thanks.
When running spacy_udpipe with `n_process=X` enabled, it gives an error. The code I run is:
The error is:
When I run the same code, but without the `n_process` argument, everything is fine: no errors, the text is processed, and so on.
It seems to be related to a spaCy issue, but I couldn't find a solution: https://stackoverflow.com/questions/60152152/spacy-issue-with-vocab-or-stringstore
spaCy version: 2.2.4
Python version: 3.8.3
spacy-udpipe version: 0.3.0
OS: Debian 10
Thanks. Cheers, Dimitar