TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Multiprocessing error #21

Closed dimitarsh1 closed 3 years ago

dimitarsh1 commented 4 years ago

When running spacy_udpipe with n_process=X enabled, it raises an error. The code I run is:

nlpD = spacy_udpipe.load(lang)
nlps = list(nlpD.pipe(sentences, n_process=4))
for doc in nlps:
    for token in doc:
        lemma = token.lemma_

The error is:

  File "token.pyx", line 871, in spacy.tokens.token.Token.lemma_.__get__
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '14027581762467160941'. This usually refers to an issue with the `Vocab` or `StringStore`."

When I run the same code without the n_process argument, everything is fine: no errors, and the text is processed as expected.

It seems to be related to a spaCy issue, but I couldn't find a solution: https://stackoverflow.com/questions/60152152/spacy-issue-with-vocab-or-stringstore

spaCy version: 2.2.4
Python version: 3.8.3
spacy-udpipe version: 0.3.0
OS: Debian 10

Thanks. Cheers, Dimitar

asajatovic commented 4 years ago

Seems to be related to https://github.com/explosion/spaCy/issues/5220.

A quick workaround is to change only the first line: nlpD = spacy_udpipe.load(lang).tokenizer.

This should do the trick, as it will call UDPipeTokenizer.pipe (creating a Doc with the same attributes as Language.pipe, just bug-free). If you want to use custom pipes afterward, you can, for now, call them on the resulting Doc objects (once created, these are modified in-place anyway).

I will look into a proper fix soon, hopefully.
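The pattern described above (tokenize with UDPipeTokenizer.pipe, then run any custom pipes over the resulting Doc objects yourself) can be sketched with plain-Python stand-ins. All names below are illustrative, not spacy-udpipe's API: fake_tokenizer_pipe plays the role of UDPipeTokenizer.pipe, and the functions in custom_pipes play the role of custom pipeline components that modify each doc in place.

```python
def fake_tokenizer_pipe(texts):
    # Stand-in for UDPipeTokenizer.pipe: yields one "doc" per input text.
    # (The real method yields spaCy Doc objects with lemmas, tags, etc.)
    for text in texts:
        yield {"text": text, "annotations": []}

def lowercase_pipe(doc):
    # Illustrative custom pipe: annotates the doc in place.
    doc["annotations"].append(doc["text"].lower())
    return doc

def length_pipe(doc):
    # Another illustrative custom pipe.
    doc["annotations"].append(len(doc["text"]))
    return doc

custom_pipes = [lowercase_pipe, length_pipe]

docs = []
for doc in fake_tokenizer_pipe(["Hello World", "Hi"]):
    # Call each custom pipe on the resulting doc, as suggested above.
    for pipe in custom_pipes:
        doc = pipe(doc)
    docs.append(doc)
```

The same loop structure applies unchanged when the stand-ins are replaced by the real tokenizer and real pipeline components.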

dimitarsh1 commented 4 years ago

Great,

Thanks a lot.

Kind regards, Dimitar

asajatovic commented 4 years ago

The issue happens with StringStore.add. When Language.__call__ is called, a Doc is created via UDPipeTokenizer.__call__, which adds the token attributes to the StringStore object (see these lines). The same code runs when Language.pipe is called, but through spaCy's custom multiprocessing code. This is where the havoc begins: the Python multiprocessing code interacts with the SWIG wrapper code for the UDPipe model (i.e. the UDPipe Python bindings that expose the underlying C++ NLP model), and somehow, in the end, the StringStore object does not contain all the string values it should (missing lemmas, tags, dependencies, etc.).

Interestingly, when the same UDPipeLanguage object processes the same texts via __call__ first and then via pipe, everything works fine, as the StringStore object is already prepopulated with all the string values.
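The mechanism can be seen with spaCy's StringStore directly: a hash can only be resolved back to its string if that string was added to the store first, which is exactly what goes missing in the worker processes. A minimal sketch using spaCy's public API:

```python
from spacy.strings import StringStore

store = StringStore()
h = store.add("walk")        # store the string, get back its 64-bit hash
assert store[h] == "walk"    # resolving works once the string is present

empty = StringStore()        # a store that never saw "walk"
try:
    empty[h]                 # same hash, but the string was never added
except KeyError:
    pass                     # this is the [E018] error from the traceback
```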

Unfortunately, neither multithreading nor UDPipeTokenizer multiprocessing speeds up execution.

dimitarsh1 commented 4 years ago

Thanks. Could that be related to the other issue, "[E190] Token head out of range"?

asajatovic commented 4 years ago

@dimitarsh1 you are welcome. [E190] should not be happening when using __call__ since version 0.2.1 or with pipe since version 0.3.1. Unfortunately, without the exact input that causes this, it is very difficult to conclude anything with certainty.

BramVanroy commented 3 years ago

Related: on the most recent version of spacy_udpipe, pipe does not work with n_process > 1 on Windows because of cannot pickle 'ufal.udpipe.Model' object. It works fine on Linux by default. Evidently, the process start method is crucial here. On Windows we only have "spawn", which is more restrictive in terms of pickling than "fork". Linux has both but defaults to "fork". If you call multiprocessing.set_start_method("spawn") on Linux, the code fails there as well.

Should I open a separate issue for this? It might be difficult to solve, though, and perhaps impossible if you have no control over the UDPipe model directly.

Traceback (most recent call last):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\Scripts\parse-as-conll-script.py", line 33, in <module>
    sys.exit(load_entry_point('spacy-conll', 'console_scripts', 'parse-as-conll')())
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 179, in main
    parse(cargs)
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 35, in parse
    conll_str = parser.parse_file_as_conll(
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 81, in parse_file_as_conll
    return self.parse_text_as_conll(text, **kwargs)
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 135, in parse_text_as_conll
    for doc_idx, doc in enumerate(self.nlp.pipe(text, n_process=n_process)):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1484, in pipe
    for doc in docs:
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1520, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'ufal.udpipe.Model' object
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

mariosasko commented 2 years ago

Hi @BramVanroy, thanks for reporting. This should be fixed by #39 soon.

BramVanroy commented 2 years ago

Awesome! Thanks.