explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Multithreading #172

Closed ELind77 closed 8 years ago

ELind77 commented 8 years ago

I've been using spaCy for a few months now (it's amazing!) and I've added it to a processing pipeline I'm working on. To increase throughput, I've been using JoinableQueue and multiprocessing.Process to add parallelism. Using this with spaCy creates two difficulties:

  1. Duplication of models
  2. Duplication of the spacy StringStore

My understanding is that workers created with Python multiprocessing receive their arguments by value, which means the tagging/parsing models are duplicated. You've done an incredible job making the models efficient and compact, but now I'm duplicating them all over the place and wasting your hard-won efficiency. Is there any way that the models can be used as some kind of shared-memory object?

Similarly, it would be nice if the StringStore could be shared across workers somehow. I'm looking into creating a generative language model based on syntactic ngrams for text classification and an efficient storage for all those ngrams would be really nice (also, if spacy can do it I don't have to work in Java :P ).

I haven't worked in Cython directly before, but having browsed the source of this project for the last couple of months I'm not afraid to go there, so long as I don't have to port my entire codebase (e.g. just write the main runner in Cython and leave the rest in Python).

-- Eric

P.S. Probably should have started with this, but is spacy thread-safe?

jonathankoren commented 8 years ago

+1 for the threadsafe question.

ELind77 commented 8 years ago

After spending more time with the problem, I found the Python locking primitives in the multiprocessing library, and I also found the parallel example that's hidden away in the source, which uses joblib. So I may base something on that. But I'm running this project on Apache Storm, it will run continuously, and if each StringStore keeps growing as new word forms come in (probably hashtags), it's going to cause memory problems. If I have to, I can preprocess to mitigate that, but it would be nice not to have to worry about it.

honnibal commented 8 years ago

We definitely want to fix this.

As far as workarounds that you could do immediately go...

1) You could subclass Vocab and StringStore, in Cython. If you do this you might go most of the way towards designing a solution the library could host.

2) The data structure that's growing is StringStore._map. It's an instance of this class: https://github.com/syllog1sm/preshed/blob/master/preshed/maps.pyx#L8. This is marked public in Cython, so you should be able to replace it, modify it, etc. You could try to subclass it, or you could periodically prune it.

3) You could pre-process the text, so that unseen words are replaced by a generic key. This feels like a terrible solution, but it's the least intrusive to spaCy's internals, and it should actually be quite efficient, since the tokenizer is so fast. I'm thinking of something like this (untested code that I just typed out and have not run):

from spacy.strings import hash_string
from spacy.en import English

def process_batch(nlp, texts):
    # A tokenizer-only pipeline for the cheap pre-pass.
    tokenizer = English(parser=False, tagger=False, entity=False).tokenizer
    for text in texts:
        pre_tokens = tokenizer(text)
        munged = ''
        for token in pre_tokens:
            key = hash_string(token.text)
            if nlp.vocab.strings._map[key] is None:
                # Unseen string: swap in your own generic replacement, so it
                # never gets interned into the main pipeline's StringStore.
                munged += my_string_process(token.text) + token.whitespace_
            else:
                munged += token.text + token.whitespace_
        yield nlp(munged)

A couple of subtleties here:

As I said --- this will all be fixed. But I figured I could at least suggest some workarounds.

honnibal commented 8 years ago

I've now got the GIL freed for the entire parser. This allows us to multi-thread over whole documents: https://github.com/honnibal/spaCy/blob/rethinc2/spacy/syntax/parser.pyx#L128

The first test is getting a 4x speed-up from 8 threads, with almost no additional memory use.

ELind77 commented 8 years ago

@honnibal

This is great news! Thank you so much for taking the time to add this feature.

I think I will end up preprocessing with a simple regex to replace #tags and @mentions with HASHTAG and MENTION, to avoid growing the string store in my specific case. But the parallelization is a much better general solution for this issue.
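
Something like this (an untested sketch; the placeholder names are mine):

import re

HASHTAG_RE = re.compile(r'#\w+')
MENTION_RE = re.compile(r'@\w+')

def normalise(text):
    # Collapse hashtags and mentions to fixed placeholders so the
    # StringStore never sees the long tail of unique forms.
    text = HASHTAG_RE.sub('HASHTAG', text)
    return MENTION_RE.sub('MENTION', text)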

-- Eric

honnibal commented 8 years ago

I've also cut the loading time down to about 15 seconds, so things should be much nicer on that front too.

Depending on your workflow, you might want to try sending batches of documents out to a worker process, which then loads the pipeline, completes the work, and sends back the analyses as byte-strings. The byte-strings can then be loaded by the original process.
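
For what it's worth, the shape of that pattern is roughly the following. This is only a sketch, not what spaCy does internally: it assumes the Doc.to_bytes() / Doc(vocab).from_bytes() round-trip available in current spaCy releases, and the 'en' model name used elsewhere in this thread.

import multiprocessing

import spacy
from spacy.tokens import Doc

_nlp = None

def _init_worker():
    # Each worker process loads its own copy of the pipeline once.
    global _nlp
    _nlp = spacy.load('en')

def _parse_batch(texts):
    # Do the expensive work in the worker and ship analyses back as bytes.
    return [_nlp(text).to_bytes() for text in texts]

def parse_with_workers(batches, n_procs=4):
    # The parent keeps a vocab around to rebuild Doc objects from the bytes.
    parent_nlp = spacy.load('en')
    with multiprocessing.Pool(n_procs, initializer=_init_worker) as pool:
        for byte_docs in pool.imap(_parse_batch, batches):
            for byte_doc in byte_docs:
                yield Doc(parent_nlp.vocab).from_bytes(byte_doc)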

honnibal commented 8 years ago

No worries! Looking forward to getting this pushed.

I'm tempted to overload the English.__call__ function, to allow it to accept either a single text, or an iterable of texts. If it takes a single text it produces a Doc object, and if you give it an iterable of texts it will give you a generator, and support multi-threading.

I'm not sure this is a good idea, though. This sort of overloading often gets hard to work with in Python, as it makes everything fail late.

Assuming we decide against the overloading, and give it its own method, usage would be like this:

iter_texts = iter_comments(loc)
for doc in nlp.process_stream(iter_texts, n_threads=6, batch_size=1000):
    ...  # Do work

In other words the function takes an iterator of texts and produces an iterator of documents. Internally it buffers 1000 texts before working on them with 6 threads.

Each component in the pipeline (the tokenizer, tagger, parser and NER) will operate on the stream. They'll each accumulate a buffer and then send it to threads, but their output will be a generator.
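
In pure Python, that contract looks something like the sketch below. It's only there to show the shape of the generator-in, generator-out interface; in spaCy the actual work happens in Cython with the GIL released, so a plain ThreadPoolExecutor stands in for that here.

import itertools
from concurrent.futures import ThreadPoolExecutor

def pipe(stream, work_on_text, batch_size=1000, n_threads=6):
    # Take an iterator of texts, buffer batch_size of them, process the
    # batch on n_threads, and yield results in the original order.
    stream = iter(stream)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while True:
            batch = list(itertools.islice(stream, batch_size))
            if not batch:
                break
            for result in pool.map(work_on_text, batch):
                yield result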

honnibal commented 8 years ago

Multi-threading is pushed now. The docs are updated; the tutorial is yet to be written. Full example here: https://github.com/spacy-io/spaCy/blob/master/examples/parallel_parse.py#L48
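
In outline, usage looks something like this (a sketch only; the exact keyword names may differ between spaCy versions, and the input file is made up):

from spacy.en import English

nlp = English()
texts = (line.strip() for line in open('comments.txt'))  # any iterator of strings
for doc in nlp.pipe(texts, batch_size=10000, n_threads=4):
    # Docs come back fully parsed, in the same order as the input texts.
    pass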

The multi-threading is working super well so far. On a batch of 10,000 documents, with 1 thread we're getting about 35ms per document for the full pipeline. With 4 threads we're seeing about 10ms — so, close to full efficiency.

I'm also finding working with generators to be super easy. I can recommend the toolz library highly. We should also be able to play nicely with the asyncio stuff in Python 3.5, although I haven't quite gotten my head around it yet.

honnibal commented 8 years ago

Some numbers on a larger job --- parsing about 10GB of text, from one month of comments from the Reddit comment corpus. The machine has 92 cores.

Processes   Threads per process   Time
64          2                     99m
32          4                     92m

We'll run the rest of the experiments to figure out what the best balance is, and probably make some adjustments. But it's already looking really great. If we can run at, say, 32 processes and utilize all 92 cores at full efficiency, we've effectively cut the memory use to roughly a third. I'm running the 16-process, 8-thread experiment now. I expect the efficiency will start to drop here. The I/O, decompression, JSON parsing, tokenization, tagging and queueing are all done while holding the GIL, so only one thread can be used while the process is setting up the work. So to keep the machine busy we have to have lots of processes, so that some are always parsing.
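
The shape of those runs, in sketch form (the paths and sharding here are made up; it's just to show the split between processes and threads):

import multiprocessing

from spacy.en import English

def parse_shard(paths, n_threads):
    # Each process loads its own pipeline and keeps n_threads busy
    # inside nlp.pipe while it streams through its shard of files.
    nlp = English()
    for path in paths:
        texts = (line.strip() for line in open(path))
        for doc in nlp.pipe(texts, batch_size=10000, n_threads=n_threads):
            pass  # write the analysis out here

def run(all_paths, n_procs=32, n_threads=4):
    shards = [all_paths[i::n_procs] for i in range(n_procs)]
    procs = [multiprocessing.Process(target=parse_shard, args=(shard, n_threads))
             for shard in shards]
    for p in procs:
        p.start()
    for p in procs:
        p.join()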

ELind77 commented 8 years ago

@honnibal this is really exciting! I'm so glad that you've been able to work on this.

I seem to remember from reading the spacy commit messages a few months back (yes, I read the commits sometimes :P ) that you switched to ujson for the extra speed boost (I use it too). Since ujson is a C library, would it be possible to use it without the GIL? I won't pretend to understand the threading issues here but it seemed worth pointing out.

Also, this is probably premature optimization, but you also mentioned I/O, and I don't know where that is in the spaCy code, but it might be worth making any I/O you have to do asynchronous, even if it remains single-threaded.

Also, I really enjoy generators too; I was actually inspired by the gensim goal of memory independence and try to model most of what I write in the same way. I've also recently become fascinated with coroutines. I'll take a look at toolz.

Thanks again!

-- Eric

nhynes commented 8 years ago

Is there any way to also pass metadata through the pipe like, for instance, a document id?

honnibal commented 8 years ago

Let's say you have a generator of (doc_id, text) tuples. If you use the itertools.tee function, you can split this into two generators:

gen1, gen2 = itertools.tee(id_and_text)

Both gen1 and gen2 will yield identical (doc_id, text) pairs. You can then split these into just the ID and just the text:

ids = (id_ for (id_, text) in gen1)
texts = (text for (id_, text) in gen2)

Then you can use itertools.izip to rejoin the generators, after you pass the texts through .pipe. Putting it all together:

import itertools
from spacy.en import English

def gen_items():
    print("Yield 0")
    yield (0, 'Text 0')
    print("Yield 1")
    yield (1, 'Text 1')
    print("Yield 2")
    yield (2, 'Text 2')

nlp = English()
gen1, gen2 = itertools.tee(gen_items())
ids = (id_ for (id_, text) in gen1)
texts = (text for (id_, text) in gen2)
docs = nlp.pipe(texts)
for id_, doc in itertools.izip(ids, docs):
    print(id_, doc.text)

nhynes commented 8 years ago

Ah, thank you! I had assumed that the ordering of the documents returned by pipe was non-deterministic.

honnibal commented 8 years ago

It is? The documents are returned in the same order as the texts.

Mustyy commented 8 years ago

@honnibal Hey Matthew, hope all is well, and great news on the press release.

So I'm running the parallel parse script and I end up with Batch 0, 1, 2, and so on. When I check my output directory I see .bin files; is this normal? My input file is enwiki-20160720-pages-articles.xml.bz2

sudo python ParallelParse.py /home/ubuntu/mindlab/WikiAnalysis/opt/enwiki-20160720-pages-articles.xml.bz2 /home/ubuntu/mindlab/WikiAnalysis/opt -p 64 -t 4

And is that split between processes and threads optimal? Furthermore, will this script alone run the whole spaCy pipeline over my wiki dump, or will I have to run merge_text.py, train_word2vec.py and sense2vec/vectors.pyx separately?

jofatmofn commented 7 years ago

A DUMMIES question; not sure if the above discussion already answers this.

My server-side code (I'm using Flask) does nlp = spacy.load('en') at start-up.

It then has to serve multiple concurrent requests from clients. Can I use the same object for all of them, as document = nlp(input_text)?
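
i.e. something along these lines (a simplified sketch; the /parse endpoint and response format are just examples):

import spacy
from flask import Flask, request, jsonify

app = Flask(__name__)
nlp = spacy.load('en')  # loaded once, at start-up

@app.route('/parse', methods=['POST'])
def parse():
    # The same nlp object is reused for every request.
    doc = nlp(request.json['text'])
    return jsonify([(token.text, token.pos_) for token in doc])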

chedched commented 6 years ago

@jofatmofn: Did you find out?

jofatmofn commented 6 years ago

@chedched, I am using that style in development (though not in production yet).

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.