Closed ELind77 closed 8 years ago
+1 for the threadsafe question.
After spending more time with the problem, I found the Python locking primitives in the multiprocessing library, and I also found the parallel example hidden away in the source that uses joblib, so I may base something on that. But I'm running this project on Apache Storm, it's going to be continuously running, and if each StringStore keeps growing as new word forms come in (probably hashtags), it's going to cause memory problems. If I have to, I can preprocess to try and mitigate that, but it would be nice if I didn't have to worry about it.
We definitely want to fix this.
As far as workarounds that you could do immediately go...
1) You could subclass Vocab and StringStore, in Cython. If you do this, you might go most of the way towards designing a solution the library could host.
2) The data structure that's growing is StringStore._map. It's an instance of this class: https://github.com/syllog1sm/preshed/blob/master/preshed/maps.pyx#L8 . This is marked public in Cython, so you should be able to replace it, modify it, etc. You could try to subclass it, or you could periodically prune it.
3) You could pre-process the text, so that unseen words are replaced by a generic key. This feels like a terrible solution, but it's the least intrusive to spaCy's internals, and it should actually be quite efficient, since the tokenizer is so fast. I'm thinking of something like this (untested code that I just typed out, and have not run):

from spacy.strings import hash_string
from spacy.en import English

def process_batch(nlp, texts):
    # Tokenizer-only pipeline, so the pre-pass stays cheap.
    tokenizer = English(parser=False, tagger=False, entity=False).tokenizer
    for text in texts:
        pre_tokens = tokenizer(text)
        munged = ''
        for token in pre_tokens:
            key = hash_string(token.text)
            if nlp.vocab.strings._map[key] is None:
                # Unseen string: substitute a generic form.
                # my_string_process is a placeholder for your own mapping.
                munged += my_string_process(token.text) + token.whitespace_
            else:
                munged += token.text + token.whitespace_
        yield nlp(munged)
A couple of subtleties here:
- The StringStore class adds keys to its dictionary automatically. (Yes, I know.) So we have to use the internal _map; this will be fixed in future.
- token.whitespace_ adds the spaces that occur in between the tokens. This is preferable to doing something like ' '.join, which will add spurious spaces between punctuation, contractions etc. If you can't recover the string like this for some reason, you can feed a list of strings to nlp.tokenizer.tokens_from_strings, and then apply the rest of the pipeline piece by piece: nlp.tagger(tokens); nlp.parser(tokens); nlp.entity(tokens).

As I said --- this will all be fixed. But I figured I could at least suggest some workarounds.
I've now got the GIL freed for the entire parser. This allows us to multi-thread over whole documents: https://github.com/honnibal/spaCy/blob/rethinc2/spacy/syntax/parser.pyx#L128
First test is getting 4x efficiency from 8 threads, with almost no additional memory use.
@honnibal
This is great news! Thank you so much for taking the time to add this feature.
I think I will end up preprocessing with a simple regex to replace #tags and @mentions with HASHTAG and MENTION, to avoid growing the string store for my specific case. But the parallelization is a much better general solution for this issue.
-- Eric
I've also cut the loading time down to about 15 seconds, so things should be much nicer on that front too.
Depending on your workflow, you might want to try sending batches of documents out to a worker process, which then loads the pipeline, completes the work, and sends back the analyses as byte-strings. The byte-strings can then be loaded by the original process.
No worries! Looking forward to getting this pushed.
I'm tempted to overload the English.__call__ function to allow it to accept either a single text, or an iterable of texts. If it takes a single text it produces a Doc object, and if you give it an iterable of texts it will give you a generator, and support multi-threading.
I'm not sure this is a good idea, though. This sort of overloading often gets hard to work with in Python, as it makes everything fail late.
Assuming we decide against the overloading, and give it its own method, usage would be like this:
iter_texts = iter_comments(loc)
for doc in nlp.process_stream(iter_texts, n_threads=6, batch_size=1000):
    # Do work
In other words the function takes an iterator of texts and produces an iterator of documents. Internally it buffers 1000 texts before working on them with 6 threads.
Each component in the pipeline (the tokenizer, tagger, parser and NER) will operate on the stream. They'll each accumulate a buffer and then send it to threads, but their output will be a generator.
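A rough pure-Python sketch of that buffer-and-thread pattern, using `concurrent.futures` and a dummy work function. (The real speed-up only comes when the work releases the GIL, as the parser now does; `process_stream` here is just an illustrative name.)

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def process_stream(work, texts, n_threads=6, batch_size=1000):
    # Accumulate batch_size items, fan them out to the thread pool,
    # and yield results one by one, preserving input order.
    texts = iter(texts)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while True:
            batch = list(itertools.islice(texts, batch_size))
            if not batch:
                break
            for result in pool.map(work, batch):
                yield result

docs = list(process_stream(str.upper, ['a', 'b', 'c', 'd'],
                           n_threads=2, batch_size=3))
print(docs)  # ['A', 'B', 'C', 'D']
```

Because the output is itself a generator, several such stages can be chained without ever materialising the whole stream.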
Multi-threading is pushed now. Docs are updated, yet to write the tutorial. Full example here: https://github.com/spacy-io/spaCy/blob/master/examples/parallel_parse.py#L48
The multi-threading is working super well so far. On a batch of 10,000 documents, with 1 thread we're getting about 35ms per document for the full pipeline. With 4 threads we're seeing about 10ms — so, close to full efficiency.
I'm also finding working with generators to be super easy. I can recommend the toolz library highly. We should also be able to play nicely with the asyncio stuff in Python 3.5, although I haven't quite gotten my head around it yet.
Some numbers on a larger job --- parsing about 10gb of text, from one month of comments from the Reddit comment corpus. The machine has 92 cores.
| Processes | Threads per process | Time |
|---|---|---|
| 64 | 2 | 99m |
| 32 | 4 | 92m |
We'll run the rest of the experiments to figure out what the best balance is, and probably make some adjustments. But it's already looking really great. If we can run at, say, 32 processes and utilize all 92 cores at full efficiency, we've effectively cut the memory use to about a third. I'm running the 16 processes, 8 threads experiment now. I expect the efficiency will start to drop here. The io, decompression, json parsing, tokenization, tagging and queueing are all done while holding the GIL, so only one thread can be used while the process is setting up the work. So to keep the machine busy we have to have lots of processes, so that some are always parsing.
@honnibal this is really exciting! I'm so glad that you've been able to work on this.
I seem to remember from reading the spaCy commit messages a few months back (yes, I read the commits sometimes :P ) that you switched to ujson for the extra speed boost (I use it too). Since ujson is a C library, would it be possible to use it without the GIL? I won't pretend to understand the threading issues here, but it seemed worth pointing out.
Also, this is probably premature optimization, but you also mentioned io, and I don't know where that is in the spaCy code, but it might be worth making any io you have to do asynchronous, even if it remains single-threaded.
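One simple way to overlap io with processing, without going all the way to asyncio, is to prefetch the next batch on a background thread while the current one is being worked on. `prefetched` and `load_batch` are hypothetical names for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def prefetched(load_batch, batch_ids, pool):
    # Start loading the next batch while the caller is still
    # processing the current one.
    future = pool.submit(load_batch, batch_ids[0])
    for next_id in batch_ids[1:]:
        batch = future.result()           # wait for the current batch
        future = pool.submit(load_batch, next_id)  # kick off the next read
        yield batch
    yield future.result()

with ThreadPoolExecutor(max_workers=1) as pool:
    loaded = list(prefetched(lambda i: 'batch-%d' % i, [0, 1, 2], pool))
print(loaded)  # ['batch-0', 'batch-1', 'batch-2']
```

In a real job, `load_batch` would do the file read and decompression, so the disk stays busy while the CPU parses.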
Also, I really enjoy generators too; I was actually inspired by the gensim goal of memory independence and try to model most of what I write in the same way. I've also recently become fascinated with coroutines. I'll take a look at toolz.
Thanks again!
-- Eric
Is there any way to also pass metadata through the pipe like, for instance, a document id?
Let's say you have a generator of (doc_id, text) tuples. If you use the itertools.tee function, you can split this into two generators:
gen1, gen2 = itertools.tee(id_and_text)
Both gen1 and gen2 will yield identical (doc_id, text) pairs. You can then split these into just the ID and just the text:
ids = (id_ for (id_, text) in gen1)
texts = (text for (id_, text) in gen2)
Then you can use itertools.izip to rejoin the generators, after you pass the texts through .pipe. Putting it all together:
import itertools
from spacy.en import English

def gen_items():
    print("Yield 0")
    yield (0, 'Text 0')
    print("Yield 1")
    yield (1, 'Text 1')
    print("Yield 2")
    yield (2, 'Text 2')

nlp = English()
gen1, gen2 = itertools.tee(gen_items())
ids = (id_ for (id_, text) in gen1)
texts = (text for (id_, text) in gen2)
docs = nlp.pipe(texts)
for id_, doc in itertools.izip(ids, docs):
    print(id_, doc.text)
Ah, thank you! I had assumed that the ordering of the documents returned by pipe was non-deterministic.
Is it? No --- the documents are returned in the same order as the texts.
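The same order-preserving behaviour can be seen with Python's own pool map variants: even when later inputs finish first, results come back in input order. Here a dummy sleep-based function stands in for the pipeline.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool, same API

def analyse(text):
    # Longer inputs sleep longer, so the first input finishes last...
    time.sleep(0.02 * len(text))
    return text.upper()

pool = Pool(3)
# ...yet imap still yields results in input order, like nlp.pipe.
results = list(pool.imap(analyse, ['aaa', 'aa', 'a']))
pool.close()
pool.join()
print(results)  # ['AAA', 'AA', 'A']
```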
@honnibal Hey Matthew, hope all is well --- great news on the press release.
So I'm running the parallel parse script and I end up with "Batch 0... 1... 2..." and so on. When I check my output directory I see .bin files --- is this normal? My input file is enwiki-20160720-pages-articles.xml.bz2:
sudo python ParallelParse.py /home/ubuntu/mindlab/WikiAnalysis/opt/enwiki-20160720-pages-articles.xml.bz2 /home/ubuntu/mindlab/WikiAnalysis/opt -p 64 -t 4
And is that process vs. thread split optimal? Furthermore, will this script alone run the whole spaCy pipeline over my wiki dump, or will I have to use merge_text.py, train_word2vec.py and sense2vec/vectors.pyx separately?
A DUMMIES question; not sure if the above discussion already answers this.
My server-side code (I'm using Flask) does this at start-up:
nlp = spacy.load('en')
It then has to serve requests (multiple concurrent) from the clients. Can I use the same object, as in:
document = nlp(input_text)
@jofatmofn: Did you find out?
@chedched, I am using that style in development, though not in production yet.
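If you're unsure whether a shared pipeline object is safe to call from several request threads, one conservative pattern is to serialise the calls with a lock. `GuardedPipeline` is a hypothetical wrapper, with a dummy function standing in for the loaded model.

```python
import threading

class GuardedPipeline(object):
    # Conservative wrapper: serialise all calls to a shared pipeline
    # with a lock, so only one request thread uses it at a time.
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.lock = threading.Lock()

    def __call__(self, text):
        with self.lock:
            return self.pipeline(text)

nlp = GuardedPipeline(lambda text: text.split())  # dummy stand-in
out = []
threads = [threading.Thread(target=lambda t=t: out.append(nlp(t)))
           for t in ['a b', 'c d']]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(out))  # [['a', 'b'], ['c', 'd']]
```

This trades away concurrency inside the pipeline for safety; if the library is confirmed thread-safe for your calls, the lock can simply be dropped.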
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I've been using spaCy for a few months now (it's amazing!) and I've added it as part of a processing pipeline I'm working on. To get more throughput through the pipeline, I've been using JoinableQueue and multiprocessing.Process to add parallelism. Using this with spaCy creates two difficulties:
My understanding is that arguments passed to workers created with Python multiprocessing are copied by value, which means the tagging/parsing models are duplicated in each worker. You've done an incredible job making the models efficient and compact, but now I'm duplicating them all over the place and wasting your hard-won efficiency. Is there any way the models can be used as some kind of shared memory object?
Similarly, it would be nice if the StringStore could be shared across workers somehow. I'm looking into creating a generative language model based on syntactic ngrams for text classification and an efficient storage for all those ngrams would be really nice (also, if spacy can do it I don't have to work in Java :P ).
I haven't worked in Cython directly before, but having browsed the source of this project for the last couple of months, I'm not afraid to go there, so long as I don't have to port my entire codebase --- e.g. just make the main runner in Cython and leave the rest in Python.
-- Eric
P.S. I probably should have started with this, but is spaCy thread-safe?