explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.71k stars · 4.36k forks

Documentation needed on how to speed up the nlp.pipe() usage #5239

Closed RevanthRameshkumar closed 4 years ago

RevanthRameshkumar commented 4 years ago

There is documentation on how to use nlp.pipe() using a single process and not specifying batch size: https://spacy.io/usage/processing-pipelines

And there is brief documentation on setting n_process and batch size: https://spacy.io/api/language#pipe

But I am finding it hard to get a clear answer on the relationship between batch_size and n_process for a simple use case like entity extraction. So far, the vanilla nlp.pipe() is significantly faster than nlp(), as expected. However, nlp.pipe(texts, n_process=cpu_count() - 1) is much slower than plain nlp.pipe(), even after scanning through batch_size options (50 to 1000).

On a small dataset of 2000 sentences:

data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000

nlp.pipe() takes ~2 seconds, whereas nlp.pipe(texts, n_process=cpu_count() - 1) takes up to ~30 seconds and plain nlp() takes ~14 seconds.

It would be good to know how to set n_process and batch_size given a maximum cpu_count() from the multiprocessing library.

Additional info: I'm using Windows and have a 12-core CPU.
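For reference, the comparison described above can be sketched as a minimal timing script. This uses a blank English pipeline (spacy.blank("en")) instead of a trained model, so absolute timings will differ from the numbers quoted; only the data list comes from the report above:

```python
from time import perf_counter

import spacy

# Blank pipeline as a stand-in for the trained model used in the report.
nlp = spacy.blank("en")
data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000

# One document at a time via nlp()
start = perf_counter()
docs_single = [nlp(text) for text in data]
t_single = perf_counter() - start

# Batched, single process via nlp.pipe()
start = perf_counter()
docs_piped = list(nlp.pipe(data))
t_piped = perf_counter() - start

print(f"nlp():      {t_single:.2f}s")
print(f"nlp.pipe(): {t_piped:.2f}s")
```

Adding n_process to the nlp.pipe() call would complete the three-way comparison, but on Windows that call must live under an `if __name__ == "__main__":` guard.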

adrianeboyd commented 4 years ago

The main issue is that multiprocessing has a lot of overhead when starting child processes, and this overhead is especially high on Windows, which uses spawn instead of fork. You might see improvements from multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.

You might also see some improvements by using a smaller number of processes for shorter tasks, but you'd have to experiment with this. It depends on the model and pipeline, how long your texts are, etc. and there's no single set of guidelines that will be optimal for every case.

See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
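One practical consequence of the spawn start method is that multiprocessing code, including nlp.pipe(n_process=...), must be guarded so child processes can safely re-import the script. A minimal sketch (spacy.blank("en") is used here as a stand-in for a trained model; the texts are made up):

```python
import spacy


def main():
    nlp = spacy.blank("en")  # swap in your trained model
    texts = ["This is a sentence."] * 100
    # Without the __main__ guard below, spawn-based platforms (Windows,
    # recent macOS) would re-execute this module in every child process.
    docs = list(nlp.pipe(texts, n_process=2, batch_size=25))
    print(len(docs))


if __name__ == "__main__":
    main()
```

Starting those child processes is the fixed cost adrianeboyd describes: for a job that finishes in a couple of seconds single-process, the startup overhead alone can exceed the work being parallelized.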

RevanthRameshkumar commented 4 years ago

> You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.

In this case, are longer tasks equivalent to a much larger batch size? Are we creating a child process per batch in spaCy?
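For intuition, the usual design (and spaCy's internals may differ in detail) is not a child process per batch: a fixed set of workers is started once, and batches of batch_size texts are streamed to the already-running workers. A stdlib-only sketch of that persistent-worker pattern, with a trivial stand-in for the pipeline:

```python
from itertools import islice
from multiprocessing import Pool


def process_batch(batch):
    # Stand-in for running an NLP pipeline over one batch of texts.
    return [text.upper() for text in batch]


def batched(items, size):
    # Yield successive lists of `size` items from any iterable.
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


if __name__ == "__main__":
    texts = [f"sentence {i}" for i in range(100)]
    # The pool's workers are created once, up front; each batch is then
    # dispatched to an already-running worker, so the per-batch cost is
    # mostly serialization, not process startup.
    with Pool(processes=4) as pool:
        results = [doc
                   for batch in pool.imap(process_batch, batched(texts, 25))
                   for doc in batch]
    print(len(results))
```

Under this design, "longer tasks" means more total work per pipe() call (so the one-time startup cost is amortized), not a larger batch_size per se.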

bgeneto commented 4 years ago

> There is documentation on how to use nlp.pipe() using a single process and not specifying batch size: https://spacy.io/usage/processing-pipelines
>
> And there is brief documentation on setting n_process and batch size: https://spacy.io/api/language#pipe
>
> But I am finding it hard to get a clear answer on the relationship between batch_size and n_process for a simple use case like entity extraction. So far, the vanilla nlp.pipe() is significantly faster than nlp(), as expected. However, nlp.pipe(texts, n_process=cpu_count() - 1) is much slower than plain nlp.pipe(), even after scanning through batch_size options (50 to 1000). On a small dataset of 2000 sentences, data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000, nlp.pipe() takes ~2 seconds, whereas nlp.pipe(texts, n_process=cpu_count() - 1) takes up to ~30 seconds and plain nlp() takes ~14 seconds.
>
> It would be good to know how to set n_process and batch_size given a maximum cpu_count() from the multiprocessing library.
>
> Additional info: I'm using Windows and have a 12-core CPU.

I've tried everything I could but couldn't find a single example where n_process > 1 resulted in better performance. In fact, performance is terribly worse even with n_process = 2. I think the developers should provide a minimal speed-up example; otherwise many people out there will lose precious time testing and benchmarking just to find out that there is no way to parallelize this kind of job with spaCy/scispacy.

RevanthRameshkumar commented 4 years ago

I'm going to close this on my end. If there are further issues regarding documentation, there should be a new issue opened.

mariomeissner commented 4 years ago

I've run some tests with n_process as well. My results differ depending on whether I disable components or not. Leaving all pipeline components enabled, I saw decent speed boosts as I increased the number of cores. However, when disabling pipeline components (disable=["parser", "tagger", "ner"]), the results actually got slightly worse as I increased the cores.

I ran these tests on Linux with 64GB RAM and i7-8700K 6-core CPU.

Honestly, I would have expected a boost even when disabling components, since tokenization itself could be done in parallel across several CPUs, but it seems that only the pipeline components themselves are parallelized.
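This observation is consistent with tokenization being very cheap per text: with the slow components disabled, there is little work left to amortize the multiprocessing overhead. A small sketch of what such a stripped-down pipeline does per text (spacy.blank("en") is used here so no trained model is needed; with a trained model you would write spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"]) instead, which leaves essentially the same tokenization-only workload):

```python
import spacy

# Tokenization-only pipeline: no parser, tagger, or NER to run.
nlp = spacy.blank("en")

texts = ["This is a short sentence."] * 1000

# Single process: each text costs only a tokenizer pass, so the per-text
# work is tiny compared to the cost of shipping it to a child process.
docs = list(nlp.pipe(texts, batch_size=100))
print(len(docs), [t.text for t in docs[0]])
```

When the per-text cost is this small, serializing texts to workers and Doc objects back can easily exceed the saved compute, which would explain the slight slowdown with more cores.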

afparsons commented 3 years ago

I'm planning on running some real benchmarks which track:

...all while incrementally stepping up n_process, batch_size, max_length, text input length, etc.

However, I won't get to that for another month or so.

In the meanwhile, here's something extremely messy that I just cobbled together:

from time import perf_counter

for bs in range(5, 205, 10):
    start = perf_counter()
    pipe = nlp.pipe(texts=l_qs, n_process=15, batch_size=bs)
    # doc._.word2vec_tokens is a custom extension attribute defined elsewhere
    tokens = [doc._.word2vec_tokens for doc in pipe]
    duration = perf_counter() - start
    print(f'{len(l_qs)=}, n=15, {bs=}, {duration=}')
len(l_qs)=20000, n=15, bs=5, duration=23.126640622504056
len(l_qs)=20000, n=15, bs=15, duration=18.476141318678856
len(l_qs)=20000, n=15, bs=25, duration=16.737955709919333
len(l_qs)=20000, n=15, bs=35, duration=16.13755234517157
len(l_qs)=20000, n=15, bs=45, duration=15.856405307538807
len(l_qs)=20000, n=15, bs=55, duration=16.509178745560348
len(l_qs)=20000, n=15, bs=65, duration=15.92285352293402
len(l_qs)=20000, n=15, bs=75, duration=15.811818609945476
len(l_qs)=20000, n=15, bs=85, duration=16.032887232489884
len(l_qs)=20000, n=15, bs=95, duration=16.083074554800987
len(l_qs)=20000, n=15, bs=105, duration=16.53166967909783
len(l_qs)=20000, n=15, bs=115, duration=15.948719997890294
len(l_qs)=20000, n=15, bs=125, duration=16.174022526480258
len(l_qs)=20000, n=15, bs=135, duration=16.075954588130116
len(l_qs)=20000, n=15, bs=145, duration=16.281513331457973
len(l_qs)=20000, n=15, bs=155, duration=16.614797863177955
len(l_qs)=20000, n=15, bs=165, duration=16.395240667276084
len(l_qs)=20000, n=15, bs=175, duration=16.567301841452718
len(l_qs)=20000, n=15, bs=185, duration=16.598147234879434
len(l_qs)=20000, n=15, bs=195, duration=16.618470830842853

l_qs is simply a list of sentences.

Please don't take these numbers too seriously. In this specific case, batch_size = n_process * 5 seemed about optimal, but I'm sure there are many variables at play here.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.