hobodrifterdavid opened 1 year ago
I made a script to load test the server, translating the FLORES dataset EN => ES. FLORES sentences are pretty long, newspaper-type sentences.
Here the server handles one request at a time: translate_batch is called with "Req. Batchsize" sentences, the response is sent, then the server receives the next request.
Req. Batchsize: 1, sents/s: 1.89
Req. Batchsize: 4, sents/s: 6.08
Req. Batchsize: 16, sents/s: 15.27
Req. Batchsize: 64, sents/s: 47.00
Req. Batchsize: 256, sents/s: 64.82
Req. Batchsize: 1000, sents/s: 165.06
Here the requesting machine makes 4 concurrent translation requests:
Batchsize: 1, sents/s: 2.89
Batchsize: 4, sents/s: 8.53
Batchsize: 16, sents/s: 25.45
Batchsize: 64, sents/s: 52.98
Batchsize: 256, sents/s: 67.36
Batchsize: 1000, sents/s: 175.07
And 16:
Batchsize: 1, sents/s: 2.90
Batchsize: 4, sents/s: 8.54
Batchsize: 16, sents/s: 25.49
Batchsize: 64, sents/s: 52.67
Batchsize: 256, sents/s: 65.13
Batchsize: 1000, sents/s: 174.77
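The load-test client is along these lines (a minimal sketch, not the actual script; the /translate endpoint name and payload shape are assumptions):

```python
# Minimal load-test sketch: CONCURRENCY workers each send one batch per request
# for 60 seconds and we count translated sentences per second.
import asyncio
import time

import httpx

URL = "http://localhost:8000/translate"  # assumed endpoint
CONCURRENCY = 16
BATCH_SIZE = 64

async def worker(client: httpx.AsyncClient, sentences: list[str], counter: dict):
    # Repeatedly send one batch per request and count translated sentences.
    while time.time() < counter["deadline"]:
        resp = await client.post(
            URL,
            json={"sentences": sentences, "source": "eng_Latn", "target": "spa_Latn"},
            timeout=300,
        )
        resp.raise_for_status()
        counter["sents"] += len(sentences)

async def main():
    sentences = ["This is a FLORES-length test sentence."] * BATCH_SIZE
    counter = {"sents": 0, "deadline": time.time() + 60}
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(worker(client, sentences, counter) for _ in range(CONCURRENCY)))
    print(f"Batchsize: {BATCH_SIZE}, sents/s: {counter['sents'] / 60:.2f}")

asyncio.run(main())
```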
I think the code may benefit from figuring out translate_batch with asynchronous=True, or else certainly from a task queue that batches across requests. Does this code exist already?
EDIT: going to --workers 2, I'm seeing GPU RAM at 16 GB+ (it's 8 GB+ with a single worker), so definitely two models loaded. Running again with 16 concurrent requests, it's a bit faster:
Batchsize: 1, sents/s: 3.67
Batchsize: 4, sents/s: 11.88
Batchsize: 16, sents/s: 33.95
Batchsize: 64, sents/s: 53.02
Batchsize: 256, sents/s: 87.28
Batchsize: 1000, sents/s: OOM
EDIT2: I found the code snippet for asynchronous=True; it doesn't help performance here, though. After a bit of reading, FastAPI already runs sync handlers in a thread pool. Numbers with 16 concurrent requests, 1 uvicorn worker:
Batchsize: 1, sents/s: 2.50
Batchsize: 4, sents/s: 8.22
Batchsize: 16, sents/s: 24.26
Batchsize: 64, sents/s: 50.99
Batchsize: 256, sents/s: 66.62
Batchsize: 1000, sents/s: 172.69
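For reference, the asynchronous=True usage is roughly the following sketch (assuming `translator` is an already-constructed ctranslate2.Translator for a converted NLLB model, and `batch_tokens` / `target_prefixes` are pre-tokenized NLLB inputs; not my exact code):

```python
# Sketch of translate_batch(asynchronous=True); `translator`, `batch_tokens`,
# and `target_prefixes` are assumed to exist (pre-tokenized NLLB inputs).
async_results = translator.translate_batch(
    batch_tokens,
    target_prefix=target_prefixes,
    asynchronous=True,   # returns a list of AsyncTranslationResult immediately
)
# Each .result() blocks until that item is finished, so calling it inside an
# async FastAPI handler still blocks the event loop -- which is why
# asynchronous=True alone doesn't change the numbers above.
translations = [r.result().hypotheses[0] for r in async_results]
```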
EDIT3: Same as above (16 concurrent requests, 1 uvicorn worker), with 1 GPU only (device_index=[0]):
Batchsize: 1, sents/s: 2.61
Batchsize: 4, sents/s: 7.69
Batchsize: 16, sents/s: 21.74
Batchsize: 64, sents/s: 52.34
Batchsize: 256, sents/s: 58.01
Batchsize: 1000, sents/s: 73.07
With 2 GPU (device_index=[0,1]):
Batchsize: 1, sents/s: 2.71
Batchsize: 4, sents/s: 8.51
Batchsize: 16, sents/s: 25.33
Batchsize: 64, sents/s: 52.62
Batchsize: 256, sents/s: 67.17
Batchsize: 1000, sents/s: 138.76
Seems like multi-GPU only starts to give a benefit above batch sizes of 256.
EDIT: I made code that collects requests and batches them together; it's a big improvement when there are lots of small requests. Will post the code tomorrow.
Concurrent requests are processed sequentially in the translate function, which is not ideal. Ideally, translate should be called from multiple Python threads, which would automatically enable multi-GPU translations. Some webservers allow using multiple worker threads (not processes!), but that does not seem to be the case for uvicorn.
Note that in your example, multiple GPUs are only used in the "Batchsize: 1000" case: the request will be rebatched with max_batch_size=256 and each sub-batch will be executed by a different GPU. For all other cases only a single GPU is working at a time, because translate is executed sequentially AND the request size is smaller than or equal to max_batch_size.
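Concretely, "call translate from multiple Python threads" looks roughly like this sketch (`translator` and the pre-tokenized batches are assumed to exist; this is an illustration, not code from CTranslate2):

```python
from concurrent.futures import ThreadPoolExecutor

def handle_one_request(batch_tokens):
    # Each thread issues its own translate_batch call; CTranslate2 releases the
    # GIL during translation, so calls from different threads can run on
    # different GPUs at the same time.
    return translator.translate_batch(batch_tokens, max_batch_size=256)

pending_batches = [...]  # one pre-tokenized batch per incoming request
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_one_request, pending_batches))
```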
Hmm. OK, so it seems I had an 'async' FastAPI handler, but I was calling blocking functions (translate_batch) inside it. If the request handler isn't 'async', FastAPI runs it in a thread, so it doesn't block the handling of other requests. (I am explaining this to myself as I figure it out.)
Solution one is just to remove the async keyword from the request handler, so FastAPI runs the handler in a thread.
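A minimal sketch of solution one, assuming a /translate endpoint and a hypothetical do_translate helper (not the exact handler from the repo):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TranslateRequest(BaseModel):
    sentences: list[str]
    source: str   # e.g. "eng_Latn"
    target: str   # e.g. "spa_Latn"

def do_translate(sentences: list[str], source: str, target: str) -> list[str]:
    # Hypothetical helper: tokenize, call translator.translate_batch, detokenize.
    ...

@app.post("/translate")
def translate(req: TranslateRequest):   # plain `def`, not `async def`
    # FastAPI runs non-async handlers in its thread pool, so this blocking
    # call no longer stalls the event loop.
    return {"translations": do_translate(req.sentences, req.source, req.target)}
```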
Performance (16 concurrent requests, 4 GPUs, 1 uvicorn worker, switched to max_batch_size 128 to avoid potential OOM errors):
Batchsize: 1, sents/s: 7.82
Batchsize: 4, sents/s: 25.04
Batchsize: 16, sents/s: 94.00
Batchsize: 64, sents/s: 197.08
Batchsize: 256, sents/s: 205.18
Batchsize: 1000, sents/s: 251.80
Solution two is to keep the 'async' request handler but wrap the blocking call so that it also runs in a thread and doesn't block the event loop; see the sketch below.
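Something along these lines, reusing the app, TranslateRequest, and do_translate stub from the previous sketch (again an illustration, not the exact code):

```python
from fastapi.concurrency import run_in_threadpool

@app.post("/translate")
async def translate(req: TranslateRequest):
    # The blocking translation runs in a worker thread; the event loop stays free.
    results = await run_in_threadpool(do_translate, req.sentences, req.source, req.target)
    return {"translations": results}
```

Performance with solution two, same setup as above: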
Batchsize: 1, sents/s: 6.98
Batchsize: 4, sents/s: 25.10
Batchsize: 16, sents/s: 97.21
Batchsize: 64, sents/s: 194.80
Batchsize: 256, sents/s: 217.99
Batchsize: 1000, sents/s: 249.32
Performance is essentially the same as solution one.
OK, now I switch on the code that collects translations from many requests and processes them together in one batch (16 concurrent requests):
Batchsize: 1, sents/s: 11.89
Batchsize: 4, sents/s: 31.72
Batchsize: 16, sents/s: 74.01
Batchsize: 64, sents/s: 195.33
Batchsize: 256, sents/s: 176.32
Batchsize: 1000, sents/s: 236.62
Increasing concurrent requests to 128 (FastAPI only allows 40 to be handled at a time, IIRC):
Batchsize: 1, sents/s: 48.86
Batchsize: 4, sents/s: 93.88
Batchsize: 16, sents/s: 190.34
Batchsize: 64, sents/s: 177.50
Batchsize: 256, sents/s: 169.45
Batchsize: 1000, sents/s: 234.16
The code is here, it's a bit ugly: https://github.com/hobodrifterdavid/nllb-docker-rest/blob/main/app.py
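Roughly, the idea is the following (a simplified sketch, not the actual app.py; blocking_translate is a hypothetical stand-in for the tokenize + translate_batch call):

```python
import asyncio

def blocking_translate(sentences: list[str]) -> list[str]:
    # Hypothetical blocking helper that calls translator.translate_batch under the hood.
    ...

queue: "asyncio.Queue[tuple[list[str], asyncio.Future]]" = asyncio.Queue()
MAX_WAIT = 0.01    # seconds to keep collecting requests for one batch
MAX_SENTS = 256    # cap on the combined batch size

async def batch_worker():
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]                       # wait for the first request
        deadline = loop.time() + MAX_WAIT
        while sum(len(s) for s, _ in items) < MAX_SENTS:  # keep collecting briefly
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        sentences = [sent for sents, _ in items for sent in sents]
        # Run the blocking translation off the event loop, as one combined batch.
        translations = await loop.run_in_executor(None, blocking_translate, sentences)
        offset = 0
        for sents, fut in items:                          # hand each request its slice back
            fut.set_result(translations[offset:offset + len(sents)])
            offset += len(sents)

async def translate_via_queue(sentences: list[str]) -> list[str]:
    # Called from the request handler; resolves once the combined batch is done.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((sentences, fut))
    return await fut

# The worker is started once, e.g. asyncio.create_task(batch_worker()) in a
# FastAPI startup hook.
```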
There's a potential concern that the single Python process is bottlenecking throughput; it might be better to start two containers or something and split the GPUs between them. I didn't check this. Or you might have two sets of GPUs handling requests independently within one Python process. From here a decision is needed about how to balance latency and throughput, possibly prioritising shorter translations.
Batchsize: 1, sents/s: 7.82
Batchsize: 4, sents/s: 25.04
Batchsize: 16, sents/s: 94.00
Batchsize: 64, sents/s: 197.08
Batchsize: 256, sents/s: 205.18
Batchsize: 1000, sents/s: 251.80
These numbers look good to me. The performance increase is almost linear with the number of GPUs, e.g. for "Batchsize: 64" going from 52.34 to 197.08 is a ~3.8x speedup.
@hobodrifterdavid why don't you make a PR to replace our old-style translation_server based on Flask with FastAPI + uvicorn?
See here: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/bin/server.py. For your info, I made a tutorial on finetuning NLLB-200 here: https://forum.opennmt.net/t/finetuning-and-curating-nllb-200-with-opennmt-py/5238
It would be helpful to have a robust and fast server solution for OpenNMT. Cheers.
@vince62s Hi Vince. One advantage of the sketch I made (there's no error handling etc. yet) is that one Python process handles multiple concurrent requests. This means you can translate sentences from multiple requests together, calling translate_batch once, which helps a lot when you are handling many small translations with NLLB. With NLLB, a single batch can contain translations for different language pairs; you can't do that with Marian models etc. Actually, I'm not certain that CTranslate2 doesn't have some kind of combine-smaller-translations-into-a-bigger-batch code internally (@guillaumekln); I could run the code without the request-batching stuff, with say 128 concurrent requests, to check. I'd like to contribute, I'm just a little crushed currently, as we're deploying a chat feature on Language Reactor. Actually the 'sketch' is already handling translations for 'Text Mode' (you import a website / paste a text).
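For example, mixing language pairs in one NLLB batch looks roughly like this (illustrative placeholder token lists; `translator` is assumed to be an existing ctranslate2.Translator for a converted NLLB model):

```python
# The source language tag travels with each tokenized sentence, and the target
# language is given per example via target_prefix, so these two requests can
# share a single translate_batch call.
batch_tokens = [
    ["▁Hello", "▁world", "</s>", "eng_Latn"],   # English source
    ["▁Hola", "▁mundo", "</s>", "spa_Latn"],    # Spanish source in the same batch
]
target_prefixes = [
    ["spa_Latn"],   # translate the first item into Spanish
    ["deu_Latn"],   # and the second into German
]
results = translator.translate_batch(batch_tokens, target_prefix=target_prefixes)
```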
I noticed that if you give NLLB subtitles to translate with multiple sentences, it only translates one sentence (seemingly the longest). The OPUS models do this too, I think. So a sentence tokeniser is needed that handles 100+ languages. UDPipe can handle maybe 50, but it's an expensive way to split sentences. I'll check your repo to see what you are doing.
EDIT: I'm probably missing something simple, but this web demo also translates only a single sentence and ignores the second: https://huggingface.co/spaces/Geonmo/nllb-translation-demo
Finetuning NLLB-200: this is very cool and I will certainly take a look, thanks for that. Can you also use, say, 4x 24 GB GPUs for training the 1.3B model? Anyway, getting off topic. :)
Actually, I'm not certain that CTranslate2 doesn't have some kind of combine-smaller-translations-into-a-bigger-batch code internally
There is a C++ class that does this, but it's not currently used and there is no equivalent in the Python API.
Thanks Guillaume.
It was recommended to me to use the split-sentences.perl logic from Moses. Sacremoses (a Python reimplementation of some Moses scripts) doesn't have a complete reimplementation of split-sentences.perl, so I used mosestokenizer (https://github.com/luismsgomes/mosestokenizer), which bridges from Python to the original Perl code using pipes. The code (some version) is also in the Moses repo (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/mosestokenizer/sentsplitter.py)
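Usage is roughly like this (the exact mosestokenizer API may differ slightly from this sketch):

```python
from mosestokenizer import MosesSentenceSplitter

text = "This is the first sentence. And here is the second one."
with MosesSentenceSplitter("en") as split_sents:
    sentences = split_sents([text])   # -> one string per sentence
# Each sentence is then translated separately and the results joined afterwards.
```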
EDIT: In the NLLB paper, they mention the stopes data prep pipeline; my repo now has the relevant code for sentence segmenting and cleaning from there (stopes_snippet folder).
A few notes:
I have a custom implementation using a Flask REST API server which can help with preserving the structure of sentences with regard to newlines or custom strings/characters you may want to exclude from translation requests to the model, if needed.
There was a nasty bug when batches were processed with multiple languages; pushed a fix. The code seems pretty stable otherwise.
Hi, were you able to figure out how to translate everything in one batch when the requests arrive within a certain time delta?
I just stumbled upon this issue as I want to do something extremely similar (FastAPI + NLLB, though without any GPUs). It'll be at least a few months until I have time to look into any of this again, but I wonder if gunicorn could be helpful here with the multithreaded stuff?
https://fastapi.tiangolo.com/deployment/server-workers/
Edit: on second thought, I don't think this is what you're looking for, as it would surely just multiply the Python processes (and RAM requirements). It seems like what you did with async batching of requests is a nice way to handle it all.
Perhaps it could be combined with an effort such as the one mentioned in this issue on continuous batching? (Though it was already confirmed there that it isn't and won't really be possible, and a batching mechanism like the one already built here was recommended.) https://github.com/OpenNMT/CTranslate2/issues/1333
Hello. So, I want to run the NLLB-200 (3.3B) model on a server with 4x 3090 and, say, a 16-core AMD Epyc CPU. I wrapped CTranslate2 in FastAPI, running with uvicorn, inside a Docker container with GPU support.
All code is here, feel free to do whatever with it: https://github.com/hobodrifterdavid/nllb-docker-rest
I want to handle requests with between 1 and 1000 sentences, with a reasonable balance between latency and throughput.
Here are a few things I did, from reading the documentation (collected into a sketch after this list):
For ctranslate2.Translator:
- device='auto' (may use CPU for very small translations?)
- compute_type="float16"
- device_index=[0, 1, 2, 3]
For translator.translate_batch:
- max_batch_size=256 (bigger than this I get CUDA OOM errors)
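Collected into one sketch (the model path and token lists are placeholders, not my exact code):

```python
import ctranslate2

translator = ctranslate2.Translator(
    "nllb-200-3.3B-ct2",        # placeholder path to the converted model
    device="auto",              # may use CPU for very small translations?
    device_index=[0, 1, 2, 3],
    compute_type="float16",
)

# Placeholder pre-tokenized input: real inputs are SentencePiece pieces with
# the NLLB language tags applied, plus one target-language prefix per sentence.
batch_tokens = [["▁Hello", "▁world", "</s>", "eng_Latn"]]
target_prefixes = [["spa_Latn"]]

results = translator.translate_batch(
    batch_tokens,
    target_prefix=target_prefixes,
    max_batch_size=256,         # bigger than this I get CUDA OOM errors
)
```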
I tried to use translate_batch with asynchronous=True, but couldn't easily figure out how to await the results (EDIT: figured it out, added results below). uvicorn is run without the --workers flag, so it defaults to a single Python process, with a single model loaded into GPU RAM. FastAPI accepts up to 40 concurrent requests.
Anyway, I'll carry on trying to improve this setup and will post further results. Any suggestions for something I missed would be appreciated. Python is not my first language, please excuse naive errors.