aniketmaurya opened 1 week ago
Also, to add to the question above regarding scaling: how do you scale this to handle hundreds of requests per second? If you're running in the cloud, do you spin up multiple containers?
Single container: the benchmark shows a BERT-Large model served with automatic batching and multiprocessing. A single model process runs prediction on a batch of 16-32 requests to increase throughput. Additionally, if GPU memory allows, it can spin up extra processes to handle more requests. Requests are load balanced across processes via the uvicorn socket.
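For illustration, here is a minimal LitServe-style sketch of that setup, assuming a Hugging Face `transformers` text-classification pipeline as a stand-in for the BERT-Large benchmark model; the batch size, timeout, and worker counts below are placeholder values, not the benchmark configuration.

```python
# Minimal sketch: dynamic batching + multiple worker processes with LitServe.
# Assumes `litserve` and `transformers` are installed; the model name and
# server settings are illustrative, not the benchmark config.
import litserve as ls
from transformers import pipeline


class BertLitAPI(ls.LitAPI):
    def setup(self, device):
        # One model copy per worker process, placed on the device LitServe assigns.
        self.classifier = pipeline(
            "text-classification", model="bert-large-uncased", device=device
        )

    def decode_request(self, request):
        # Pull the raw text out of each incoming JSON request.
        return request["text"]

    def predict(self, batch):
        # `batch` is the list of texts collected by the dynamic batcher;
        # the pipeline runs them through the model in a single forward pass.
        return self.classifier(batch)

    def encode_response(self, output):
        return {"label": output["label"], "score": output["score"]}


if __name__ == "__main__":
    server = ls.LitServer(
        BertLitAPI(),
        accelerator="auto",
        max_batch_size=16,     # group up to 16 requests per forward pass
        batch_timeout=0.05,    # wait up to 50 ms to fill a batch
        workers_per_device=2,  # extra model processes if GPU memory allows
    )
    server.run(port=8000)
```

Requests arriving within the `batch_timeout` window get grouped into one batch, and `workers_per_device` controls how many model processes share each GPU.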
And yes, in the cloud you can also spin up multiple containers to scale further.
Docling is a great project! Got to know about this from Spacy-layout.
This is powered by vanilla FastAPI, which is fine but won't scale and lacks features like dynamic batching and autoscaling. I would suggest using a library specialized for serving ML-based APIs, such as LitServe or RayServe.
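As a rough sketch of what that could look like, here is Docling's `DocumentConverter` (as shown in its README) wrapped in a LitServe endpoint; the request/response shape is just an assumption for illustration.

```python
# Illustrative only: serve Docling conversion behind LitServe.
# Assumes `litserve` and `docling` are installed; the JSON schema is made up.
import litserve as ls
from docling.document_converter import DocumentConverter


class DoclingAPI(ls.LitAPI):
    def setup(self, device):
        # One converter instance per worker process.
        self.converter = DocumentConverter()

    def decode_request(self, request):
        # Expect a JSON body like {"source": "<path or URL to a document>"}.
        return request["source"]

    def predict(self, source):
        # Convert the document and export it to markdown.
        result = self.converter.convert(source)
        return result.document.export_to_markdown()

    def encode_response(self, markdown):
        return {"markdown": markdown}


if __name__ == "__main__":
    server = ls.LitServer(DoclingAPI(), accelerator="auto", workers_per_device=1)
    server.run(port=8000)
```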