aniketmaurya opened 1 week ago
Also, to add to the question above regarding scaling: how do you scale this to handle hundreds of requests per second? If you're running in the cloud, do you spin up multiple containers?
Single container: the benchmark shows a BERT-Large model served with automatic batching and multiprocessing. A single model process runs prediction on a batch of 16-32 requests to increase throughput. Additionally, if GPU memory allows, it can spin up extra processes to handle more requests. Requests are load balanced across processes via the uvicorn socket.
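For illustration, here is a minimal LitServe-style sketch of that setup, assuming a Hugging Face `transformers` text-classification pipeline as a stand-in for the BERT-Large benchmark model; the batch size, timeout, and worker counts below are placeholder values, not the benchmark configuration.

```python
# Minimal sketch: dynamic batching + multiple worker processes with LitServe.
# Assumes `litserve` and `transformers` are installed; the model name and
# server settings are illustrative, not the benchmark config.
import litserve as ls
from transformers import pipeline


class BertLitAPI(ls.LitAPI):
    def setup(self, device):
        # One model copy per worker process, placed on the device LitServe assigns.
        self.classifier = pipeline(
            "text-classification", model="bert-large-uncased", device=device
        )

    def decode_request(self, request):
        # Pull the raw text out of each incoming JSON request.
        return request["text"]

    def predict(self, batch):
        # `batch` is the list of texts collected by the dynamic batcher;
        # the pipeline runs them through the model in a single forward pass.
        return self.classifier(batch)

    def encode_response(self, output):
        return {"label": output["label"], "score": output["score"]}


if __name__ == "__main__":
    server = ls.LitServer(
        BertLitAPI(),
        accelerator="auto",
        max_batch_size=16,     # group up to 16 requests per forward pass
        batch_timeout=0.05,    # wait up to 50 ms to fill a batch
        workers_per_device=2,  # extra model processes if GPU memory allows
    )
    server.run(port=8000)
```

Requests arriving within the `batch_timeout` window get grouped into one batch, and `workers_per_device` controls how many model processes share each GPU.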
And yes, in the cloud you can also spin up multiple containers to scale further.
Docling is a great project! Got to know about this from Spacy-layout.
This is powered by vanilla FastAPI, which is fine but won't scale and lacks features like dynamic batching and autoscaling. I would suggest using a library specialized for serving ML-based APIs, such as LitServe or RayServe.
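As a rough sketch of what that could look like, here is Docling's `DocumentConverter` (as shown in its README) wrapped in a LitServe endpoint; the request/response shape is just an assumption for illustration.

```python
# Illustrative only: serve Docling conversion behind LitServe.
# Assumes `litserve` and `docling` are installed; the JSON schema is made up.
import litserve as ls
from docling.document_converter import DocumentConverter


class DoclingAPI(ls.LitAPI):
    def setup(self, device):
        # One converter instance per worker process.
        self.converter = DocumentConverter()

    def decode_request(self, request):
        # Expect a JSON body like {"source": "<path or URL to a document>"}.
        return request["source"]

    def predict(self, source):
        # Convert the document and export it to markdown.
        result = self.converter.convert(source)
        return result.document.export_to_markdown()

    def encode_response(self, markdown):
        return {"markdown": markdown}


if __name__ == "__main__":
    server = ls.LitServer(DoclingAPI(), accelerator="auto", workers_per_device=1)
    server.run(port=8000)
```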