As embedding models are often used in conjunction with RAG, the following functionalities would be helpful:
[ ] Support for very large POST requests. In applications running on langchain, you would call `GradientEmbeddings().embed_documents(docs)`, where `docs` is a list of 200_000+ documents. As it is challenging to fit them all into a single POST, other clients perform micro-batching to solve this (a minimal sketch follows this list). I have seen quite a few bad 3rd-party implementations, often with the following two mistakes:
[ ] Support for async requests (e.g. via aiohttp), so that multiple requests by users can be handled (query mode).
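A minimal sketch of what client-side micro-batching with async requests could look like. The endpoint URL, payload schema, and batch size here are assumptions for illustration only, not the actual Gradient API:

```python
import asyncio

import aiohttp

EMBED_URL = "https://api.example.com/embeddings"  # hypothetical endpoint
BATCH_SIZE = 256  # assumed size that keeps each POST body small


async def _embed_batch(session: aiohttp.ClientSession, batch: list[str]) -> list[list[float]]:
    # One micro-batch per POST; the response field name is an assumption.
    async with session.post(EMBED_URL, json={"inputs": batch}) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["embeddings"]


async def embed_documents(docs: list[str]) -> list[list[float]]:
    # Split the full document list into ordered micro-batches, send them
    # concurrently, and re-assemble the results in the original order.
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(_embed_batch(session, b) for b in batches))
    return [emb for batch in results for emb in batch]


# embeddings = asyncio.run(embed_documents(docs))
```

Concurrency could additionally be capped with an `asyncio.Semaphore`, so a 200_000-document call does not open thousands of simultaneous connections.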
I implemented a minimal client that supports the above features: https://gist.github.com/michaelfeil/8c6829158e7dd6bb5fdf805b5ea2978c
If async is not on the roadmap, I would at least suggest splitting the documents in sorted order and sending the batches with a `ThreadPool`.
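A minimal sketch of that fallback, again with a hypothetical endpoint and payload schema; `ThreadPoolExecutor.map` returns results in submission order, so the embeddings stay aligned with the input documents:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

EMBED_URL = "https://api.example.com/embeddings"  # hypothetical endpoint
BATCH_SIZE = 256  # assumed micro-batch size


def _embed_batch(batch: list[str]) -> list[list[float]]:
    # One micro-batch per POST; the response field name is an assumption.
    resp = requests.post(EMBED_URL, json={"inputs": batch}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"]


def embed_documents(docs: list[str]) -> list[list[float]]:
    # Ordered batches; executor.map preserves order across worker threads.
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(_embed_batch, batches))
    return [emb for batch in results for emb in batch]
```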