As embedding models are often used in conjunction with RAG, the following functionalities would be helpful:
[ ] Support for very large POST requests. In applications running on langchain, you would call `GradientEmbeddings().embed_documents(docs)`, where `docs` is a list of 200_000+ documents. As it is challenging to fit them all into a single POST, other clients perform micro-batching to solve this (a minimal sketch follows this list). I have seen quite a few bad 3rd-party implementations, often with the following two mistakes:
[ ] Support for async requests (e.g. via aiohttp), so that multiple requests by users can be handled (query mode).
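A minimal sketch of what client-side micro-batching with async requests could look like. The endpoint URL, payload schema, and batch size here are assumptions for illustration only, not the actual Gradient API:

```python
import asyncio

import aiohttp

EMBED_URL = "https://api.example.com/embeddings"  # hypothetical endpoint
BATCH_SIZE = 256  # assumed size that keeps each POST body small


async def _embed_batch(session: aiohttp.ClientSession, batch: list[str]) -> list[list[float]]:
    # One micro-batch per POST; the response field name is an assumption.
    async with session.post(EMBED_URL, json={"inputs": batch}) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["embeddings"]


async def embed_documents(docs: list[str]) -> list[list[float]]:
    # Split the full document list into ordered micro-batches, send them
    # concurrently, and re-assemble the results in the original order.
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(_embed_batch(session, b) for b in batches))
    return [emb for batch in results for emb in batch]


# embeddings = asyncio.run(embed_documents(docs))
```

Concurrency could additionally be capped with an `asyncio.Semaphore`, so a 200_000-document call does not open thousands of simultaneous connections.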
I implemented a minimal client that supports the above features: https://gist.github.com/michaelfeil/8c6829158e7dd6bb5fdf805b5ea2978c
If async is not on the roadmap, I would at least suggest splitting the documents in sorted order and sending the batches with a `ThreadPool`.
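A minimal sketch of that fallback, again with a hypothetical endpoint and payload schema; `ThreadPoolExecutor.map` returns results in submission order, so the embeddings stay aligned with the input documents:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

EMBED_URL = "https://api.example.com/embeddings"  # hypothetical endpoint
BATCH_SIZE = 256  # assumed micro-batch size


def _embed_batch(batch: list[str]) -> list[list[float]]:
    # One micro-batch per POST; the response field name is an assumption.
    resp = requests.post(EMBED_URL, json={"inputs": batch}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"]


def embed_documents(docs: list[str]) -> list[list[float]]:
    # Ordered batches; executor.map preserves order across worker threads.
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(_embed_batch, batches))
    return [emb for batch in results for emb in batch]
```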