Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

Enhanced Batching and Progress Reporting in SearchIndexingBufferedSender #33469

Open farzad528 opened 11 months ago

farzad528 commented 11 months ago

Is your feature request related to a problem? Please describe.

When using SearchIndexingBufferedSender to upload a large number of documents (such as 1 billion vectors with text), the current implementation doesn't automatically handle the 64 MB payload size limit or the 32,000-document count limit per batch. As a result, manual intervention is needed to make sure documents are indexed successfully, which is particularly frustrating for long indexing jobs that may run overnight.

Describe the solution you'd like

I propose the following enhancements to the SearchIndexingBufferedSender to make the experience of indexing large datasets more seamless and robust:

Describe alternatives you've considered

An alternative approach would be to manually chunk the data and implement custom logic for progress reporting and error handling. However, this adds complexity and requires additional development effort, which could be avoided if the SDK provided these capabilities out of the box.
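For illustration, the manual chunking alternative might look roughly like the sketch below. It is not part of the SDK: `chunk_documents` is a hypothetical helper, the two limits are the ones quoted in this issue, and the size of each document is only estimated from its JSON encoding, so the real request payload may differ somewhat.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Limits as quoted in this issue; treat them as assumptions, not SDK constants.
MAX_DOCS_PER_BATCH = 32_000
MAX_BATCH_BYTES = 64 * 1024 * 1024


def chunk_documents(
    documents: Iterable[Dict[str, Any]],
    max_docs: int = MAX_DOCS_PER_BATCH,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches that stay within both the count and the size limit.

    The size of a document is approximated by the length of its
    UTF-8 encoded JSON form (hypothetical helper, not an SDK API).
    """
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in documents:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if doc_bytes > max_bytes:
            raise ValueError("a single document exceeds the batch size limit")
        # Close the current batch if adding this document would break a limit.
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch = []
            batch_bytes = 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch
```

Each yielded batch could then be handed to whatever upload call is in use, with a counter incremented per batch to give basic progress reporting. This is exactly the extra code the request hopes to avoid writing by hand.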

Additional context

The goal is to enable users to confidently run a single function to index a massive number of documents without needing to worry about the internal limitations of the service. Enhancing the SearchIndexingBufferedSender with the above features would significantly improve the developer experience and the reliability of the indexing process for large datasets.

farzad528 commented 11 months ago

cc: @mattmsft @BevLoh

mattgotteiner commented 11 months ago

@xiangyan99 I recall we had some mechanisms to do this using SearchIndexingBufferedSender. I'm not sure whether we're missing a sample that shows how to do it or whether there's missing functionality. I could be remembering wrong.

xiangyan99 commented 11 months ago

Per our guidelines, we don't do client-side validation.

Let's work with the architect to see how to address it.