Enhanced Batching and Progress Reporting in SearchIndexingBufferedSender

farzad528 commented 11 months ago

Is your feature request related to a problem? Please describe.

When using SearchIndexingBufferedSender to upload a large number of documents (for example, 1 billion vectors with text), the current implementation doesn't automatically handle the 64 MB payload size limit or the 32,000-document count limit per batch. Users have to intervene manually to make sure documents are indexed successfully, which is particularly frustrating for long-running indexing jobs that may run overnight.

Describe the solution you'd like

I propose the following enhancements to SearchIndexingBufferedSender to make indexing large datasets more seamless and robust:

Automatic Handling of Batch Size: Implement an internal mechanism that calculates the cumulative size of the document payload and the count, and automatically flushes the batch when either limit is approached. This should prevent users from encountering issues related to payload size or document count limits.
Customizable Batch Size: Expose a parameter that allows users to specify their desired batch size or payload size limit. This would give users more control over the batching process and could be used to adjust performance based on their specific use case.
Progress Reporting: Introduce built-in progress reporting for the upload operation, similar to the tqdm library in Python, which provides a visual progress bar. This feature would be extremely helpful for long-running indexing jobs, giving users a clear indication of the operation's progress.
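
To make the first two items concrete, here is a minimal sketch of size-aware and count-aware batching in plain Python. It is not the SDK's implementation: `iter_batches` and the two constants are hypothetical names, the limits are the ones quoted in this issue (please verify them against the current service limits), and the payload size is only approximated by the UTF-8 length of each document's JSON encoding.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Limits as quoted in this issue; assumptions, not values read from the service.
MAX_BATCH_BYTES = 64 * 1024 * 1024
MAX_BATCH_DOCS = 32_000


def iter_batches(
    documents: Iterable[Dict[str, Any]],
    max_bytes: int = MAX_BATCH_BYTES,
    max_docs: int = MAX_BATCH_DOCS,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of documents that stay within both limits."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in documents:
        # Approximate the wire size by the UTF-8 length of the JSON encoding.
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if doc_bytes > max_bytes:
            raise ValueError("a single document exceeds the payload limit")
        # Emit the current batch first if adding this document would break a limit.
        if batch and (batch_bytes + doc_bytes > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    # Emit whatever is left over.
    if batch:
        yield batch
```

Both limits are parameters, so the "customizable batch size" item falls out of the same function, e.g. `iter_batches(docs, max_docs=1000)`. The real request body also carries an envelope and per-document action metadata, so a production version would likely want some headroom below the hard limit.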

Describe alternatives you've considered

An alternative approach would be to manually chunk the data and implement custom logic for progress reporting and error handling. However, this adds complexity and requires additional development effort, which could be avoided if the SDK provided these capabilities out of the box.
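
For reference, the manual alternative might look roughly like the sketch below. `upload_in_chunks` and `report` are hypothetical names. `sender` is assumed to be a SearchIndexingBufferedSender, or any object exposing `upload_documents` and `flush`. The count passed to `report` is the number of documents handed to the sender and flushed, not the number the service confirmed as indexed.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Optional


def upload_in_chunks(
    sender: Any,
    documents: Iterable[Dict[str, Any]],
    chunk_size: int = 1000,
    report: Optional[Callable[[int], None]] = None,
) -> int:
    """Send documents in fixed-size chunks and report a running total."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(documents)
    submitted = 0
    while True:
        # Pull at most chunk_size documents without materializing the whole input.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        sender.upload_documents(documents=chunk)
        sender.flush()
        submitted += len(chunk)
        if report is not None:
            report(submitted)
    return submitted
```

Passing something like `report=lambda n: print(f"{n} documents submitted")` gives basic progress output. Per-document failures would still need separate handling, which is part of the extra effort described above.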

Additional context

The goal is to enable users to confidently run a single function to index a massive number of documents without needing to worry about the internal limitations of the service. Enhancing SearchIndexingBufferedSender with the above features would significantly improve the developer experience and the reliability of the indexing process for large datasets.

farzad528 commented 11 months ago

cc: @mattmsft @BevLoh

mattgotteiner commented 11 months ago

@xiangyan99 I recall we had some mechanisms to do this with SearchIndexingBufferedSender. I'm not sure whether we're missing a sample that shows how to do it or whether the functionality itself is missing. I could be remembering wrong.

xiangyan99 commented 11 months ago

Per our guidelines, we don't do client-side validation.

Let's work with the architect to see how to address it.

Azure / azure-sdk-for-python

Enhanced Batching and Progress Reporting in SearchIndexingBufferedSender #33469