Description
When the provided input for a chunked inference request breaks into many chunks, it can exceed the queue size limit on the ML node (either configured by the user or defaulting to 1000). We previously implemented a fix to "batch the chunks" to avoid hitting this queue limit. That fix waits for each batch of chunks to complete before sending the next one, essentially adding a queuing mechanism on top of the existing queue. Long term, we'd like to replace this with a retry strategy on our calls to the queue that backs off when the queue size limit is hit and tries to push again after some period of time.
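
To make the long-term direction concrete, below is a minimal, hypothetical Java sketch of a backoff-and-retry submission loop against a bounded queue. The class name, the `BlockingQueue` stand-in for the ML node queue, and the retry constants are all illustrative assumptions, not the existing implementation or API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: retry pushing chunk tasks onto a bounded queue with
// exponential backoff instead of pre-batching them. The queue and the task
// type are stand-ins for the real ML node queue and inference work items.
public final class BackoffQueueSubmitter {

    private static final int MAX_ATTEMPTS = 5;       // illustrative limits
    private static final long INITIAL_DELAY_MS = 100;
    private static final long MAX_DELAY_MS = 5_000;

    private final BlockingQueue<Runnable> queue;

    public BackoffQueueSubmitter(BlockingQueue<Runnable> queue) {
        this.queue = queue;
    }

    // Tries to enqueue a chunk task; when the queue is full, backs off
    // (exponentially, with jitter) and retries rather than failing the
    // whole chunked inference request immediately.
    public boolean submitWithBackoff(Runnable chunkTask) throws InterruptedException {
        long delay = INITIAL_DELAY_MS;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (queue.offer(chunkTask)) {
                return true; // accepted by the queue
            }
            if (attempt == MAX_ATTEMPTS) {
                break;
            }
            // Queue is at its size limit: wait, then try again with a longer delay.
            long jitter = ThreadLocalRandom.current().nextLong(delay / 2 + 1);
            Thread.sleep(delay + jitter);
            delay = Math.min(delay * 2, MAX_DELAY_MS);
        }
        return false; // give up after MAX_ATTEMPTS; the caller decides how to fail
    }
}
```

The key difference from the current batching fix is that nothing waits for earlier chunks to finish: each chunk is pushed as soon as the queue has room, and only rejected pushes are delayed.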