Closed: justinusmenzel closed this issue 3 years ago
Thanks for the feature request. I've added it to our backlog. We'll update this ticket with more details once we've had a chance to prioritize it against the other items in the backlog.
The queueing functionality already occurs within the stream itself, per DynamoDB shard. The parallelization issue you are facing occurs across different DynamoDB shards. By default there is one lambda invocation per shard, meaning that if you have 50 shards, you could have up to 50 lambdas trying to write to your OpenSearch cluster at any given time.
Some potential approaches:

- Set reserved concurrency on the ddbToEs lambda to cap it at some limit, i.e. add `reservedConcurrency: <your #>` to the ddbToEs function (see the sketch below).
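As a minimal sketch of that setting in a Serverless Framework `serverless.yml` (the handler path and table logical id here are assumptions for illustration, not the repo's actual definitions):

```yaml
functions:
  ddbToEs:
    handler: ddbToEs/index.handler    # assumed handler path
    # Cap concurrent executions so that at most 10 shard batches
    # are being written to the OpenSearch cluster at any one time.
    reservedConcurrency: 10
    events:
      - stream:
          type: dynamodb
          arn:
            Fn::GetAtt: [ResourceDynamoDBTable, StreamArn]  # assumed table logical id
          startingPosition: LATEST
          batchSize: 100
```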
Putting an SQS queue between DDB and OpenSearch would run into the same problem: the lambda would quickly spin up to consume the messages on the queue, so you would likely need to combine it with one of the suggestions above.
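For completeness, a sketch of capping an SQS-triggered consumer. The function and queue names are hypothetical; `maximumConcurrency` maps to the Lambda event source's scaling limit (minimum 2) and, unlike `reservedConcurrency`, throttles how fast the event source polls rather than rejecting invocations:

```yaml
functions:
  sqsToEs:
    handler: sqsToEs/index.handler          # hypothetical consumer
    events:
      - sqs:
          arn:
            Fn::GetAtt: [DdbToEsQueue, Arn] # hypothetical queue logical id
          batchSize: 10
          # Limit how many concurrent lambdas the event source mapping
          # may spin up, so the queue drains at a controlled rate.
          maximumConcurrency: 5
```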
Closing due to no response.
Problem statement
During large batch updates, Elasticsearch sometimes can't keep up when many updates are performed in parallel through the existing mechanism, in which DynamoDB write events are received by a lambda that writes to Elasticsearch in parallel.
Proposed change
The lambda triggered by DynamoDB write events should just write each event into an SQS message queue instead of trying to sync the data directly to Elasticsearch. (Maybe DynamoDB can be configured so the write events are fed into SQS without the need for a lambda.) Another lambda then picks the messages up from SQS and sends the updates to Elasticsearch.

The main point of using a message queue is to decouple the DynamoDB write events from the writes to Elasticsearch, which tend to be slower. To accommodate Elasticsearch's slower processing speed, the consuming lambda needs to throttle its message consumption rate, and the message queue needs to be able to retain all unprocessed messages, a number that could grow quite large. Eventually, though, all messages should be processed and every write event sent to Elasticsearch.
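Sketching the producer side of that proposal in the same `serverless.yml` style (all names hypothetical): a forwarder lambda subscribes to the DynamoDB stream and only enqueues, and the queue's retention is set to the SQS maximum of 14 days so a large backlog of unprocessed messages survives until a throttled consumer, like the one sketched above, drains it:

```yaml
functions:
  ddbToSqs:
    handler: ddbToSqs/index.handler   # hypothetical: forwards each stream record to SQS
    events:
      - stream:
          type: dynamodb
          arn:
            Fn::GetAtt: [ResourceDynamoDBTable, StreamArn]  # assumed table logical id
          startingPosition: LATEST
          batchSize: 100              # enqueueing is cheap, so larger batches are fine

resources:
  Resources:
    DdbToEsQueue:
      Type: AWS::SQS::Queue
      Properties:
        MessageRetentionPeriod: 1209600  # 14 days, the SQS maximum
        VisibilityTimeout: 300           # should exceed the consumer lambda's timeout
```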