elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

ApiError(429, 'TOO_MANY_REQUESTS') while indexing the records in Elasticsearch #856

Closed prashant-elastic closed 1 month ago

prashant-elastic commented 1 year ago

Bug Description

Getting a 429 Too Many Requests error while indexing SharePoint documents (a large data set, 2.5M) into Elasticsearch.

To Reproduce

Steps to reproduce the behavior:

  1. Take the latest code from the main branch of the GitHub connectors-python repository
  2. Make all the necessary configuration changes in the config.yml file
  3. In Kibana, create an index > go to the Configuration tab > make the config changes related to the SharePoint connector

Expected behavior

All records should be properly indexed in Elasticsearch

Actual behavior

The sync fails with a 429 Too Many Requests error, and the sync status is Sync failure:

ApiError(429, "{'_shards': {'total': 2, 'successful': 0, 'failed': 2, 'failures': [{'shard': 0, 'index': '.elastic-connectors-v1', 'status': 'TOO_MANY_REQUESTS', 'reason': {'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [indices:admin/refresh[s]] would be [1051923900/1003.1mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1051923680/1003.1mb], new bytes reserved: [220/220b], usages [model_inference=0/0b, inflight_requests=32840454/31.3mb, request=0/0b, fielddata=247/247b, eql_sequence=0/0b]', 'bytes_wanted': 1051923900, 'bytes_limit': 1020054732, 'durability': 'TRANSIENT'}}]}}")

Only 138440 docs got indexed out of 250000 records

Screenshots

[Screenshot: Screen Shot 2023-05-03 at 5.22.47 PM]

Note: This test was executed against SharePoint Server. Attaching log files for reference.

prashant-elastic commented 1 year ago

We also checked this on the 8.8 branch on GitHub and faced the same issue.

danajuratoni commented 1 year ago

cc: @vidok

artem-shelkovnikov commented 1 year ago

We don't handle this sort of throttling when uploading to Elasticsearch. The error says that Elasticsearch has used all the memory available for ingesting data and cannot ingest more for now, so we need to wait.

We need to add this error handling to the framework.
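
For illustration only, a minimal sketch of what such handling could look like with the elasticsearch-py 8.x async client; the function name and retry defaults below are hypothetical, not the framework's actual code:

```python
# Sketch: back off and retry when Elasticsearch answers 429 because its
# circuit breaker tripped. Not the connector framework's implementation.
import asyncio

from elasticsearch import ApiError, AsyncElasticsearch


async def bulk_with_backoff(client: AsyncElasticsearch, operations,
                            max_retries: int = 5, initial_delay: float = 2.0):
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return await client.bulk(operations=operations)
        except ApiError as e:
            # Anything other than a 429, or running out of attempts, is re-raised.
            if e.meta.status != 429 or attempt == max_retries:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff between attempts
```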

artem-shelkovnikov commented 1 year ago

For now, if you need to go on with your testing, just increase the memory available to Elasticsearch to double your current value (I see you're allocating 1GB of RAM to Elasticsearch, which is too little).

artem-shelkovnikov commented 1 year ago

@danajuratoni the problem is not SharePoint-specific either; it's a framework issue.

prashant-elastic commented 1 year ago

Hey @artem-shelkovnikov, please find the attached screenshot, which shows the configuration of the Elasticsearch Cloud deployment we used for testing. Do you recommend an instance with a different configuration?

[Screenshot: Elasticsearch Cloud deployment configuration]

artem-shelkovnikov commented 1 year ago

Hi @prashant-elastic, indeed: you can see that the master node is 1GB. You need to choose a configuration with a bigger master node if you want the error to go away while we're addressing the problem.

prashant-elastic commented 1 year ago

Hi @artem-shelkovnikov, sure, I will try configuring an instance with a bigger master node.

prashant-elastic commented 1 year ago

Hey @artem-shelkovnikov, we tried looking for a way to configure an instance with a bigger master node but had no luck. Can you please let us know where to configure this?

ppf2 commented 1 year ago

Try using 3 zones with 2GB per zone.

I thought we had some form of retry policy in place for backpressure from Elasticsearch on bulk indexing. Based on the attached log file, is the bug here that the general retry mechanism is not working at the framework level?

artem-shelkovnikov commented 1 year ago

> I thought we had some form of retry policy in place for backpressure from Elasticsearch on bulk indexing. Based on the attached log file, is the bug here that the general retry mechanism is not working at the framework level?

I think we don't have one at all, or it's broken.

ppf2 commented 1 year ago

Seems like it retries 3 times with no delay on everything except for conflict errors and gives up?

artem-shelkovnikov commented 1 year ago

It retries 3 times only for conflict errors: only ConflictError is caught in the except block; all other errors are raised immediately.
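
Roughly the pattern being described, as an illustration only (not the actual framework code):

```python
# Illustration of the behaviour described above: only version conflicts are
# retried, so a 429 aborts the sync on its first occurrence.
from elasticsearch import ConflictError

RETRIES = 3


async def index_with_retries(client, index, doc_id, doc):
    for _ in range(RETRIES):
        try:
            return await client.index(index=index, id=doc_id, document=doc)
        except ConflictError:
            # Only version conflicts are retried (up to RETRIES times);
            # any other ApiError, such as a 429, is not caught and
            # propagates immediately, failing the whole sync.
            continue
    raise RuntimeError(f"Still conflicting after {RETRIES} attempts")
```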

ppf2 commented 1 year ago

Here's an example retry policy for Elasticsearch bulk requests to consider:
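
For instance, a sketch of one such policy using the retry options built into elasticsearch-py's bulk helpers, which retry actions rejected with 429 using exponential backoff; the index name, document shape, and numbers below are illustrative only:

```python
# Sketch: let the bulk helper retry 429-rejected actions with exponential backoff.
from elasticsearch import AsyncElasticsearch
from elasticsearch.helpers import async_streaming_bulk


async def index_documents(client: AsyncElasticsearch, docs, index_name: str):
    actions = (
        {"_index": index_name, "_id": doc["id"], "_source": doc} for doc in docs
    )
    async for ok, item in async_streaming_bulk(
        client,
        actions,
        max_retries=8,       # retry 429-rejected actions up to 8 times
        initial_backoff=2,   # wait 2s before the first retry ...
        max_backoff=600,     # ... doubling each time, capped at 10 minutes
        raise_on_error=False,
    ):
        if not ok:
            print("Failed to index document:", item)
```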

artem-shelkovnikov commented 1 month ago

Closing, as we've updated the backpressure logic to retry transparently.