elastic / connectors

Official Elastic connectors for third-party data sources
https://www.elastic.co/guide/en/elasticsearch/reference/master/es-connectors.html

BadRequestError: failed to parse field [indexed_document_volume] of type [integer] #735

Closed prashant-elastic closed 1 year ago

prashant-elastic commented 1 year ago

Bug Description

BadRequestError: 400, 'mapper_parsing_exception' failed to parse field [indexed_document_volume] of type [integer]

To Reproduce

Steps to reproduce the behavior:

  1. Create an index in Elasticsearch
  2. Make the necessary changes in the config.yml file
  3. Execute the make run command to start the connector
  4. Go to the configuration tab in the Kibana UI and set the SharePoint connector configurations
  5. Observe the sync in progress and wait for it to complete

Expected behavior

All SharePoint documents should be successfully indexed in Elasticsearch.

Actual behavior

BadRequestError: 400, 'mapper_parsing_exception' failed to parse field [indexed_document_volume] of type [integer]

Screenshots

[screenshot: mapper_parsing error]

Environment

Additional context

[FMWK][12:16:55][INFO] Fetcher <create: 49099 |update: 0 |delete: 0>
Exception in callback ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)
handle: <Handle ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)>
Traceback (most recent call last):
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 131, in execute
    await self._sync_done(sync_status=sync_status, sync_error=fetch_error)
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 170, in _sync_done
    await self.sync_job.done(ingestion_stats=ingestion_stats)
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 237, in done
    await self._terminate(
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 275, in _terminate
    await self.index.update(doc_id=self.id, doc=doc)
  File "/home/ubuntu/es-connectors/connectors/es/index.py", line 72, in update
    await self.client.update(
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/__init__.py", line 4513, in update
    return await self.perform_request(  # type: ignore[return-value]
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/_base.py", line 321, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', "failed to parse field [indexed_document_volume] of type [integer] in document with id 'XmfoS4cBVSm7nRdw6PPq'. Preview of field's value: '13167265024'")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/ubuntu/es-connectors/connectors/utils.py", line 318, in _callback
    raise task.exception()
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 135, in execute
    await self._sync_done(sync_status=JobStatus.ERROR, sync_error=e)
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 164, in _sync_done
    await self.sync_job.fail(sync_error, ingestion_stats=ingestion_stats)
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 242, in fail
    await self._terminate(
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 275, in _terminate
    await self.index.update(doc_id=self.id, doc=doc)
  File "/home/ubuntu/es-connectors/connectors/es/index.py", line 72, in update
    await self.client.update(
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/__init__.py", line 4513, in update
    return await self.perform_request(  # type: ignore[return-value]
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/_base.py", line 321, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', "failed to parse field [indexed_document_volume] of type [integer] in document with id 'XmfoS4cBVSm7nRdw6PPq'. Preview of field's value: '13167265024'")

artem-shelkovnikov commented 1 year ago

cc @wangch079

wangch079 commented 1 year ago

Hi @prashant-elastic , may I know which version/branch you are running?

parth-elastic commented 1 year ago

We checked on the main branch with Elastic v8.7.0 on Cloud.

parth-elastic commented 1 year ago

Also, please note that the issue occurs when working with larger data sets (~10 GB and ~49,000 objects). For small and medium data sets it works fine.

wangch079 commented 1 year ago

Also, please note that the issue occurs when working with larger data sets (~10 GB and ~49,000 objects).

Can I get a copy of the data set?

akanshi-elastic commented 1 year ago

Also, please note that the issue occurs when working with larger data sets (~10 GB and ~49,000 objects).

Can I get a copy of the data set?

Shared with you on Slack 1:1.

khusbu-crest commented 1 year ago

@wangch079 Is there any update on this issue? We are blocked on checking the performance of the connectors because of it. While indexing large data sets (~14k documents), this issue appears and the script execution gets interrupted, so we are unable to complete the performance testing.

ppf2 commented 1 year ago

I can reproduce this on 8.7.0 against a large-ish MySQL dataset here.

[FMWK][10:17:48][INFO] Fetcher <create: 1731013 |update: 336386 |delete: 0>
[FMWK][10:17:48][INFO] Fetcher <create: 1731113 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Task 1 - Sending a batch of 1000 ops -- 0.7MiB
[FMWK][10:17:48][INFO] Fetcher <create: 1731213 |update: 336386 |delete: 0>
[FMWK][10:17:48][INFO] Fetcher <create: 1731313 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Bulker stats - no. of docs indexed: 2051841, volume of docs indexed: 3641845800 bytes, no. of docs deleted: 0
[FMWK][10:17:48][INFO] Fetcher <create: 1731413 |update: 336386 |delete: 0>
[FMWK][10:17:48][INFO] Fetcher <create: 1731513 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Polling every 30 seconds
[FMWK][10:17:48][INFO] Fetcher <create: 1731613 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Task 1 - Sending a batch of 1000 ops -- 0.7MiB
[FMWK][10:17:48][DEBUG] Connector UAhYeIcB4DIAFu1tTlw1 natively supported
[FMWK][10:17:48][DEBUG] Sending heartbeat for connector UAhYeIcB4DIAFu1tTlw1
[FMWK][10:17:48][INFO] Fetcher <create: 1731713 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Connector status is Status.ERROR
[FMWK][10:17:48][DEBUG] Filtering of connector UAhYeIcB4DIAFu1tTlw1 is in state valid, skipping...
[FMWK][10:17:48][DEBUG] scheduler is disabled
[FMWK][10:17:48][DEBUG] Scheduling is disabled for connector UAhYeIcB4DIAFu1tTlw1
[FMWK][10:17:48][INFO] Fetcher <create: 1731813 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Bulker stats - no. of docs indexed: 2052341, volume of docs indexed: 3642709800 bytes, no. of docs deleted: 0
[FMWK][10:17:48][CRITICAL] Connector job (ID: aJPBgIcBEquqKqFm0VQP) is not running but in status of JobStatus.ERROR.
Traceback (most recent call last):
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 149, in execute
    await self.check_job()
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 287, in check_job
    raise ConnectorJobNotRunningError(self.job_id, self.sync_job.status)
connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: aJPBgIcBEquqKqFm0VQP) is not running but in status of JobStatus.ERROR.
[FMWK][10:17:48][INFO] Task is canceled, stop Fetcher...
[FMWK][10:17:48][INFO] Fetcher is stopped.
[FMWK][10:17:48][INFO] Task is canceled, stop Bulker...
[FMWK][10:17:48][INFO] Bulker is stopped.
Exception in callback ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)
handle: <Handle ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)>
Traceback (most recent call last):
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 149, in execute
    await self.check_job()
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 287, in check_job
    raise ConnectorJobNotRunningError(self.job_id, self.sync_job.status)
connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: aJPBgIcBEquqKqFm0VQP) is not running but in status of JobStatus.ERROR.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "<path>/connectors-python/connectors/utils.py", line 315, in _callback
    raise task.exception()
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 162, in execute
    await self._sync_done(sync_status=JobStatus.ERROR, sync_error=e)
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 196, in _sync_done
    await self.sync_job.fail(sync_error, ingestion_stats=ingestion_stats)
  File "<path>/connectors-python/connectors/byoc.py", line 240, in fail
    await self._terminate(
  File "<path>/connectors-python/connectors/byoc.py", line 273, in _terminate
    await self.index.update(doc_id=self.id, doc=doc)
  File "<path>/connectors-python/connectors/es/index.py", line 71, in update
    return await self.client.update(
  File "<path>/connectors-python/lib/python3.10/site-packages/elasticsearch/_async/client/__init__.py", line 4586, in update
    return await self.perform_request(  # type: ignore[return-value]
  File "<path>/connectors-python/lib/python3.10/site-packages/elasticsearch/_async/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', "failed to parse field [indexed_document_volume] of type [integer] in document with id 'aJPBgIcBEquqKqFm0VQP'. Preview of field's value: '3642709800'")

@danajuratoni Would be nice to address this one before GA.

artem-shelkovnikov commented 1 year ago

I think we're overflowing the integer field for indexed_document_volume and need a bigger data type to be able to store the bytes. Additionally we can switch to storing KB instead of B.
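For reference, a quick back-of-the-envelope check (using the value from the traceback above) shows why the mapping rejects the update; Elasticsearch's integer field type is a signed 32-bit integer:

# Worked check of the value from the error message against the maximum
# of Elasticsearch's signed 32-bit `integer` field type.
ES_INTEGER_MAX = 2**31 - 1                 # 2,147,483,647

indexed_document_volume = 13_167_265_024  # bytes, from the mapper_parsing_exception above

print(indexed_document_volume > ES_INTEGER_MAX)  # True -> the update is rejected
print(indexed_document_volume / (1024**3))       # ~12.26 GiB of indexed data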

wangch079 commented 1 year ago

As @artem-shelkovnikov pointed out, we use integer for indexed_document_volume, which only supports a maximum of 2^31-1. Since indexed_document_volume stores a number of bytes, the maximum it can record is only a bit over 2 GB. We can change it to unsigned_long, which supports a maximum of 2^64-1, or around 18 exabytes.
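As an illustration only (not the change that ultimately shipped, see below), widening the field would mean declaring it as unsigned_long in the index mapping. Note that Elasticsearch does not allow changing the type of an existing field in place, so this would need a new index or a reindex. A minimal sketch with the async Python client, using a hypothetical index name:

# Sketch only: create a (hypothetical) jobs index whose volume field uses
# `unsigned_long` instead of `integer`. Existing indices would have to be
# reindexed, since a field's type cannot be changed in place.
from elasticsearch import AsyncElasticsearch

async def create_jobs_index(client: AsyncElasticsearch) -> None:
    await client.indices.create(
        index="my-connector-sync-jobs",  # hypothetical name, not the real system index
        mappings={
            "properties": {
                "indexed_document_volume": {"type": "unsigned_long"},
            }
        },
    )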

cc @danajuratoni This will make any connector syncing a source with more than 2 GB of data fail in Ruby (since 8.6) and Python (since 8.7.1). Do you think we should document it?

8.7.1 is not released yet, but I don't think this can be considered a blocker. We could fix it in 8.8

wangch079 commented 1 year ago

Regarding the issue @ppf2 reported: this happens because the job did not see any update for more than 60 seconds (it is supposed to receive a heartbeat every 10 seconds) and was marked as errored, even though the job was actually still running.

This can happen when the job reporting task gets no chance to run for more than 60 seconds, which is rare. I will look into it separately.
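A minimal sketch of the idle check being described (assumed names and thresholds, not the framework's actual implementation): a job that has not reported progress within the timeout is treated as no longer running, even if work is still going on.

# Sketch of the described behavior, with assumed names and values: the job is
# expected to report roughly every HEARTBEAT_INTERVAL seconds; if nothing has
# been reported for longer than IDLE_TIMEOUT, it gets marked as errored even
# though the sync may still be making progress.
import time

HEARTBEAT_INTERVAL = 10  # seconds between expected progress reports
IDLE_TIMEOUT = 60        # seconds of silence before the job is considered idle

def job_looks_idle(last_update_at: float, now: float | None = None) -> bool:
    now = time.time() if now is None else now
    return (now - last_update_at) > IDLE_TIMEOUT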

seanstory commented 1 year ago

Based on this Slack thread, I'm reverting the above 3 PRs.

Instead, in a separate set of PRs, Chenhui and I will change these fields from representing byte counts to representing MB counts. This raises our limit from ~2GB to ~2PB, which seems much less likely to constrain us.
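Roughly, the arithmetic behind that change (a sketch of the idea, not the code from the PRs): keeping the same integer field but storing megabytes instead of bytes multiplies the representable volume by 1024 * 1024.

# Rough arithmetic behind storing MB instead of bytes in the same `integer` field.
ES_INTEGER_MAX = 2**31 - 1                    # 2,147,483,647

max_volume_bytes = ES_INTEGER_MAX             # ~2 GB when the field stores bytes
max_volume_mb = ES_INTEGER_MAX * 1024 * 1024  # ~2 PB of data when the field stores MB

def bytes_to_mb(volume_in_bytes: int) -> int:
    # convert before writing ingestion stats, so large syncs still fit
    return volume_in_bytes // (1024 * 1024)

print(bytes_to_mb(13_167_265_024))            # 12557 MB -- well within range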

wangch079 commented 1 year ago

Regarding this issue: https://github.com/elastic/connectors-python/issues/735#issuecomment-1510993433, I tested locally but I can't reproduce it. I guess the sync somehow got stuck for more than 60 seconds, causing the job to be marked as idle.

wangch079 commented 1 year ago

Closing this issue as all the fixes have been merged.