elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Ignore broken datetime strings on Elasticsearch #626

Open weidenka opened 11 months ago

weidenka commented 11 months ago

For me, this fixes an error caused by a single malformed timestamp (1-01-01 00:00:00) on the ES side. I don't see a disadvantage in excluding such data points from the conversion.
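
As a rough illustration of what "excluding those data points" could look like in plain pandas (this is not the code in this PR, and `parse_es_dates` is a hypothetical name), `pd.to_datetime` accepts `errors="coerce"`, which turns values it cannot convert into `NaT` instead of raising. How an out-of-range value such as year 0001 is treated may depend on the pandas version, so the sketch below only relies on a clearly unparseable string.

```python
import pandas as pd


def parse_es_dates(values):
    """Hypothetical helper: parse date strings, mapping bad values to NaT.

    errors="coerce" asks pandas to return NaT for anything it cannot
    convert instead of raising an exception.
    """
    series = pd.Series(list(values), dtype="object")
    return pd.to_datetime(series, errors="coerce", utc=True)


raw = ["2021-03-04T05:06:07+00:00", "garbage"]
parsed = parse_es_dates(raw)

# Drop the rows that could not be converted.
valid = parsed.dropna()
print(valid)
```

The trade-off is the one discussed in this thread: bad values disappear silently unless the caller inspects the `NaT` entries.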

Stacktrace

Traceback (most recent call last):
  File "/mypath/env/lib/python3.9/site-packages/eland/common.py", line 135, in elasticsearch_date_to_pandas_date
    return pd.to_datetime(
  File "/mypath/env/lib/python3.9/site-packages/pandas/core/tools/datetimes.py", line 1102, in to_datetime
    result = convert_listlike(np.array([arg]), format)[0]
  File "/mypath/env/lib/python3.9/site-packages/pandas/core/tools/datetimes.py", line 393, in _convert_listlike_datetimes
    return _to_datetime_with_unit(arg, unit, name, tz, errors)
  File "/mypath/env/lib/python3.9/site-packages/pandas/core/tools/datetimes.py", line 557, in _to_datetime_with_unit
    arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
  File "pandas/_libs/tslib.pyx", line 364, in pandas._libs.tslib.array_with_unit_to_datetime
ValueError: non convertible value 0001-01-01T00:00:00+00:00 with the unit 'ms'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mypath/scripts/score_imputer.py", line 19, in <module>
    korro_data = query_data_from_elastic(use_cache=True)
  File "/mypath/daprod_health_data/korro_data.py", line 39, in query_data_from_elastic
    df = ed.eland_to_pandas(elastic_df)
  File "/mypath/env/lib/python3.9/site-packages/eland/etl.py", line 292, in eland_to_pandas
    return ed_df.to_pandas(show_progress=show_progress)
  File "/mypath/env/lib/python3.9/site-packages/eland/dataframe.py", line 1351, in to_pandas
    return self._query_compiler.to_pandas(show_progress=show_progress)
  File "/mypath/env/lib/python3.9/site-packages/eland/query_compiler.py", line 506, in to_pandas
    return self._operations.to_pandas(self, show_progress)
  File "/mypath/env/lib/python3.9/site-packages/eland/operations.py", line 1226, in to_pandas
    for df in self.search_yield_pandas_dataframes(query_compiler=query_compiler):
  File "/mypath/env/lib/python3.9/site-packages/eland/operations.py", line 1278, in search_yield_pandas_dataframes
    df = query_compiler._es_results_to_pandas(hits)
  File "/mypath/env/lib/python3.9/site-packages/eland/query_compiler.py", line 268, in _es_results_to_pandas
    rows.append(self._flatten_dict(row, field_mapping_cache))
  File "/mypath/env/lib/python3.9/site-packages/eland/query_compiler.py", line 348, in _flatten_dict
    flatten(y)
  File "/mypath/env/lib/python3.9/site-packages/eland/query_compiler.py", line 312, in flatten
    flatten(x[a], name + a + ".")
  File "/mypath/env/lib/python3.9/site-packages/eland/query_compiler.py", line 322, in flatten
    x = elasticsearch_date_to_pandas_date(
  File "/mypath/env/lib/python3.9/site-packages/eland/common.py", line 139, in elasticsearch_date_to_pandas_date
    return pd.to_datetime(value)
  File "/mypath/env/lib/python3.9/site-packages/pandas/core/tools/datetimes.py", line 1102, in to_datetime
    result = convert_listlike(np.array([arg]), format)[0]
  File "/mypath/env/lib/python3.9/site-packages/pandas/core/tools/datetimes.py", line 438, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "/mypath/env/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py", line 2177, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 427, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 678, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 674, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 649, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslibs/np_datetime.pyx", line 212, in pandas._libs.tslibs.np_datetime.check_dts_bounds
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 00:00:00 present at position 0

Process finished with exit code 1
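
For context on the final `OutOfBoundsDatetime`: a nanosecond-resolution timestamp is a signed 64-bit count of nanoseconds since 1970-01-01, so it only covers a few hundred years, and the year 0001 value from ES falls far outside that. The sketch below derives the range with NumPy alone; `year_of` and `fits_in_ns_range` are hypothetical helper names used only for illustration.

```python
import numpy as np

# Nanosecond timestamps are signed 64-bit offsets from 1970-01-01.
# The smallest int64 value is reserved for NaT, so the usable minimum
# is one above it.
INT64 = np.iinfo(np.int64)
lowest = np.datetime64(INT64.min + 1, "ns")
highest = np.datetime64(INT64.max, "ns")

print(lowest)
print(highest)


def year_of(ts):
    """Calendar year of a numpy datetime64 scalar."""
    return int(ts.astype("datetime64[Y]").astype(int)) + 1970


def fits_in_ns_range(year):
    """Coarse check: does the whole calendar year lie inside the ns range?"""
    return year_of(lowest) < year < year_of(highest)
```

With this check, `fits_in_ns_range(1)` is false, which matches the value that triggers the error above.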

cla-checker-service[bot] commented 11 months ago

❌ Author of the following commits did not sign a Contributor Agreement: 521cf6f12041b77fb4b7051d710fc716e0a4070d

Please, read and sign the above mentioned agreement if you want to contribute to this project

weidenka commented 11 months ago

Yes, this silently ignores errors in the ES data. That sounds OK to me. I agree that adding an error parameter would be better. Do you plan to add this in the near future?
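
To make the idea of an error parameter concrete, here is a minimal sketch of how such a switch might behave. It is an assumption for illustration only: `to_pandas_date` and its `errors` argument are hypothetical and are not part of eland's API. The option names simply mirror the `errors` argument of `pd.to_datetime`.

```python
import pandas as pd


def to_pandas_date(value, errors="raise"):
    """Hypothetical converter with an explicit error policy.

    errors="raise"  -> propagate the conversion error (current behaviour)
    errors="coerce" -> return NaT for values that cannot be converted
    """
    if errors not in ("raise", "coerce"):
        raise ValueError(f"unsupported errors mode: {errors!r}")
    try:
        return pd.to_datetime(value)
    except (ValueError, TypeError):
        # pandas' parsing and out-of-bounds errors derive from ValueError.
        if errors == "raise":
            raise
        return pd.NaT
```

Keeping `"raise"` as the default would leave existing behaviour unchanged, and callers who prefer to skip bad values would opt in with `errors="coerce"`.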