elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Possible issue with eland.DataFrame.value_counts(): the counts are missing some values #643

Open mumuwithw opened 9 months ago

mumuwithw commented 9 months ago

I tried using eland to read data from two data streams with es_index_pattern=["*java.backend*", "*h3c*"]. The field 'data_stream.dataset' holds the name of the data stream each document belongs to; in this example its values are 'h3c' and 'java.backend'. When I print the DataFrame with 'df', I can see 'h3c' rows in the output, but when I call value_counts() on this field, only 'java.backend' appears. I'm not sure whether this is a bug, because I saw a warning about this field when creating the eland.DataFrame.

The code and its output are as follows:

>>> import eland as ed
>>> from elasticsearch import Elasticsearch
>>> import pandas as pd
>>> escli = Elasticsearch(
...         hosts="https://******",
...         basic_auth=("elastic", "***"),
...         ca_certs='./http_ca.crt',
...     )
>>> df = ed.DataFrame(
...     escli,
...     es_index_pattern=["*java.backend*", "*h3c*"],
...     columns=['@timestamp', 'message', 'data_stream.dataset'],
...     es_index_field='@timestamp'
...     )

# here is the warning mentioned above
......
xxxx\lib\site-packages\eland\field_mappings.py:327: UserWarning: Field data_stream.dataset has conflicting types ('constant_keyword', None) != text
......

# here 'data_stream.dataset' contains both 'h3c' and 'java.backend'
>>> df
                                                     @timestamp  ... data_stream.dataset
2012-12-31T23:59:33.000+08:00         2012-12-31 23:59:33+08:00  ...                 h3c
2012-12-31T23:59:33.000+08:00         2012-12-31 23:59:33+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
...                                                         ...  ...                 ...
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:38:46.967Z       2023-12-19 07:38:46.967000+00:00  ...        java.backend

[42240705 rows x 3 columns]

# but here value_counts() only returns counts for 'java.backend'
>>> df['data_stream.dataset'].value_counts()
java.backend    42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(10) 
java.backend    42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(2)  
java.backend    42043023
Name: data_stream.dataset, dtype: int64
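
One way to narrow this down might be to run the equivalent terms aggregation directly through the Elasticsearch client and compare it with what eland reports. The following is only a minimal sketch under stated assumptions: the helper names (build_terms_agg, parse_terms_agg, count_values, unaccounted_rows) and the aggregation name "value_counts" are hypothetical and not part of eland, and it has not been confirmed that eland issues exactly this request. It also returns the number of failed shards, on the assumption that a field mapped as text in one of the indices (as the warning suggests) may make the aggregation fail on those shards.

```python
from typing import Any, Dict, Tuple


def build_terms_agg(field: str, size: int = 10) -> Dict[str, Any]:
    """Build a terms aggregation counting documents per distinct value of `field`."""
    # "value_counts" is an arbitrary aggregation name chosen for this sketch.
    return {"value_counts": {"terms": {"field": field, "size": size}}}


def parse_terms_agg(response: Any) -> Dict[str, int]:
    """Map each bucket key in a terms-aggregation response to its doc_count."""
    buckets = response["aggregations"]["value_counts"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def count_values(es: Any, index: Any, field: str, size: int = 10) -> Tuple[Dict[str, int], int]:
    """Run the aggregation and return (counts per value, number of failed shards).

    `es` is expected to behave like elasticsearch.Elasticsearch (8.x client).
    """
    # size=0 skips returning hits; only the aggregation result is needed.
    response = es.search(index=index, size=0, aggs=build_terms_agg(field, size))
    return parse_terms_agg(response), response["_shards"]["failed"]


def unaccounted_rows(total_rows: int, counts: Dict[str, int]) -> int:
    """Rows in the DataFrame that no value_counts() bucket accounts for."""
    return total_rows - sum(counts.values())
```

With the client from the transcript, this could be called as count_values(escli, ["*java.backend*", "*h3c*"], "data_stream.dataset"). For the numbers shown above, unaccounted_rows(42240705, {"java.backend": 42043023}) gives 197682 rows that value_counts() does not cover, which may or may not correspond to the 'h3c' documents.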