I tried using eland to read data from two data streams with es_index_pattern=["*java.backend*", "*h3c*"]. The field 'data_stream.dataset' holds the name of the data stream each document belongs to; in this example its values are 'h3c' and 'java.backend'.
When I print the dataframe with 'df', I can see 'h3c' rows in the output, but when I call value_counts() on this field, only 'java.backend' appears. I'm not sure whether this is a bug, because I saw a warning about this field when creating the eland.DataFrame.
The code and output are as follows:
>>> import eland as ed
>>> from elasticsearch import Elasticsearch
>>> import pandas as pd
>>> escli = Elasticsearch(
... hosts="https://******",
... basic_auth=("elastic", "***"),
... ca_certs='./http_ca.crt',
... )
>>> df = ed.DataFrame(
... escli,
... es_index_pattern=["*java.backend*", "*h3c*"],
... columns=['@timestamp', 'message', 'data_stream.dataset'],
... es_index_field='@timestamp'
... )
# here is the warning mentioned above
......
xxxx\lib\site-packages\eland\field_mappings.py:327: UserWarning: Field data_stream.dataset has conflicting types ('constant_keyword', None) != text
......
# here 'data_stream.dataset' contains both values, 'h3c' and 'java.backend'
>>> df
@timestamp ... data_stream.dataset
2012-12-31T23:59:33.000+08:00 2012-12-31 23:59:33+08:00 ... h3c
2012-12-31T23:59:33.000+08:00 2012-12-31 23:59:33+08:00 ... h3c
2012-12-31T23:59:48.000+08:00 2012-12-31 23:59:48+08:00 ... h3c
2012-12-31T23:59:48.000+08:00 2012-12-31 23:59:48+08:00 ... h3c
2012-12-31T23:59:48.000+08:00 2012-12-31 23:59:48+08:00 ... h3c
... ... ... ...
2023-12-19T07:00:08.730Z 2023-12-19 07:00:08.730000+00:00 ... java.backend
2023-12-19T07:00:08.730Z 2023-12-19 07:00:08.730000+00:00 ... java.backend
2023-12-19T07:00:08.730Z 2023-12-19 07:00:08.730000+00:00 ... java.backend
2023-12-19T07:00:08.730Z 2023-12-19 07:00:08.730000+00:00 ... java.backend
2023-12-19T07:38:46.967Z 2023-12-19 07:38:46.967000+00:00 ... java.backend
[42240705 rows x 3 columns]
# but here value_counts() only returns a count for 'java.backend'
>>> df['data_stream.dataset'].value_counts()
java.backend 42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(10)
java.backend 42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(2)
java.backend 42043023
Name: data_stream.dataset, dtype: int64
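One way to narrow this down, offered as a rough sketch and not a confirmed diagnosis: the warning suggests the field is mapped as 'constant_keyword' in some indices and 'text' in others. If the 'text' mapping is not aggregatable, an aggregation-based count could miss those documents; whether eland's value_counts() behaves this way is an assumption here. The helpers below summarize a field capabilities response per mapping type. The function names, the sample response, and the index names in it are hypothetical and only mirror the shape of the `_field_caps` API output.

```python
def summarize_field_caps(field_caps_response, field):
    """Return {type_name: {"aggregatable": bool, "indices": list}} for one field."""
    per_type = field_caps_response.get("fields", {}).get(field, {})
    return {
        type_name: {
            "aggregatable": bool(caps.get("aggregatable", False)),
            # "indices" is only present when the field's type differs across indices
            "indices": list(caps.get("indices", [])),
        }
        for type_name, caps in per_type.items()
    }


def non_aggregatable_types(summary):
    """List the mapping types that an aggregation could not count."""
    return sorted(t for t, info in summary.items() if not info["aggregatable"])


# Hand-written sample (hypothetical index names), shaped like a _field_caps response.
# Against a real cluster the response could be fetched with something like:
#   resp = escli.field_caps(index="*java.backend*,*h3c*", fields="data_stream.dataset")
# and then passed in as a plain dict (resp.body on 8.x clients).
sample = {
    "indices": ["hypothetical-java-backend-index", "hypothetical-h3c-index"],
    "fields": {
        "data_stream.dataset": {
            "constant_keyword": {
                "type": "constant_keyword",
                "searchable": True,
                "aggregatable": True,
                "indices": ["hypothetical-java-backend-index"],
            },
            "text": {
                "type": "text",
                "searchable": True,
                "aggregatable": False,
                "indices": ["hypothetical-h3c-index"],
            },
        }
    },
}

summary = summarize_field_caps(sample, "data_stream.dataset")
print(non_aggregatable_types(summary))  # prints ['text']
```

If the real response lists the 'h3c' indices under a non-aggregatable type, that would be consistent with the missing 'h3c' count above, though it would not by itself prove the cause.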