elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0
25 stars 99 forks

Optimize query result serialization into DataFrames #328

Open jbkze opened 4 years ago

jbkze commented 4 years ago

Hey,

Overall I really like eland, but I noticed that creating a pandas DataFrame with eland_to_pandas() is about 4-5 times slower than the "naive" approach of iterating over scan() on an elasticsearch_dsl query.

Here is an example (~80,000 rows, 5 columns):

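import eland

# Lazy, Elasticsearch-backed DataFrame; no documents are fetched yet.
proxy_df = eland.DataFrame(es_client=self.es_client, es_index_pattern=self.index)
# Materialize the whole index as an in-memory pandas DataFrame.
df = eland.eland_to_pandas(ed_df=proxy_df)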

>> Elapsed time: 153.82s

versus

import pandas as pd
from elasticsearch_dsl import Search

s = Search(using=self.es_client, index=self.index)
# scan() streams every hit; each hit is converted to a plain dict.
df = pd.DataFrame(d.to_dict() for d in s.scan())

>> Elapsed time: 33.50s
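
For anyone who wants to reproduce the comparison, timings like the two above could be collected with a small helper along these lines. This is only a sketch: `elapsed` and `timings` are hypothetical names, not part of eland or elasticsearch_dsl, and the `sum(...)` call is a stand-in for the code being measured.

```python
import time
from contextlib import contextmanager


@contextmanager
def elapsed(label, results=None):
    """Time the enclosed block and print it in the style used above."""
    start = time.perf_counter()
    try:
        yield
    finally:
        seconds = time.perf_counter() - start
        if results is not None:
            # Keep the raw number so two runs can be compared afterwards.
            results[label] = seconds
        print(f">> Elapsed time ({label}): {seconds:.2f}s")


timings = {}
with elapsed("demo", timings):
    # Stand-in workload; replace with the eland or scan() conversion.
    total = sum(range(1000))
```

Wrapping each conversion in its own `with elapsed(...)` block keeps client setup out of the measured time.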

Is there any chance that the conversion to pandas could be accelerated?

sethmlarson commented 4 years ago

Thanks for opening this, I'll take a look and see if there are any optimizations to be made.

sethmlarson commented 3 years ago

The areas that most need optimizing are the QueryParams._flatten_dict() method and the Mapping class.
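
To illustrate the kind of per-document work involved, here is a minimal sketch of flattening nested hits into dotted column names. It is not eland's actual `_flatten_dict()` implementation; `iter_flat_items`, `flatten_hit`, and the sample hits are hypothetical and only show the general technique.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_flat_items(doc: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_key, value) pairs for a nested dict, depth first."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-objects, extending the dotted path.
            yield from iter_flat_items(value, prefix=f"{path}.")
        else:
            yield path, value


def flatten_hit(hit: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the `_source` of a single search hit into one flat row."""
    return dict(iter_flat_items(hit.get("_source", {})))


# Hypothetical hits shaped like Elasticsearch search results.
hits = [
    {"_id": "1", "_source": {"user": {"name": "a", "geo": {"lat": 1.5}}, "n": 1}},
    {"_id": "2", "_source": {"user": {"name": "b"}, "n": 2}},
]
rows = [flatten_hit(h) for h in hits]
```

Because a function like this runs once per document, any overhead inside it (extra mapping lookups, repeated string building) is multiplied by the row count, which may be why it dominates at ~80,000 rows.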