elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Implement eland.DataFrame.to_json #661

Closed bartbroere closed 4 months ago

bartbroere commented 5 months ago

Dumping an Elastic index to CSV was the right solution for me at first. After a while, CSV did not offer all the guarantees I was looking for: for example, a CSV record can span multiple lines if any of its values contain line breaks.
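To illustrate the line-break issue, here is a minimal sketch using only the Python standard library. The record and its field names are made up for the example. It writes the same record once as CSV and once as JSON and counts the physical lines each one occupies.

```python
import csv
import io
import json

# Hypothetical record whose value contains a line break.
record = {"id": 1, "message": "first line\nsecond line"}

# CSV: the embedded newline is kept inside a quoted field,
# so a single record is spread over two physical lines.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=["id", "message"], lineterminator="\n"
)
writer.writerow(record)
csv_lines = csv_buffer.getvalue().splitlines()

# JSON: the newline is escaped as "\n" inside the string,
# so the serialized record stays on one physical line.
json_line = json.dumps(record)
jsonl_lines = json_line.splitlines()

print(len(csv_lines))    # physical lines used by the CSV record
print(len(jsonl_lines))  # physical lines used by the JSON record
```

A line-oriented tool (`wc -l`, `split`, a naive line reader) therefore sees one record per line in the JSON output, but not in the CSV output.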

For that reason, JSON Lines was a more suitable format. Pandas' DataFrame.to_json can generate ".jsonl" files by supplying lines=True, orient='records'. This also lets us reuse the earlier solution that streams output from Elasticsearch and appends it to a file, so the entire dataset never has to fit in memory.
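A rough sketch of the append-per-batch idea with plain pandas follows. The `batches()` generator is a hypothetical stand-in for batches streamed from Elasticsearch, and the file name is arbitrary; this is not the code from this pull request.

```python
import os
import tempfile

import pandas as pd


def batches():
    """Hypothetical stand-in for batches streamed from Elasticsearch."""
    yield pd.DataFrame({"id": [1, 2], "text": ["a", "b\nc"]})
    yield pd.DataFrame({"id": [3], "text": ["d"]})


path = os.path.join(tempfile.mkdtemp(), "dump.jsonl")

with open(path, "w", encoding="utf-8") as handle:
    for batch in batches():
        # With no path given, to_json returns the serialized text.
        text = batch.to_json(orient="records", lines=True)
        if not text:
            continue  # nothing to write for an empty batch
        handle.write(text)
        # Whether the text ends with a newline depends on the pandas
        # version, so make sure the next batch starts on a new line.
        if not text.endswith("\n"):
            handle.write("\n")
```

Only one batch is held in memory at a time, and the resulting file can be read back with `pd.read_json(path, lines=True)`.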

This pull request implements streaming an Elasticsearch index to .jsonl, and falls back to running to_pandas().to_json(...) in cases where streaming is harder to do.
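The stream-or-fall-back decision could look roughly like the sketch below. It is a simplified illustration using only the standard library, not eland's implementation: `to_json_sketch` is a hypothetical name, `batches` stands in for data streamed from Elasticsearch as lists of dicts, and the fallback branch emits a plain JSON array instead of calling pandas.

```python
import io
import json


def to_json_sketch(batches, path_or_buf=None, orient=None, lines=False):
    """Hypothetical dispatch between streaming and in-memory output.

    `batches` is an iterable of lists of dict records.
    `path_or_buf` is a writable text buffer, or None to return a string.
    """
    if lines and orient == "records" and path_or_buf is not None:
        # Streaming path: write one JSON document per line, batch by
        # batch, so the full dataset is never held in memory.
        for batch in batches:
            for record in batch:
                path_or_buf.write(json.dumps(record) + "\n")
        return None

    # Fallback path (assumption: stands in for to_pandas().to_json(...)):
    # materialize every record, then serialize in one go.
    records = [record for batch in batches for record in batch]
    text = json.dumps(records)
    if path_or_buf is None:
        return text
    path_or_buf.write(text)
    return None


# Usage: stream two batches into an in-memory buffer.
buffer = io.StringIO()
to_json_sketch([[{"a": 1}], [{"a": 2}]], buffer, orient="records", lines=True)
streamed = buffer.getvalue()

# Usage: no buffer and no lines=True, so the fallback returns a string.
fallback = to_json_sketch([[{"a": 1}]])
```

The streaming branch is only taken when the output format allows records to be written independently of each other. Any other combination of arguments needs the whole dataset at once, which is why the fallback loads everything first.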

pquentin commented 5 months ago

buildkite test this please

bartbroere commented 5 months ago

> Thanks! LGTM.

Nice! I removed the two linting errors. Sorry about that.

pquentin commented 5 months ago

buildkite test this please

pquentin commented 4 months ago

buildkite test this please