Dumping an Elasticsearch index to CSV was the right solution for me at first. After a while, CSV no longer offered all the guarantees I was looking for: a CSV record, for example, can span multiple lines if any of its values contains a line break.
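To illustrate the multi-line issue, here is a minimal sketch using only Python's standard `csv` module. The sample rows are made up for illustration and are not taken from any real index. It shows that a value containing a line break makes the number of physical lines in the file differ from the number of records.

```python
import csv
import io

# Made-up sample data: the second row holds a value with a line break.
rows = [
    ["id", "text"],
    [1, "first line\nsecond line"],
    [2, "plain"],
]

# Write the rows as CSV into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Splitting on line breaks sees more lines than there are records...
physical_lines = csv_text.splitlines()

# ...while a real CSV parser recovers the header plus two records.
parsed_rows = list(csv.reader(io.StringIO(csv_text, newline="")))

print(len(physical_lines), "physical lines,", len(parsed_rows), "parsed rows")
```

Any tool that processes such a file line by line (for example to split or count records) would therefore cut a record in half.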
For that reason, JSON lines is a more suitable format. Pandas' DataFrame.to_json can generate ".jsonl" files when called with lines=True, orient='records'. This also lets us reuse the earlier solution that streams output from Elasticsearch and appends it to a file, so the entire dataset never has to fit in memory.
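The append-as-you-go idea can be sketched roughly as follows. This is not the code from this pull request: the helper `append_jsonl` and the in-memory `chunks` list are hypothetical stand-ins for batches streamed from Elasticsearch, and only `DataFrame.to_json(orient="records", lines=True)` is the actual pandas call discussed above.

```python
import json
import os
import tempfile

import pandas as pd


def append_jsonl(chunk: pd.DataFrame, path: str) -> None:
    """Append one DataFrame chunk to a JSON lines file (hypothetical helper)."""
    text = chunk.to_json(orient="records", lines=True)
    if text and not text.endswith("\n"):
        text += "\n"  # guard so the next chunk always starts on a new line
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical stand-in for batches streamed from Elasticsearch.
chunks = [
    pd.DataFrame({"id": [1, 2], "text": ["first line\nsecond line", "plain"]}),
    pd.DataFrame({"id": [3], "text": ["another"]}),
]

path = os.path.join(tempfile.mkdtemp(), "dump.jsonl")
for chunk in chunks:
    append_jsonl(chunk, path)  # only one chunk is held in memory at a time

# Every physical line is exactly one record, even with embedded line breaks.
with open(path, encoding="utf-8") as handle:
    records = [json.loads(line) for line in handle]
```

Because the line break inside a value is escaped as `\n` within the JSON string, the file can be read back one line at a time without a format-aware parser deciding where a record ends.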
This pull request implements streaming an Elasticsearch index to .jsonl, falling back to to_pandas().to_json(...) in cases where streaming is harder to do.