elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Continue the work on batched csv output #579

Closed bartbroere closed 10 months ago

bartbroere commented 1 year ago

In PR #450, @V1NAY8 started working on chunked CSV output to solve issue #449.

Since this is a feature I could really use, I continued the work that was started there, trying to work in some of the suggestions made in the other PR.

A lot has been discussed already in the other PR, but the main goal is to reduce memory usage. Right now, exporting to CSV first converts the entire Elasticsearch index behind the eland DataFrame to a pandas DataFrame, and only then calls to_csv. This requires a lot of memory. After this PR, the export is done with multiple calls to to_csv instead: the first call creates the file, and every later call uses append mode (mode="a"). This should give a lower peak memory usage, since only one chunk has to be held in memory at a time.
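To illustrate the idea, here is a minimal sketch of the append-mode pattern using plain pandas. It is not the code from this PR: `write_csv_in_chunks` and `fake_index_chunks` are hypothetical names, and the generator only stands in for paging through an Elasticsearch index.

```python
import os
import tempfile

import pandas as pd


def write_csv_in_chunks(chunks, path):
    """Write an iterable of pandas DataFrames to a single CSV file.

    Hypothetical helper: the first chunk creates the file and writes the
    header, every later chunk is appended with mode="a" and no header.
    """
    first = True
    for chunk in chunks:
        chunk.to_csv(
            path,
            mode="w" if first else "a",
            header=first,
            index=False,
        )
        first = False


def fake_index_chunks(n_rows, chunk_size):
    """Hypothetical stand-in for reading an index one batch at a time."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield pd.DataFrame(
            {
                "id": list(range(start, stop)),
                "value": [i * 2 for i in range(start, stop)],
            }
        )


# Only one chunk of at most 4 rows is materialised at any time.
path = os.path.join(tempfile.mkdtemp(), "export.csv")
write_csv_in_chunks(fake_index_chunks(n_rows=10, chunk_size=4), path)
result = pd.read_csv(path)
```

Writing the header only on the first call matters here: with the default header=True on every call, the column names would be repeated in the middle of the file.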

In a bit I'll be testing it with a larger index, to see if these assumptions hold up and everything works as expected.

elasticmachine commented 1 year ago

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

cla-checker-service[bot] commented 1 year ago

💚 CLA has been signed

bartbroere commented 1 year ago

@sethmlarson Would you (or a different maintainer) be willing to review this change?

pquentin commented 11 months ago

Hello! Sorry for the lack of feedback. I'm going to help maintain Eland going forward, so feel free to ping me directly. I'll take a look at this next week.

buildkite test this please

pquentin commented 11 months ago

buildkite test this please

pquentin commented 10 months ago

I will merge from main and rerun tests when https://github.com/elastic/eland/pull/627 is merged.

pquentin commented 10 months ago

buildkite test this please