edanalytics / earthmover

CLI tool for transforming collections of tabular source data into a variety of text-based data formats via YAML configuration and Jinja templates.
Apache License 2.0

Fix escape chars in output when `linearize: False` #98

Closed tomreitz closed 3 months ago

tomreitz commented 3 months ago

This PR fixes a bug introduced in earthmover 0.2.0. Destination files were written out using `Dask.to_csv()` together with some "invisible" escape characters. However, when a destination declared `linearize: False` (which prevents newline characters from being stripped out of each rendered Jinja template), these "invisible" characters were present in the output file.

Now, just as we use `render_row()` to render a Jinja template for each row of a dataframe, we use `write_row()` to append each rendered row to an open file handle.
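For readers unfamiliar with the pattern, here is a minimal, stdlib-only sketch of the render-then-append idea. It is not earthmover's actual code: the function bodies, their signatures, and the sample rows are assumptions for illustration, and a plain Python function stands in for Jinja rendering. The handling of `linearize` (dropping newlines when true) is likewise an assumption based on the description above.

```python
import io


def render_row(row: dict) -> str:
    # Hypothetical stand-in for rendering a Jinja template for one row.
    body = ",\n".join(f'    "{k}": "{str(v).strip()}"' for k, v in row.items())
    return "{\n" + body + "\n}"


def write_row(handle, rendered: str, linearize: bool) -> None:
    # Assumed behaviour: linearize=True collapses the record onto one line;
    # linearize=False keeps the template's newlines untouched.
    if linearize:
        rendered = rendered.replace("\n", "")
    # Append directly to the open handle, so no CSV quoting or
    # escape characters are involved.
    handle.write(rendered + "\n")


rows = [{"id": 1, "name": " Ada "}, {"id": 2, "name": "Grace"}]
buffer = io.StringIO()  # stands in for an open destination file
for row in rows:
    write_row(buffer, render_row(row), linearize=False)
output = buffer.getvalue()
```

Because each record is written as-is, nothing needs to be escaped on the way out and nothing is left behind in the file.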

I tested performance by loading a 31MB, 1M-row, 3-column CSV file and immediately writing it to a destination with the following template:

{
    {% for key, value in __row_data__.items() -%}
    "{{key}}": "{{value|trim}}"{% if not loop.last %},{% endif %}
    {% endfor %}
}

Results:

So the new method is at least as fast as `.to_csv()`. I also checked the output file for malformed lines (in case different Dask processes writing to the same file somehow conflicted), but it looked fine.
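One way such a malformed-output check could be automated, assuming the destination holds back-to-back JSON objects as the template above produces, is to decode records one after another and count them. This is only a sketch, not the check actually used for this PR; `count_records` is a hypothetical helper.

```python
import json


def count_records(text: str) -> int:
    """Count consecutive JSON values in text; raises json.JSONDecodeError if any is malformed."""
    decoder = json.JSONDecoder()
    pos, count = 0, 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        _, pos = decoder.raw_decode(text, pos)
        count += 1
    return count


sample = '{\n    "a": "1"\n}\n{\n    "a": "2"\n}\n'
record_count = count_records(sample)
```

Comparing the returned count with the number of source rows would also reveal records that were dropped or interleaved by concurrent writers.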