This PR fixes a bug introduced in earthmover 0.2.0. Destination files were written with Dask.to_csv(), which relied on some "invisible" escape characters. However, when a destination declared linearize: False (which prevents newline characters from being stripped out of each rendered Jinja template), these "invisible" characters ended up in the output file.
Now, just as we use render_row() to render a Jinja template for each row of a dataframe, we use write_row() to append each rendered row to an open file handle.
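To make the approach concrete, here is a minimal, standard-library-only sketch of the idea, not the actual earthmover code. The bodies of render_row() and write_row(), the write_partition() helper, and the sample data are all assumptions for illustration: rendering is faked with string formatting instead of Jinja, and partitions are simulated as plain lists of dicts rather than passed through Dask's map_partitions().

```python
import io


def render_row(row, linearize=True):
    """Hypothetical stand-in for the Jinja render: format one row as a JSON-like string."""
    body = ",\n".join(
        f'  "{key}": "{str(value).strip()}"' for key, value in row.items()
    )
    rendered = "{\n" + body + "\n}"
    if linearize:
        # Collapse the rendered text onto a single line by stripping newlines.
        rendered = rendered.replace("\n", "")
    return rendered


def write_row(handle, rendered):
    """Append one rendered row to an already-open file handle."""
    handle.write(rendered + "\n")


def write_partition(rows, handle, linearize=True):
    """Render and write every row of one partition; return the row count."""
    for row in rows:
        write_row(handle, render_row(row, linearize))
    return len(rows)


# Simulated partitions (assumption: the real code would receive dataframe partitions).
partitions = [
    [{"id": 1, "name": " Ada "}, {"id": 2, "name": "Grace"}],
    [{"id": 3, "name": "Edsger"}],
]

buffer = io.StringIO()  # stands in for the open destination file
rows_written = sum(write_partition(p, buffer) for p in partitions)
output = buffer.getvalue()
```

Because each row is written directly as text, no escape characters need to be added and later removed, which is why linearize: False no longer leaks them into the output in this kind of design.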
I tested performance by loading a 31MB, 1M-row, 3-column CSV file and immediately writing it to a destination with the following template:
{
{% for key, value in __row_data__.items() -%}
"{{key}}": "{{value|trim}}"{% if not loop.last %},{% endif %}
{% endfor %}
}
Results:
old .to_csv() method with linearize: True took 636s
old .to_csv() method with linearize: False took 456s
new map_partitions() method with linearize: True took 441s
new map_partitions() method with linearize: False took 431s
So the new method is at least as fast as .to_csv(). I also checked the output file for malformed lines (in case different Dask processes writing to the same file conflicted somehow), but it looked fine.
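One way such a malformed-line check could be automated is sketched below. It assumes the output was written with linearize: True, so that each line should be one complete JSON object; find_malformed_lines() and the sample lines are hypothetical and are not part of earthmover.

```python
import json


def find_malformed_lines(lines):
    """Return the 1-based numbers of lines that are not valid JSON on their own."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


# Sample data: line 2 imitates two writers interleaving their output.
sample = [
    '{"id": "1", "name": "Ada"}\n',
    '{"id": "2", "name": "Gra{"id": "3", "name": "Edsger"}\n',
    '{"id": "4", "name": "Alan"}\n',
]
malformed = find_malformed_lines(sample)
```

For a real output file, the same function can be called on an open file handle, since iterating over a text file yields its lines one at a time.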