apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Support serialization of Arrow files on disk without the identifier "Feather" #38515

Open jason-s opened 10 months ago

jason-s commented 10 months ago

Describe the enhancement requested

The documentation for the Arrow Columnar Format suggests that the separate Feather project has been subsumed into Arrow, and that Feather is really just the canonical serialization format for Arrow tables:

We recommend the “.arrow” extension for files created with this format. Note that files created with this format are sometimes called “Feather V2” or with the “.feather” extension, the name and the extension derived from “Feather (V1)”, which was a proof of concept early in the Arrow project for language-agnostic fast data frame storage for Python (pandas) and R.

The Python support for Arrow serialization still uses the identifier feather (see the Cookbook):

Once we have a table, it can be written to a Feather File using the functions provided by the pyarrow.feather module

import pyarrow.feather as ft

ft.write_feather(table, 'example.feather')

This functionality should be kept as is for backwards compatibility, but I wonder whether the pyarrow module should simply have a write() function, so that users do not need to import the pyarrow.feather module or use the term feather at all. This would help reduce confusion about file extensions and about the relationship between "Arrow" and "Feather".

Component(s)

Python

jason-s commented 10 months ago

See also https://github.com/apache/arrow-cookbook/issues/329