Support PyArrow arrays and dataframes

weiji14 commented 10 months ago

Description of the desired feature

Apache Arrow is an in-memory format that is starting to become a common exchange format between different libraries in Python and other programming languages. For example:

- Pandas 3.0 will use pyarrow.string instead of object dtype for strings (see PDEP-10), so we will eventually need to support PyArrow, at least for string dtypes.
- Polars DataFrames can be zero-copy converted to PyArrow via the `__dataframe__` protocol, see https://arrow.apache.org/docs/python/interchange_protocol.html

This issue is to track compatibility and support of different PyArrow data types in PyGMT:

| Dtype | Implementation PR | Status | Notes |
|---|---|---|---|
| Numerical (uint/int/float) | #2774 | :white_check_mark: | |
| String | #2933 | :construction: | May require modifying the `put_strings` method that currently uses `np.char.encode` |
| Date/Time | #2845 | :construction: | May require modifying `array_to_datetime` that expects Python datetime or numpy-backed arrays, xref #242 |
| Duration | TODO | :x: | https://arrow.apache.org/docs/13.0/python/generated/pyarrow.duration.html, wait for #2884 also |
| Special case: `geopandas.GeoDataFrame` with PyArrow dtype columns | TODO | :x: | See https://github.com/GenericMappingTools/pygmt/pull/2774#discussion_r1413006621 |
| GeoArrow geometry | TODO | :x: | https://github.com/geoarrow/geoarrow-python |

The simplest way to integrate would be to just handle PyArrow-backed pandas.DataFrame objects, as tracked in the table above.

Alternatively, we could discuss using PyArrow as the internal array representation (which would make pyarrow a hard dependency), since it may allow better interoperability with other Python libraries that use Arrow; this might be relevant for #1318 and #2731. My thought is to do this through the `__dataframe__` protocol, see https://arrow.apache.org/docs/python/interchange_protocol.html

Are you willing to help implement and maintain this feature?

Yes, but help is welcome too!

seisman commented 9 months ago

Before supporting pyarrow-backed pandas objects as you're doing in PRs #2774 and #2845, maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.

weiji14 commented 9 months ago

> maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.

Sure, I'd love to have direct support for PyArrow arrays too. I started with pyarrow-backed pandas objects because pandas 3.0 will eventually use PyArrow for string columns by default, but there's no reason we can't support passing a pyarrow.array object directly into PyGMT.

weiji14 commented 9 months ago

> check/support passing pyarrow arrays directly to PyGMT

Just opened a PR for this at #2864. Surprisingly, most PyGMT functions already work with pyarrow.array or pyarrow.table without any modification (I've tested blockm, info, nearneighbor, project, triangulate, xyz2grd so far), possibly because PyGMT can convert them internally to numpy.array (see e.g. https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy). Will need to test out more complicated dtypes and check for edge cases, but it's looking promising!

GenericMappingTools / pygmt

Support PyArrow arrays and dataframes #2800
