GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License

Support PyArrow arrays and dataframes #2800

Open weiji14 opened 10 months ago

weiji14 commented 10 months ago

Description of the desired feature

Apache Arrow is an in-memory format that is starting to become a common exchange format between different libraries in Python and other programming languages. For example:

This issue is to track compatibility and support of different PyArrow data types in PyGMT:

| Dtype | Implementation PR | Status | Notes |
|---|---|---|---|
| Numerical (uint/int/float) | #2774 | :white_check_mark: | |
| String | #2933 | :construction: | May require modifying the put_strings method that currently uses np.char.encode |
| Date/Time | #2845 | :construction: | May require modifying array_to_datetime that expects Python datetime or numpy-backed arrays, xref #242 |
| Duration | TODO | :x: | https://arrow.apache.org/docs/13.0/python/generated/pyarrow.duration.html, wait for #2884 also |
| Special case: geopandas.GeoDataFrame with PyArrow dtype columns | TODO | :x: | See https://github.com/GenericMappingTools/pygmt/pull/2774#discussion_r1413006621 |
| GeoArrow geometry | TODO | :x: | https://github.com/geoarrow/geoarrow-python |

The simplest way of integrating would be to just handle PyArrow-backed pandas.DataFrame objects as above.

Alternatively, we can also discuss using PyArrow as the internal array representation (which would make pyarrow a hard dependency), since it may allow better interoperability with other Python libraries that use Arrow, and this might be relevant for #1318 and #2731. My thought is to do this through the __dataframe__ protocol, see https://arrow.apache.org/docs/python/interchange_protocol.html

Further reading:

Are you willing to help implement and maintain this feature?

Yes, but help is welcome too!

seisman commented 9 months ago

Before supporting pyarrow-backed pandas objects like what you're doing in PR #2774 and #2845, maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.

weiji14 commented 9 months ago

maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.

Sure, I'd love to have direct support for PyArrow arrays too. I started with pyarrow-backed pandas objects because pandas 3.0 will eventually use PyArrow for string columns by default, but there's no reason we can't support passing a pyarrow.array object directly into PyGMT.

weiji14 commented 9 months ago

check/support passing pyarrow arrays directly to PyGMT

Just opened a PR for this at #2864. Surprisingly, most PyGMT functions already work with pyarrow.array or pyarrow.table without any modification (I've tested blockm, info, nearneighbor, project, triangulate, xyz2grd so far), possibly because PyGMT can convert them internally to numpy.array (see e.g. https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy). Will need to test out more complicated dtypes and check for edge cases, but it's looking promising!