weiji14 opened this issue 10 months ago
Before supporting pyarrow-backed pandas objects as you're doing in PRs #2774 and #2845, maybe we should first check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then, if there is interest, we may support polars.

> maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.
Sure, I'd love to have direct support for PyArrow arrays too. I started with pyarrow-backed pandas objects because pandas 3.0 will eventually use PyArrow for string columns by default, but there's no reason we can't support passing a `pyarrow.array` object directly into PyGMT.

> check/support passing pyarrow arrays directly to PyGMT
Just opened a PR for this at #2864. Surprisingly, most PyGMT functions already work with `pyarrow.array` or `pyarrow.table` without any modification (I've tested `blockm`, `info`, `nearneighbor`, `project`, `triangulate`, and `xyz2grd` so far), possibly because PyGMT can convert them internally to `numpy.array` (see e.g. https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy). Will need to test out more complicated dtypes and check for edge cases, but it's looking promising!
Description of the desired feature
Apache Arrow is an in-memory format that is starting to become a common exchange format between different libraries in Python and other programming languages. For example:

- pandas is moving to `pyarrow.string` instead of the `object` dtype for strings (see PDEP10), so we will eventually need to support PyArrow (at least for string dtypes)
- dataframe libraries can exchange data through the `__dataframe__` protocol, see https://arrow.apache.org/docs/python/interchange_protocol.html

This issue is to track compatibility and support of different PyArrow data types in PyGMT:
- the `put_strings` method, which currently uses `np.char.encode`
- `array_to_datetime`, which expects Python datetime or numpy-backed arrays, xref #242
- `geopandas.GeoDataFrame` with PyArrow dtype columns

Simplest way of integrating would be to just handle PyArrow-backed
`pandas.DataFrame` objects as above. Alternatively, we could also discuss using PyArrow as the internal array representation (which would make `pyarrow` a hard dependency), since that may allow better interoperability with other Python libraries using Arrow; this might be relevant for #1318 and #2731. My thought is to do this through the `__dataframe__` protocol, see https://arrow.apache.org/docs/python/interchange_protocol.html

Further reading:
Are you willing to help implement and maintain this feature?
Yes, but help is welcome too!