Open · bretttully opened 1 week ago
:warning: GitHub issue #39914 has been automatically assigned in GitHub to PR creator.
By switching the logical ordering, it means that we don't need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend. And because you added a `name not in ext_columns` check to the subsequent methods that fill `ext_columns`, this should preserve the priority of the different methods to determine the pandas dtype? (metadata < pyarrow extension type < types_mapper)
Yes, exactly. Priority remains the same, but functions are skipped if the field already has a type, meaning that the code causing the error is no longer called if types_mapper is provided.
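To make the skip logic concrete, here is a minimal sketch of the pattern being discussed (a hypothetical helper; the names are illustrative stand-ins, not the actual variables in `pandas_compat.py`): the highest-priority source runs first, and each lower-priority source only fills columns that are still missing.

```python
# Hypothetical sketch, not the real pyarrow internals: each source fills
# ext_columns in priority order, skipping names already claimed.
def resolve_ext_columns(schema, types_mapper, extension_dtypes,
                        metadata_dtypes, pandas_dtype):
    ext_columns = {}

    # Highest priority: the user-supplied types_mapper runs first.
    if types_mapper is not None:
        for field in schema:
            dtype = types_mapper(field.type)
            if dtype is not None:
                ext_columns[field.name] = dtype

    # Next: pyarrow extension types, only for columns not yet claimed.
    for name, dtype in extension_dtypes.items():
        if name not in ext_columns:
            ext_columns[name] = dtype

    # Lowest priority: dtype strings from the stored pandas metadata.
    # The `name not in ext_columns` guard means pandas_dtype() is never
    # called for columns that types_mapper already handled, so an
    # unparseable nested dtype string is never touched.
    for name, dtype_str in metadata_dtypes.items():
        if name not in ext_columns:
            ext_columns[name] = pandas_dtype(dtype_str)

    return ext_columns
```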
You can ignore the `test_dlpack` failure in the tests (https://github.com/apache/arrow/issues/44728).
Thanks @jorisvandenbossche -- is the process that I can merge this following approval, or is that done by a core maintainer?
> is the process that I can merge this following approval, or is that done by a core maintainer?
A committer will merge (probably @jorisvandenbossche in this specific case) once CI is passing and everything is addressed. I've triggered CI for the latest changes.
@github-actions crossbow submit -g python
Revision: e3b9892e5663f4888bc79c38b9d36bbefcdaf2b4
Submitted crossbow builds: ursacomputing/crossbow @ actions-e01b93275b
@raulcd it seems something is going wrong with the minimal test builds (e.g. `example-python-minimal-build-fedora-conda`). The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892", which then messes up pandas' detection of the pyarrow version (for the pyarrow integration, pandas checks whether pyarrow is recent enough and otherwise errors), giving some test failures.
(but also not entirely sure how this PR causes this issue, since I don't see the nightlies fail for the minimal builds at the moment)
(the other failures are the known nightly dlpack failures)
> The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892"
From the git checkout I see it is pulling from the remote (`Syncing repository: bretttully/arrow`). I recall an issue where, if the dev tags are not present, we are unable to detect the correct version. The remote doesn't seem to have any other branches and/or tags. I've opened an issue, because we should find a way to not fail when the dev tag is not present:
Thanks for investigating that!
So then to resolve this here, @bretttully should fetch the upstream tags and push them to his fork? Something like

```
git fetch upstream
git push origin --tags
```

(assuming `upstream` is apache/arrow and `origin` is bretttully/arrow)
I have merged `upstream/main` and pushed tags. Let's see if this works...
@github-actions crossbow submit example-python-minimal-build-*
Revision: 685167fb8dc28190f9fd8600bca5df7799663e5a
Submitted crossbow builds: ursacomputing/crossbow @ actions-524e782c26
Task | Status
---|---
example-python-minimal-build-fedora-conda |
example-python-minimal-build-ubuntu-venv |
Rationale for this change
This is a long-standing pandas ticket with some fairly horrible workarounds, where complex arrow types do not serialise well to pandas because the pandas metadata dtype string is not parseable. However, `types_mapper` always had the highest priority, as it overrode whatever was set before.
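To illustrate the unparseable-metadata problem (a minimal sketch, assuming a pandas version affected by this issue; the dtype strings are examples of what pyarrow stores in the parquet pandas metadata):

```python
import pandas as pd

# Simple arrow-backed dtype strings round-trip fine:
pd.api.types.pandas_dtype("int64[pyarrow]")

# Nested dtype strings, as stored in the pandas metadata for struct or
# list columns, cannot be parsed back; on affected versions this raises
# a TypeError:
pd.api.types.pandas_dtype("struct<x: int64>[pyarrow]")
```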
What changes are included in this PR?

By switching the logical ordering, we no longer need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, which resolves the issue of a complex `dtype` with `list` or `struct`. It will likely still fail if the numpy backend is used, but at least this gives a working solution rather than an inability to load files at all.

Are these changes tested?
Existing tests stay unchanged, and a new test for the complex type has been added.
Are there any user-facing changes?
This PR contains a "Critical Fix". It makes `pd.read_parquet(..., dtype_backend="pyarrow")` work with complex data types where the metadata added by pyarrow during `DataFrame.to_parquet` is not serialisable and currently throws an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.
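As a concrete illustration (a minimal sketch; the file name and data are made up for the example), the round-trip this fix enables looks like:

```python
import pandas as pd
import pyarrow as pa

# A column with a nested (struct) arrow-backed dtype.
df = pd.DataFrame(
    {"a": pd.arrays.ArrowExtensionArray(pa.array([{"x": 1}, {"x": 2}]))}
)
df.to_parquet("nested.parquet")

# Previously this raised while parsing the stored pandas-metadata dtype
# string; with types_mapper taking priority, that string is never parsed
# and the read succeeds with arrow-backed dtypes.
out = pd.read_parquet("nested.parquet", dtype_backend="pyarrow")
```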