One test fails with pandas 3.0 dev version

seisman commented 3 months ago

In the GMT Dev tests workflow, one test fails (see https://github.com/GenericMappingTools/pygmt/actions/runs/9540235948/job/26291686710).

__________________________ test_compute_bins_outfile ___________________________
[gw1] linux -- Python 3.12.3 /home/runner/micromamba/envs/pygmt/bin/python

grid = <xarray.DataArray 'z' (lat: 14, lon: 8)> Size: 448B
array([[347.5, 344.5, 386. , 640.5, 617. , 579. , 646.5, 671. ],
 ...3.5 -22.5 -21.5 -20.5 ... -12.5 -11.5 -10.5
Attributes:
    long_name:     elevation (m)
    actual_range:  [190. 981.]
expected_df =    start   stop  bin_id
0  345.5  519.5       0
1  519.5  726.5       1
region = [-52, -48, -22, -18]

    def test_compute_bins_outfile(grid, expected_df, region):
        """
        Test grdhisteq.compute_bins with ``outfile``.
        """
        with GMTTempFile(suffix=".txt") as tmpfile:
            with pytest.warns(RuntimeWarning) as record:
                result = grdhisteq.compute_bins(
                    grid=grid,
                    divisions=2,
                    region=region,
                    outfile=tmpfile.name,
                )
                assert len(record) == 1  # check that only one warning was raised
            assert result is None  # return value is None
            assert Path(tmpfile.name).stat().st_size > 0
            temp_df = pd.read_csv(
                filepath_or_buffer=tmpfile.name,
                sep="\t",
                header=None,
                names=["start", "stop", "bin_id"],
                dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
                index_col="bin_id",
            )
>           pd.testing.assert_frame_equal(
                left=temp_df, right=expected_df.set_index("bin_id")
            )

../pygmt/tests/test_grdhisteq.py:130: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = RangeIndex(start=0, stop=2, step=1, name='bin_id')
right = Index([0, 1], dtype='uint32', name='bin_id'), obj = 'DataFrame.index'

    def _check_types(left, right, obj: str = "Index") -> None:
        if not exact:
            return

        assert_class_equal(left, right, exact=exact, obj=obj)
        assert_attr_equal("inferred_type", left, right, obj=obj)

        # Skip exact dtype checking when `check_categorical` is False
        if isinstance(left.dtype, CategoricalDtype) and isinstance(
            right.dtype, CategoricalDtype
        ):
            if check_categorical:
                assert_attr_equal("dtype", left, right, obj=obj)
                assert_index_equal(left.categories, right.categories, exact=exact)
            return

>       assert_attr_equal("dtype", left, right, obj=obj)
E       AssertionError: DataFrame.index are different
E       
E       Attribute "dtype" are different
E       [left]:  int64
E       [right]: uint32

The fail is likely due to changes/bugs in pandas dev version (3.x). To reproduce the issue:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv("text.dat", sep=r"\s+",
...                 header=None,
...                 names=["start", "stop", "bin_id"],
...                 dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
...                 index_col="bin_id")
>>> df.index.dtype

pandas 2.x returns dtype('uint32') but pandas 3.x returns dtype('int64').

The test data is:

345.5  519.5       0
519.5  726.5       1

Need to read the pandas documentation to understand if it's a desired feature or an upstream bug.

weiji14 commented 3 months ago

Had a look at this locally with pandas=3.0.0.dev0+1125.gc46fb76afa. It seems like pandas is converting the index into a RangeIndex, which has an int64 dtype by default, instead of respecting the uint32 dtype we set. This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at https://github.com/pandas-dev/pandas/issues/9435

For context, we chose to force the uint32 dtype for bin_id at https://github.com/GenericMappingTools/pygmt/pull/1433#discussion_r818920375 (instead of using the int64 default in pandas 1.x/2.x). The reason was because we didn't think anyone would compute more than 2^32 bins with grdhisteq (usually 2^8=256 would be enough), and also there shouldn't be negative numbers in the bin_id column.

So, we could either go with int64 (pandas 3.0 default), or find a way to stick with uint32 (current state). What should we go for?

seisman commented 3 months ago

This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at pandas-dev/pandas#9435

I think it's a pandas bug. For comparison, the following codes return the expected dtype:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.read_csv("text.dat", sep=r"\s+",
   ...:                 header=None,
   ...:                 names=["start", "stop", "bin_id"],
   ...:                 dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
   ...:                 )

In [4]: df2 = df.set_index("bin_id")

In [5]: df2.index.dtype
Out[5]: dtype('uint32')

seisman commented 2 months ago

I've reported the issue to the upstream pandas repository at https://github.com/pandas-dev/pandas/issues/59077. Closing.

GenericMappingTools / pygmt

One test fails with pandas 3.0 dev version #3291