Closed seisman closed 2 months ago
Had a look at this locally with pandas=3.0.0.dev0+1125.gc46fb76afa. It seems like pandas is converting the index into a RangeIndex, which has an int64 dtype by default, instead of respecting the uint32 dtype we set. This seems like a regression bug in pandas actually; there are some similar ones reported, e.g. at https://github.com/pandas-dev/pandas/issues/9435
For context, we chose to force the uint32 dtype for bin_id at https://github.com/GenericMappingTools/pygmt/pull/1433#discussion_r818920375 (instead of using the int64 default in pandas 1.x/2.x). The reason was that we didn't think anyone would compute more than 2^32 bins with grdhisteq (usually 2^8=256 would be enough), and there shouldn't be any negative numbers in the bin_id column either.
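As a quick sanity check on that reasoning, here is a minimal sketch that uses only numpy to confirm the value range of uint32. The bin count of 256 is just the typical figure mentioned above, not a limit imposed by grdhisteq.

```python
import numpy as np

# Range of the unsigned 32-bit integer type used for bin_id.
uint32_info = np.iinfo(np.uint32)
uint32_min = int(uint32_info.min)  # no negative values are representable
uint32_max = int(uint32_info.max)  # largest bin_id that fits

# A typical histogram equalization with 2**8 bins uses ids 0..255.
typical_bin_ids = np.arange(2**8, dtype=np.uint32)

print(uint32_min, uint32_max, typical_bin_ids.max())
```

So uint32 leaves ample headroom for any realistic number of bins while ruling out negative ids by construction.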
So, we could either go with int64 (pandas 3.0 default), or find a way to stick with uint32 (current state). What should we go for?
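If we wanted to stick with uint32, one possible approach (only a sketch, not what PyGMT currently does) would be to cast the index explicitly after the DataFrame is built. The helper name force_uint32_index and the sample values below are hypothetical, and the RangeIndex is constructed by hand to mimic the int64 index described above.

```python
import numpy as np
import pandas as pd


def force_uint32_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose index is cast to uint32 (hypothetical helper)."""
    out = df.copy()
    # Index.astype keeps the index name and converts a RangeIndex
    # into a regular Index with the requested dtype.
    out.index = out.index.astype(np.uint32)
    return out


# Made-up data with an int64 RangeIndex named "bin_id".
df = pd.DataFrame(
    {
        "start": np.array([0.0, 1.5], dtype=np.float32),
        "stop": np.array([1.5, 3.0], dtype=np.float32),
    },
    index=pd.RangeIndex(2, name="bin_id"),
)

fixed = force_uint32_index(df)
print(df.index.dtype, fixed.index.dtype)
```

This costs an extra copy of the index, but it makes the output dtype independent of whatever default pandas picks.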
> This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at pandas-dev/pandas#9435
I think it's a pandas bug. For comparison, the following code returns the expected dtype:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.read_csv("text.dat", sep=r"\s+",
...: header=None,
...: names=["start", "stop", "bin_id"],
...: dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
...: )
In [4]: df2 = df.set_index("bin_id")
In [5]: df2.index.dtype
Out[5]: dtype('uint32')
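The session above reads a local file, text.dat. For anyone who wants to run the same comparison without that file, here is a self-contained sketch that feeds made-up rows through io.StringIO instead. The three data rows are placeholders, not the actual test data, and the snippet assumes pandas 2.x or later.

```python
import io

import numpy as np
import pandas as pd

# Placeholder rows in the same "start stop bin_id" layout (not the real test data).
text = "0.0 1.5 0\n1.5 3.0 1\n3.0 4.5 2\n"

df = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    header=None,
    names=["start", "stop", "bin_id"],
    dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
)

# Move the bin_id column into the index and inspect the resulting dtype.
df2 = df.set_index("bin_id")
print(df2.index.dtype)
```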
I've reported the issue to the upstream pandas repository at https://github.com/pandas-dev/pandas/issues/59077. Closing.
In the GMT Dev tests workflow, one test fails (see https://github.com/GenericMappingTools/pygmt/actions/runs/9540235948/job/26291686710).
The failure is likely due to changes/bugs in the pandas dev version (3.x). To reproduce the issue:
pandas 2.x returns dtype('uint32'), but pandas 3.x returns dtype('int64').
The test data is:
Need to read the pandas documentation to understand if it's a desired feature or an upstream bug.