NOAA-OWP / DMOD

Distributed Model on Demand infrastructure for OWP's Model as a Service

Pandas likely 2.0.0 causing modeldata test to fail #324

Closed aaraney closed 1 year ago

aaraney commented 1 year ago

The dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric.test_uid_1_a test is currently failing in several PRs. Below is a snippet from an action log showing the failure.


===========================================================================
............................F...............
======================================================================
FAIL: test_uid_1_a (dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric)
Test that the hydrofabric instance for example one has the expected unique id.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '7b022f401ea2da1fdce2c1c2e36a8664b2299778' != '8a24b5eeae2596ceaf21058c49a27c8ae6f444ab'
- 7b022f401ea2da1fdce2c1c2e36a8664b2299778
+ 8a24b5eeae2596ceaf21058c49a27c8ae6f444ab

I compared the dependency versions installed when the tests were passing with those installed in the failing runs, and pandas==2.0.0 seems to be the likely culprit. The last known working pandas version is 1.5.3. I tested this locally with fiona versions 1.9.1 and 1.9.3 and pandas==1.5.3, and the tests passed. However, there is one outlier action run with pandas==2.0.0 and fiona==1.9.2 installed that passed. I'm still a little puzzled about that one and I've not been able to reproduce it locally (yet; I'll do that in the morning, since there isn't a fiona wheel for that version for my machine).

Action runs compared: passing with pandas==1.5.3, failing with pandas==2.0.0, and the weird passing run with pandas==2.0.0.

I went looking through fiona's change log and PRs for release 1.9.3 and it doesn't look like anything is related. I've not looked through the geopandas change log yet, so I need to check there too.
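
For anyone repeating the comparison, a small sketch like the following can print the installed versions of the relevant packages in each environment. It only uses the standard library; the package list is an assumption about which dependencies matter here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed list of packages that could plausibly influence the hash.
for name, ver in installed_versions(["pandas", "geopandas", "fiona", "numpy"]).items():
    print(f"{name}=={ver}")
```

Running this in a passing and a failing environment and diffing the two outputs gives the same kind of comparison described above.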

christophertubbs commented 1 year ago

I think it's pandas, but I believe there's a default seed in the hash function for the column. If that was changed for 2.0 for whatever reason, it'd change the results of the hash function on the column.
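
For reference, `pandas.util.hash_array` does accept a `hash_key` argument. The sketch below assumes the default key is the string `"0123456789123456"` (as given in the pandas documentation) and uses made-up ids; as far as I can tell the key only affects values hashed as strings, not numeric dtypes.

```python
import numpy as np
from pandas.util import hash_array

# Hypothetical ids stored as an object-dtype array of strings.
ids = np.array(["cat-1", "cat-2", "nex-3"], dtype=object)

default_hash = hash_array(ids)
# Passing the (assumed) default key explicitly should give the same hashes.
explicit_default = hash_array(ids, hash_key="0123456789123456")
# A different 16-character key should give different hashes.
other_key = hash_array(ids, hash_key="6543210987654321")

print((default_hash == explicit_default).all())
print((default_hash == other_key).all())
```

If the default key had changed between releases, string columns would be expected to hash differently too, which makes this an easy thing to rule in or out.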

aaraney commented 1 year ago

Yeah, the more I look into this, I am also convinced that it is pandas too. Just so we are all on the same page, the test that is failing is comparing hashes derived from a geopackage version of the hydrofabric. Here is the code:

    @property
    def uid(self) -> str:
        # removed docstring for readability
        layer_hashes = [np.apply_along_axis(hash_array, 0, self._dataframes[l].values).sum() for l in self._layer_names]
        return hashlib.sha1(','.join([str(h) for h in layer_hashes]).encode('UTF-8')).hexdigest()

self._dataframes is a dictionary mapping each geopackage layer name to the geopandas GeoDataFrame for that layer.
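
To make the computation easier to try outside the class, here is a rough standalone sketch of the same steps. The two small frames and the `compute_uid` name are hypothetical stand-ins (plain pandas DataFrames, no geometry column), not the real hydrofabric data.

```python
import hashlib

import numpy as np
import pandas as pd
from pandas.util import hash_array

# Hypothetical stand-ins for the geopackage layers.
dataframes = {
    "divides": pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.25]}),
    "nexus": pd.DataFrame({"id": ["nex-1", "nex-2"], "toid": ["wb-1", "wb-2"]}),
}
layer_names = list(dataframes)


def compute_uid(frames, names):
    # Hash every value column by column, then sum all hashes of a layer.
    layer_hashes = [
        np.apply_along_axis(hash_array, 0, frames[name].values).sum() for name in names
    ]
    # Join the per-layer sums and digest them into a 40-character hex string.
    return hashlib.sha1(",".join(str(h) for h in layer_hashes).encode("UTF-8")).hexdigest()


uid = compute_uid(dataframes, layer_names)
print(uid)
```

Because the result depends on what `hash_array` returns for each column slice, any change in how pandas hashes those slices changes the final digest.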

I wrote up a script to do basically the same thing to more easily compare pandas versions. The script is in the twirl down if you are interested.

Hash each column in geopackage script

```python
import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
import fiona

p = "/data/example_hydrofabric_2/hydrofabric.gpkg"

layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

with open(pd.__version__, "w") as f:
    f.write(f"layers: {str(layers)}\n")
    for l in layers:
        f.write(f"layer: {l}\n")
        f.write(f"columns: {str(dataframes[l].columns.to_list())}\n")
        f.write(f"column type: {list(map(str, dataframes[l].dtypes))}\n")
        # computes hash of each element in dataframe (equivalent to pd.Dataframe.applymap)
        hash_of_each_value = np.apply_along_axis(hash_array, 0, dataframes[l].values)
        summed_hash_on_each_row = np.apply_along_axis(np.sum, 0, hash_of_each_value)
        f.write(f"{str(summed_hash_on_each_row.tolist())}\n")

print(pd.__version__)
```

Looking at the combined output below, it looks like the discrepancies are in the numeric datatypes. This leads me to think there might be discrepancies in how NA / None values are represented and/or hashed between the two versions. Looking into that now.

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
1.5.3: [9910909016688245206, 4257016642943720818, 523519807225067410, 4254618872625276163, 9604451431115515325]
2.0.0: [9910909016688245206, 7714536644407060282, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
1.5.3: [11397087368252117007, 10440532369965811158, 5750755180915541183, 6610491045185399198, 12480819840933176463, 16223896574922372839, 9910909016688245206, 4254618872625276163, 1289937201443695659]
2.0.0: [11397087368252117007, 295265702994286425, 3079000369136598424, 6610491045185399198, 16619260224802041063, 10866567253940249541, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
1.5.3: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
2.0.0: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
1.5.3: [11397087368252117007, 4254618872625276163, 18446744073709551609]
2.0.0: [11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [11397087368252117007, 18446744073709551609, 18446744073709551609, 18446744073709551612, 6945878055642010754, 10639750839072135192, 14059973711066289446, 6292855941134473512, 15378706233695216860, 10232710820753472246, 18446744073709551612, 18446744073709551612, 3296008595016872355, 16812141029101979377, 7177634671963409277, 5902442301448765624, 18446744073709551609]
2.0.0: [11397087368252117007, 18446744073709551609, 18446744073709551609, 3179149979871189512, 11986439733596641007, 11099151438926169628, 17812221497236259243, 5178549615237787454, 93929622743921236, 11651573430419844666, 3179149979871189512, 3179149979871189512, 6398729633788531660, 1245452818585839232, 12465211508769344715, 6362659139139935070, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
1.5.3: [3752687620607738028, 10679333900438930345, 11644050028096106736, 2569989159426054126, 11240671135618503159, 13074351299704411022, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
2.0.0: [3752687620607738028, 10679333900438930345, 8742293209049677422, 8334557188444525933, 15888965815924940002, 101168048888588003, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 5978206864406560404, 7481721767258314475, 4938289557806202459, 4938289557806202459, 4938289557806202459, 4938289557806202459, 10879245026471355412, 10879245026471355412, 10879245026471355412, 10879245026471355412, 3720317251396855529, 3720317251396855529, 3720317251396855529, 3720317251396855529, 10415052716999926891, 7882324864216808000, 6833873961703491437, 12397854334200377101, 11714806517525372952, 14084054139117991909, 14084054139117991909, 14084054139117991909, 14084054139117991909, 13183223717059209830, 13183223717059209830, 13183223717059209830, 13183223717059209830, 17622765900528814828, 18446744073709551609]
2.0.0: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 17218099967287971373, 13656126546251369994, 8348731800770055667, 8348731800770055667, 8348731800770055667, 8348731800770055667, 5343560834163534075, 5343560834163534075, 5343560834163534075, 5343560834163534075, 3269881205696930493, 3269881205696930493, 3269881205696930493, 3269881205696930493, 13084577964608946138, 17632152680294710829, 6280827595974488025, 11366273065597691016, 12540298840484485334, 6453370661115098702, 6453370661115098702, 6453370661115098702, 6453370661115098702, 11911029417517962984, 11911029417517962984, 11911029417517962984, 11911029417517962984, 15250992649697139781, 18446744073709551609]
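
Rather than eyeballing the lists, the differing columns can be picked out programmatically. This is just a sketch using the values copied from the divides layer in the combined output above; the variable names are made up.

```python
# Values copied from the "divides" layer in the combined output.
columns = ["id", "areasqkm", "type", "toid", "geometry"]
dtypes = ["object", "float64", "object", "object", "geometry"]
hashes_1_5_3 = [9910909016688245206, 4257016642943720818, 523519807225067410,
                4254618872625276163, 9604451431115515325]
hashes_2_0_0 = [9910909016688245206, 7714536644407060282, 523519807225067410,
                4254618872625276163, 9604451431115515325]

# Keep only the columns whose summed hash differs between the two versions.
changed = [
    (name, dtype)
    for name, dtype, old, new in zip(columns, dtypes, hashes_1_5_3, hashes_2_0_0)
    if old != new
]
print(changed)  # [('areasqkm', 'float64')]
```

The same comparison can be repeated for the other layers by swapping in their column, dtype, and hash lists.
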
aaraney commented 1 year ago

So, I've started to isolate the problem; however, I still don't understand why this is happening. Something seems different about pd.DataFrame.values between 1.5.3 and 2.0.0:

Script

```python
import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
from pprint import pprint

p = "/data/example_hydrofabric_2/hydrofabric.gpkg"
print(pd.__version__)

df = gpd.read_file(p, layer="divides")

subset = df[["id", "areasqkm"]]
subset_loc = df.loc[:, ["id", "areasqkm"]]
square = pd.DataFrame({"id": df["id"], "areasqkm": df["areasqkm"]})
loc = pd.DataFrame({"id": df.loc[:, "id"], "areasqkm": df.loc[:, "areasqkm"]})
values = pd.DataFrame({"id": df["id"].values, "areasqkm": df["areasqkm"].values})
tolist = pd.DataFrame({"id": df["id"].values.tolist(), "areasqkm": df["areasqkm"].values.tolist()})

dfs = [subset, subset_loc, square, loc, values, tolist]

print("apply to each row")
pprint([d.apply(lambda a: hash_array(a.values), axis=0).values.sum() for d in dfs])

print("apply_along_axis to each row")
pprint([np.apply_along_axis(hash_array, 0, d.values).sum() for d in dfs])

print("apply to each column")
pprint([d.apply(lambda a: hash_array(a.values), axis=1).values.sum().sum() for d in dfs])

print("apply_along_axis to each column")
pprint([np.apply_along_axis(hash_array, 1, d.values).sum() for d in dfs])
```
1.5.3
apply to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply_along_axis to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
2.0.0
apply to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply_along_axis to each row
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
aaraney commented 1 year ago

So, I figured it out. Here is the simplest example that illustrates and reproduces the problem:

import numpy as np
from pandas.util import hash_array

a = np.array([1.0], dtype="object")
print(hash_array(a))

# 1.5.3
# [3035652100526550566]

# 2.0.0
# [7736021350537868001]

Having looked through the pandas source, this regression was introduced in https://github.com/pandas-dev/pandas/pull/50001, specifically here (diff below).

diff --git a/pandas/core/util/hashing.py b/pandas/core/util/hashing.py
index 5a5e46e0227aa..e0b18047aa0ec 100644
--- a/pandas/core/util/hashing.py
+++ b/pandas/core/util/hashing.py
@@ -344,9 +344,7 @@ def _hash_ndarray(
             )

             codes, categories = factorize(vals, sort=False)
-            cat = Categorical(
-                codes, Index._with_infer(categories), ordered=False, fastpath=True
-            )
+            cat = Categorical(codes, Index(categories), ordered=False, fastpath=True)
             return _hash_categorical(cat, encoding, hash_key)

         try:

In short, the array is factorized into a categorical. In 1.5.3 the dtype of the categories was inferred from their values (Index._with_infer) instead of being taken from the dtype specified on the np.ndarray object. In 2.0.0 this seems to have been fixed, so hashed np.ndarrays now respect their dtype. Tying this back to pd.DataFrame.values: .values must set the returned np.ndarray's dtype to a type that every type in the collection can be cast to (e.g. float64, int32, object). In our case, since we have a dataframe of strings, floats, and ints, the dtype has to be object. Consequently, any inner dimension of that ndarray (for example a single column slice) inherits the object dtype. My guess is that .values actually returns a copy-on-write (CoW) view of the dataframe's inner ndarrays, and that view has to "show" all inner array dimensions with the outermost dtype.
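
The dtype behaviour described above is easy to check on a toy frame. This is a minimal sketch with a hypothetical two-column DataFrame (one string column, one float column), not the hydrofabric data.

```python
import pandas as pd

# Hypothetical frame mixing a string column and a float column.
df = pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.25]})

# The column on its own keeps its numeric dtype.
column_dtype = df["areasqkm"].values.dtype
# The whole frame has to fall back to a common dtype.
frame_dtype = df.values.dtype
# A column slice of the combined array inherits that common dtype.
slice_dtype = df.values[:, 1].dtype

print(column_dtype, frame_dtype, slice_dtype)  # float64 object object
```

So `np.apply_along_axis(hash_array, 0, df.values)` hands `hash_array` object-dtype slices, while `df.apply(lambda a: hash_array(a.values))` hands it the float64 column directly.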

aaraney commented 1 year ago

More weirdness

import numpy as np
import hashlib
from pandas.util import hash_array, hash_pandas_object
import geopandas as gpd
import fiona

p = "<path-to-repo>/dmod/refactor-data-service/data/example_hydrofabric_2/hydrofabric.gpkg"

layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

layer_hashes = [np.apply_along_axis(hash_array, 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [10103771696888273306, 4572071176093428412, 15272590391029730009, 15651706240877393163, 15901469198598983537, 17501800407106816969, 756873605291097582]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 15651706240877393163, 12994735377762201353, 12039723046569473286, 18438881045715204344]

layer_hashes = [np.apply_along_axis(lambda h: hash_array(h, categorize=False), 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]

layer_hashes = [np.sum(hash_pandas_object(dataframes[layer]).values) for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

layer_hashes = [hash_pandas_object(dataframes[layer]).sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]
# 2.0.0
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]

layer_hashes = [hash_pandas_object(dataframes[layer]).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

layer_hashes = [dataframes[layer].apply(hash_pandas_object).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]
# 2.0.0
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]

layer_hashes = [dataframes[layer].apply(lambda a: hash_array(a.values), axis=0).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
# 2.0.0
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
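
One note on the negative numbers from `hash_pandas_object(...).sum()`: they appear to be the same 64-bit sums as the `.values.sum()` results, just interpreted as signed integers. A quick check using the first and third values reported above:

```python
import numpy as np

# Sums reported above for the first and third layers via `.values.sum()` (uint64).
unsigned_sums = np.array([14930998970528890480, 13487323105346499342], dtype=np.uint64)

# Reinterpret the same 64 bits as signed integers.
signed_sums = unsigned_sums.view(np.int64)
print(signed_sums.tolist())  # [-3515745103180661136, -4959420968363052274]
```

Values below 2**63 (such as 5391557828027012765) are unchanged by the reinterpretation, which matches the entries that are identical in both outputs.
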
aaraney commented 1 year ago

Given that hash_pandas_object produces the same result for both versions (if the sum is computed using numpy), I think our best bet is to switch our implementation to use hash_pandas_object. Having talked with @robertbartel about this, hash_array is likely used now because of concerns with geopandas, specifically geometry columns in a geopandas dataframe. In brief, geopandas uses shapely objects to represent geometries, and at one point (shapely<2.0.0) shapely geometries were not hashable (see shapely #209 and geopandas #221). However, we now require shapely>=2.0.0, so this should not be an issue.
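
As a rough idea of what the switch could look like, here is a sketch of the uid computation built on `hash_pandas_object`. The frames and the `compute_uid` name are hypothetical, geometry columns are left out, and this is not necessarily the change that ends up in DMOD.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

# Hypothetical stand-ins for the geopackage layers (no geometry column).
frames = {
    "divides": pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.25]}),
    "nexus": pd.DataFrame({"id": ["nex-1", "nex-2"], "toid": ["wb-1", "wb-2"]}),
}


def compute_uid(layer_frames):
    layer_hashes = []
    for name in layer_frames:
        # One uint64 hash per row; the index is included by default.
        row_hashes = hash_pandas_object(layer_frames[name])
        # Sum with numpy so the result stays an unsigned 64-bit value.
        layer_hashes.append(row_hashes.values.sum())
    return hashlib.sha1(",".join(str(h) for h in layer_hashes).encode("UTF-8")).hexdigest()


uid = compute_uid(frames)
print(uid)
```

Summing through `.values` rather than `Series.sum()` follows the observation above that the two give differently signed results.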

aaraney commented 1 year ago

Reopening this because tests are failing again because of a related failure. This failure started recurring 3 weeks ago. https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319

Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '10105591058b39504e73842da89e0c3dcac5ba99' != 'b7367023aadad961315dd05e184359dad68613c3'
- 10105591058b39504e73842da89e0c3dcac5ba99
+ b7367023aadad961315dd05e184359dad68613c3
aaraney commented 1 year ago

The same code path is not affected. #468 will track this instead.