I think it's pandas, but I believe there's a default seed in the hash function for the column. If that was changed for 2.0 for whatever reason, it'd change the results of the hash function on the column.
Yeah, the more I look into this, the more I am also convinced that it is pandas. Just so we are all on the same page, the failing test compares hashes derived from a geopackage version of the hydrofabric. Here is the code:
```python
import hashlib

import numpy as np
from pandas.util import hash_array

@property
def uid(self) -> str:
    # removed docstring for readability
    layer_hashes = [np.apply_along_axis(hash_array, 0, self._dataframes[l].values).sum() for l in self._layer_names]
    return hashlib.sha1(','.join([str(h) for h in layer_hashes]).encode('UTF-8')).hexdigest()
```
`self._dataframes` is a dictionary mapping each geopackage layer name to the geopandas `GeoDataFrame` for that layer.
I wrote up a script that does basically the same thing, to make it easier to compare pandas versions. The full script is in the twirl-down if you are interested.
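In rough form, it does something like this (a sketch, not the exact script; the geopackage path is a placeholder):

```python
import fiona
import geopandas as gpd
import numpy as np
from pandas.util import hash_array

p = "hydrofabric.gpkg"  # placeholder path
layers = fiona.listlayers(p)
print(f"layers: {layers}")
for layer in layers:
    df = gpd.read_file(p, layer=layer)
    print(f"layer: {layer}")
    print(f"columns : {list(df.columns)}")
    print(f"column type: {[str(t) for t in df.dtypes]}")
    # one hash sum per column (uint64 arithmetic, so values wrap mod 2**64)
    print(list(np.apply_along_axis(hash_array, 0, df.values).sum(axis=0)))
```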
Looking at the combined output below, it looks like the discrepancies are confined to the numeric datatypes. This leads me to think there might be differences in how `na`/`None` values are represented and/or hashed between the two versions. Looking into that now.
```
layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
1.5.3: [9910909016688245206, 4257016642943720818, 523519807225067410, 4254618872625276163, 9604451431115515325]
2.0.0: [9910909016688245206, 7714536644407060282, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
1.5.3: [11397087368252117007, 10440532369965811158, 5750755180915541183, 6610491045185399198, 12480819840933176463, 16223896574922372839, 9910909016688245206, 4254618872625276163, 1289937201443695659]
2.0.0: [11397087368252117007, 295265702994286425, 3079000369136598424, 6610491045185399198, 16619260224802041063, 10866567253940249541, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
1.5.3: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
2.0.0: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
1.5.3: [11397087368252117007, 4254618872625276163, 18446744073709551609]
2.0.0: [11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [11397087368252117007, 18446744073709551609, 18446744073709551609, 18446744073709551612, 6945878055642010754, 10639750839072135192, 14059973711066289446, 6292855941134473512, 15378706233695216860, 10232710820753472246, 18446744073709551612, 18446744073709551612, 3296008595016872355, 16812141029101979377, 7177634671963409277, 5902442301448765624, 18446744073709551609]
2.0.0: [11397087368252117007, 18446744073709551609, 18446744073709551609, 3179149979871189512, 11986439733596641007, 11099151438926169628, 17812221497236259243, 5178549615237787454, 93929622743921236, 11651573430419844666, 3179149979871189512, 3179149979871189512, 6398729633788531660, 1245452818585839232, 12465211508769344715, 6362659139139935070, 18446744073709551609]
layer: crosswalk
columns : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
1.5.3: [3752687620607738028, 10679333900438930345, 11644050028096106736, 2569989159426054126, 11240671135618503159, 13074351299704411022, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
2.0.0: [3752687620607738028, 10679333900438930345, 8742293209049677422, 8334557188444525933, 15888965815924940002, 101168048888588003, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 5978206864406560404, 7481721767258314475, 4938289557806202459, 4938289557806202459, 4938289557806202459, 4938289557806202459, 10879245026471355412, 10879245026471355412, 10879245026471355412, 10879245026471355412, 3720317251396855529, 3720317251396855529, 3720317251396855529, 3720317251396855529, 10415052716999926891, 7882324864216808000, 6833873961703491437, 12397854334200377101, 11714806517525372952, 14084054139117991909, 14084054139117991909, 14084054139117991909, 14084054139117991909, 13183223717059209830, 13183223717059209830, 13183223717059209830, 13183223717059209830, 17622765900528814828, 18446744073709551609]
2.0.0: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 17218099967287971373, 13656126546251369994, 8348731800770055667, 8348731800770055667, 8348731800770055667, 8348731800770055667, 5343560834163534075, 5343560834163534075, 5343560834163534075, 5343560834163534075, 3269881205696930493, 3269881205696930493, 3269881205696930493, 3269881205696930493, 13084577964608946138, 17632152680294710829, 6280827595974488025, 11366273065597691016, 12540298840484485334, 6453370661115098702, 6453370661115098702, 6453370661115098702, 6453370661115098702, 11911029417517962984, 11911029417517962984, 11911029417517962984, 11911029417517962984, 15250992649697139781, 18446744073709551609]
```
So, I've started to isolate the problem, but I still don't understand why it is happening. Something seems different about `pd.DataFrame.values` between `1.5.3` and `2.0.0`:
```
1.5.3
apply to each row
[14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024]
apply_along_axis to each row
[14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024]
apply to each column
[17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488]
2.0.0
apply to each row
[14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024,
14167925659631966024]
apply_along_axis to each row
[17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488]
apply to each column
[17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488,
17625445661095305488]
```
So, I figured it out. Here is the simplest example that illustrates and reproduces the problem:
```python
import numpy as np
from pandas.util import hash_array

a = np.array([1.0], dtype="object")
print(hash_array(a))
# 1.5.3
# [3035652100526550566]
# 2.0.0
# [7736021350537868001]
```
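As an aside, forcing `categorize=False` skips the factorize-and-hash-as-categorical path entirely, which (as the per-layer results further down confirm) makes the object-dtype hashes agree across both versions:

```python
import numpy as np
from pandas.util import hash_array

a = np.array([1.0], dtype="object")
# bypass the categorical fast path; hash the object values directly
print(hash_array(a, categorize=False))
# same value under 1.5.3 and 2.0.0
```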
Having looked through the pandas source, this regression was introduced in https://github.com/pandas-dev/pandas/pull/50001, specifically in the change shown in the diff below.
```diff
diff --git a/pandas/core/util/hashing.py b/pandas/core/util/hashing.py
index 5a5e46e0227aa..e0b18047aa0ec 100644
--- a/pandas/core/util/hashing.py
+++ b/pandas/core/util/hashing.py
@@ -344,9 +344,7 @@ def _hash_ndarray(
         )
 
         codes, categories = factorize(vals, sort=False)
-        cat = Categorical(
-            codes, Index._with_infer(categories), ordered=False, fastpath=True
-        )
+        cat = Categorical(codes, Index(categories), ordered=False, fastpath=True)
         return _hash_categorical(cat, encoding, hash_key)
 
     try:
```
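A quick check shows what the two constructors do with the factorized categories (this uses the same private `Index._with_infer` helper the old code called, so treat it as illustrative only):

```python
import numpy as np
import pandas as pd
from pandas import Index

# factorize an object array of floats, as _hash_ndarray does internally
_, categories = pd.factorize(np.array([1.0], dtype="object"), sort=False)
print(Index._with_infer(categories).dtype)  # float64 (1.5.3 path: dtype inferred from values)
print(Index(categories).dtype)              # object  (2.0.0 path: dtype preserved)
```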
In short, the array is categorized, and in `1.5.3` the dtype of the resulting categories is inferred from the values rather than taken from the `dtype` specified on the `np.ndarray` object. In `2.0.0` this has been fixed, so hashed `np.ndarray`s now respect their `dtype` rather than the inferred one. Tying this back to `pd.DataFrame.values`: `.values` must set its returned `np.ndarray`'s `dtype` to a type that all the types in the collection can be cast to (e.g. `float64`, `int32`, `object`). So in our case, since we have dataframes of strings, floats, and ints, the `dtype` has to be `object`. That `object` dtype is consequently inherited by any inner dimension of the `ndarray` view. My guess is that `.values` actually returns a copy-on-write (CoW) view of the dataframe's inner `ndarray`s, and that view has to "show" all inner array dimensions as the outermost `dtype`.
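A small illustration of that upcasting and its effect on the hash (nothing here beyond stock pandas and numpy; run under 2.0.0 to see the two hashes differ):

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array

df = pd.DataFrame({"s": ["a"], "f": [1.0], "i": [1]})
print(df.values.dtype)       # object: the only dtype all three columns can cast to
print(df["f"].values.dtype)  # float64 when the column is pulled out on its own

# the same float hashes differently depending on the array's dtype (under 2.0.0)
print(hash_array(np.array([1.0], dtype="float64")))
print(hash_array(np.array([1.0], dtype="object")))
```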
More weirdness
```python
import numpy as np
import hashlib
from pandas.util import hash_array, hash_pandas_object
import geopandas as gpd
import fiona
p = "<path-to-repo>/dmod/refactor-data-service/data/example_hydrofabric_2/hydrofabric.gpkg"
layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}
layer_hashes = [np.apply_along_axis(hash_array, 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [10103771696888273306, 4572071176093428412, 15272590391029730009, 15651706240877393163, 15901469198598983537, 17501800407106816969, 756873605291097582]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 15651706240877393163, 12994735377762201353, 12039723046569473286, 18438881045715204344]
layer_hashes = [np.apply_along_axis(lambda h: hash_array(h, categorize=False), 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]
layer_hashes = [np.sum(hash_pandas_object(dataframes[layer]).values) for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
layer_hashes = [hash_pandas_object(dataframes[layer]).sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]
# 2.0.0
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]
layer_hashes = [hash_pandas_object(dataframes[layer]).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
layer_hashes = [dataframes[layer].apply(hash_pandas_object).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]
# 2.0.0
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]
layer_hashes = [dataframes[layer].apply(lambda a: hash_array(a.values), axis=0).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
# 2.0.0
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
```
Given that `hash_pandas_object` produces the same result under both versions (if the sum is computed using numpy), I think our best bet is to switch our implementation to use `hash_pandas_object`. Having talked with @robertbartel about this, the likely reason `hash_array` is used now is concern about `geopandas`, specifically the `geometry` columns in a `geopandas` dataframe. In brief, `geopandas` uses `shapely` objects to represent geometries, and at one point (`shapely<2.0.0`) `shapely` geometries were not hashable (see shapely #209 and geopandas #221). However, we now require `shapely>=2.0.0`, so this should not be an issue.
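For concreteness, the switch could look something like this (a sketch only, mirroring the current `uid` property; the `np.sum` over `.values` keeps the sum in uint64 so it matches across versions):

```python
import hashlib

import numpy as np
from pandas.util import hash_pandas_object

@property
def uid(self) -> str:
    # one uint64 hash sum per layer; summing with numpy keeps the result unsigned
    layer_hashes = [np.sum(hash_pandas_object(self._dataframes[l]).values) for l in self._layer_names]
    return hashlib.sha1(','.join(str(h) for h in layer_hashes).encode('UTF-8')).hexdigest()
```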
Reopening this because tests are failing again because of a related failure. The failure started recurring 3 weeks ago. https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319
```
Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '10105591058b39504e73842da89e0c3dcac5ba99' != 'b7367023aadad961315dd05e184359dad68613c3'
- 10105591058b39504e73842da89e0c3dcac5ba99
+ b7367023aadad961315dd05e184359dad68613c3
```
The same code path is not affected. #468 will track this instead.
The `dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric.test_uid_1_a` test is currently failing in several PRs. Below is a snippet from an action log showing the failure (source).
I compared the dependency versions installed when the tests were passing with those from the failing tests, and it seems that `pandas==2.0.0` is the likely culprit. The last known working pandas version is `1.5.3`. I tested this locally with `fiona` versions `1.9.1` and `1.9.3` alongside `pandas==1.5.3`, and the tests passed. However, there is one outlier action with `pandas==2.0.0` and `fiona==1.9.2` installed that passed? I'm still a little puzzled about that one and I've not been able to reproduce it locally (yet; I'll do that in the morning, since there isn't a `fiona` wheel for that version for my machine).

- Passing with `pandas==1.5.3`
- Failing with `pandas==2.0.0`
- Weird passing test with `pandas==2.0.0`
I went looking through fiona's changelog and the PRs for release `1.9.3`, and it doesn't look like anything there is related. I've not gone through the `geopandas` changelog yet, so I need to check there too.