ArtesiaWater / hydropandas

Module for loading observation data into custom DataFrames
https://hydropandas.readthedocs.io
MIT License
52 stars 10 forks source link

Indexing ObsCollection always returns "obs" column #210

Closed martinvonk closed 1 week ago

martinvonk commented 2 months ago

Not sure why this happens but when I use loc on the ObsCollection the column with the "obs" is always returned

oc_sel = oc.loc[:, ["x", "y"]]
oc_sel.columns -> Index(['x', 'y', 'obs'], dtype='object')

Same happens with iloc oc.iloc[:, [0,1]]

martinvonk commented 2 months ago

This is the reason why #209 fails

martinvonk commented 2 months ago

Happened to me after updating pandas to v.2.2.2 Updating pandas (2.2.1 -> 2.2.2)

martinvonk commented 2 months ago

Not sure if we need to do something with this. #209 is already fixed. Maybe if behavior keeps existing after v2.2.2 we need to adress this.

OnnoEbbens commented 2 months ago

It has become an issue in the bro_bronhouder notebook now as well. I think it has something to do with this error: https://github.com/pandas-dev/pandas/issues/57032

OnnoEbbens commented 2 months ago

Geopandas had the same problem: https://github.com/geopandas/geopandas/issues/3060

OnnoEbbens commented 2 months ago

For now I will pin the pandas version to 2.2.1 or lower because this also gives errors in nlmod.

OnnoEbbens commented 1 week ago

It seems like a change in pandas behavior that caused this issue can actually help to improve hydropandas.

Behavior pandas <=2.2.1:

oc.loc[:, ["x", "y"]]

returned an ObsCollection. Just like in geopandas:

gdf.loc[:, ['id']]

would return a GeoDataframe

Recent changes to pandas (=2.2.2) changed this behavior in such a way that:

oc.loc[:, ["x", "y"]]

returns a DataFrame. Due to this behavior the ObsCollection constructor would attach an Obs column to it. I was quite surprised that the loc function calls the constructor when it returns an object but apparently it does.

I added an extra check to the constructor to see if it was called using a DataFrame and an obs_list/ObsClass. Only if one of the obs_list/ObsClass is given the 'obs' column is attached.

We should improve hydropandas further by following the behavior of Geopandas. If you slice an ObsCollection without the 'obs' column a DataFrame should be returned. Now an ObsCollection without the 'obs' column is returned which is an invalid ObsCollection.

We could even think about supporting an ObsSeries. Atm when you obtain a single column from an ObsCollection it will always return a pandas Series. For now I don't really like to do this. I think this adds more headache than hooray.