geopandas / geopandas

Python tools for geographic data
http://geopandas.org/
BSD 3-Clause "New" or "Revised" License

BUG: Inconsistent suffix behaviour between merge and sjoin #2936

Open ennoToUpper opened 1 year ago

ennoToUpper commented 1 year ago

Code Sample, a copy-pastable example

import geopandas

df_1 = geopandas.GeoDataFrame(columns=["A", "B", "geom_1"], geometry="geom_1")
df_2 = geopandas.GeoDataFrame(columns=["B", "C", "geom_2"], geometry="geom_2")

joined_df = df_1.sjoin_nearest(df_2, lsuffix="left", rsuffix="right")
merged_df = df_1.merge(df_2, left_index=True, right_index=True, suffixes=["left", "right"])

print(f"{joined_df.columns[1]} vs. {merged_df.columns[1]}")
# Result: B_left vs. Bleft

Problem description

Currently sjoin_nearest adds a '_' between the column name and the suffix, while merge does not. The behaviour should be consistent between functions that otherwise behave in a somewhat similar way. The '_' gets added at line 263 in sjoin.py.
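As a stopgap, the two naming conventions described above can be reconciled on the caller's side by putting the underscore into the suffixes passed to merge. The sketch below is plain Python and only models the reported behaviour; the helper names (merge_style_suffixes, sjoin_style_name, merge_style_name) are hypothetical and are not part of geopandas or pandas.

```python
def merge_style_suffixes(lsuffix="left", rsuffix="right"):
    """Turn sjoin-style suffixes into the pair expected by merge(suffixes=...).

    Hypothetical helper: it prepends the "_" that sjoin inserts on its own,
    because merge appends the suffix strings to the column name verbatim.
    """
    return (f"_{lsuffix}", f"_{rsuffix}")


def sjoin_style_name(column, suffix):
    # Models the behaviour reported in this issue: name + "_" + suffix.
    return f"{column}_{suffix}"


def merge_style_name(column, suffix):
    # Models merge: the suffix is appended without any separator.
    return f"{column}{suffix}"


left, right = merge_style_suffixes("left", "right")
print(sjoin_style_name("B", "left"), "vs.", merge_style_name("B", left))
```

With this, df_1.merge(df_2, left_index=True, right_index=True, suffixes=merge_style_suffixes("left", "right")) should give the overlapping column the same name as the sjoin_nearest call in the code sample.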

The way the parameters are passed is also inconsistent: sjoin_nearest takes two separate named arguments (lsuffix and rsuffix), while merge takes a single suffixes sequence. The descriptions of both sjoin_nearest and merge state that they combine two GeoDataFrames.

Finally, I would like to add that sjoin, like merge, is not user-friendly: it takes its arguments in the form of *args and **kwargs, which makes it hard to know which arguments are valid. Combined with the inconsistent suffix handling, this causes issues that are hard to predict.

Expected Output

Either both ways add the "_" or both don't.

.sjoin_nearest(gdf, suffixes=[...])

Output of geopandas.show_versions()

SYSTEM INFO
-----------
python     : 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)]
executable : C:\Users\max\AppData\Local\Programs\Python\Python311\python.exe
machine    : Windows-10-10.0.19045-SP0

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.11.1
GEOS lib   : None
GDAL       : 3.5.2
GDAL data dir: C:\Users\max\AppData\Local\Programs\Python\Python311\Lib\site-packages\fiona\gdal_data
PROJ       : 9.2.0
PROJ data dir: C:\Users\max\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyproj\proj_dir\share\proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.13.2
numpy      : 1.24.2
pandas     : 2.0.0
pyproj     : 3.5.0
shapely    : 2.0.1
fiona      : 1.9.3
geoalchemy2: 0.13.2
geopy      : None
matplotlib : None
mapclassify: None
pygeos     : None
pyogrio    : None
psycopg2   : 2.9.6 (dt dec pq3 ext lo64)
pyarrow    : None
rtree      : None
Pavanmahaveer7 commented 1 year ago

Since both dataframes have a B column, we could specify the suffixes explicitly when joining: joined_df = gdf1.sjoin_nearest(gdf2, lsuffix=suffix_left, rsuffix=suffix_right, how="inner"). It would be better to pass a variable, as in lsuffix=suffix_left, rather than the fixed value lsuffix=left.