PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

deduplicate_neighbors crashing on lakes #36

Open initze opened 3 months ago

initze commented 3 months ago

Hi Juliet and Robyn, I started testing deduplication on my lakes and ran into an error.

Set up the gdf

import pandas as pd  # gdf1 and gdf2 are GeoDataFrames loaded beforehand

# Add properties (if they don't already exist)
gdf1['source_file'] = lakes05.name
gdf2['source_file'] = lakes06.name

gdf1['Date'] = '2020-01-01'
gdf2['Date'] = '2020-01-01'

# Combine the two files
gdf = pd.concat([gdf1.iloc[:10000], gdf2.iloc[:10000]])

gdf['Area'] = gdf.area
# Compute centroids from the combined gdf; the original snippet used gdf1
# here, which no longer aligns with the concatenated frame
gdf['Centroid_x'] = gdf.centroid.x
gdf['Centroid_y'] = gdf.centroid.y

Run deduplication

# Deduplicate
output = deduplicate_neighbors(
    gdf,
    split_by='source_file',
    prop_area='Area_start_ha',
    prop_centroid_x='Centroid_x',
    prop_centroid_y='Centroid_y',
    keep_rules=['Area_start_ha', 'larger'], # 'Area_start_ha' is my area field
    overlap_tolerance=0.1,
    overlap_both=True,
    centroid_tolerance=5,
    distance_crs='EPSG:3413',
    return_intersections=True
)

Error Message

KeyError                                  Traceback (most recent call last)
Cell In [22], line 2
      1 # Deduplicate
----> 2 output = deduplicate_neighbors(
      3     gdf,
      4     split_by='source_file',
      5     prop_area='Area_start_ha',
      6     prop_centroid_x='Centroid_x',
      7     prop_centroid_y='Centroid_y',
      8     keep_rules=['Area_start_ha', 'larger'],
      9     overlap_tolerance=0.1,
     10     overlap_both=True,
     11     centroid_tolerance=5,
     12     distance_crs='EPSG:3413',
     13     return_intersections=True
     14 )

File Deduplicator.py:381, in deduplicate_neighbors(gdf, split_by, prop_area, prop_centroid_x, prop_centroid_y, keep_rules, overlap_tolerance, overlap_both, centroid_tolerance, distance_crs, return_intersections, prop_duplicated)
    379 if return_intersections:
    380     to_return['intersections'] = duplicates.copy()
--> 381     to_return['intersections'].drop(
    382         [prop_int_id, prop_id + '_1', prop_id + '_2'],
    383         axis=1, inplace=True
    384     )

File pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File pandas/core/frame.py:4957, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4809 @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"])
   4810 def drop(
   4811     self,
   (...)
   4818     errors: str = "raise",
   4819 ):
   4820     """
   4821     Drop specified labels from rows or columns.
   (...)
   4955             weight  1.0     0.8
   4956     """
-> 4957     return super().drop(
   4958         labels=labels,
   4959         axis=axis,
   4960         index=index,
   4961         columns=columns,
   4962         level=level,
   4963         inplace=inplace,
   4964         errors=errors,
   4965     )

File pandas/core/generic.py:4267, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4265 for axis, labels in axes.items():
   4266     if labels is not None:
-> 4267         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4269 if inplace:
   4270     self._update_inplace(obj)

File pandas/core/generic.py:4311, in NDFrame._drop_axis(self, labels, axis, level, errors, consolidate, only_slice)
   4309         new_axis = axis.drop(labels, level=level, errors=errors)
   4310     else:
-> 4311         new_axis = axis.drop(labels, errors=errors)
   4312     indexer = axis.get_indexer(new_axis)
   4314 # Case for non-unique axis
   4315 else:

File pandas/core/indexes/base.py:6661, in Index.drop(self, labels, errors)
   6659 if mask.any():
   6660     if errors != "ignore":
-> 6661         raise KeyError(f"{list(labels[mask])} not found in axis")
   6662     indexer = indexer[~mask]
   6663 return self.delete(indexer)

KeyError: "['intersect_id_e11f9db9897c43769e2a1b4405d83461', 'temp_id_e11f9db9897c43769e2a1b4405d83461_1', 'temp_id_e11f9db9897c43769e2a1b4405d83461_2'] not found in axis"
julietcohen commented 3 months ago

Hi Ingmar, thank you for providing the code and the error message! In our releases of the viz-staging package, returning intersections is not currently functional. That's why the default is set to False and documented here, but I should have gone a step further and warned the user when that argument is set to True. I apologize for not making it clearer that this functionality was removed.
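
In practice, the call above should run with return_intersections simply omitted, since it defaults to False (a sketch with all other arguments unchanged):

output = deduplicate_neighbors(
    gdf,
    split_by='source_file',
    prop_area='Area_start_ha',
    prop_centroid_x='Centroid_x',
    prop_centroid_y='Centroid_y',
    keep_rules=['Area_start_ha', 'larger'],
    overlap_tolerance=0.1,
    overlap_both=True,
    centroid_tolerance=5,
    distance_crs='EPSG:3413',
    # return_intersections left at its default of False
)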

The option to return intersections was originally built in when Robyn developed this deduplication method. Since then, I have changed the deduplication approach slightly; one of my edits changed the output from a dictionary to a labeled GeoDataFrame. I'll open a new issue for re-integrating the intersections functionality, since it is clearly useful.
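
With the labeled output, keeping the non-duplicate rows amounts to filtering on the flag column (a sketch; the column name is controlled by the prop_duplicated argument in the signature shown in the traceback, and 'staging_duplicated' is an assumption here):

# 'staging_duplicated' is an assumed default for prop_duplicated
deduped = output[~output['staging_duplicated']]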

Are you able to check if this deduplication fits the needs of your dataset without the intersections geodataframe?

julietcohen commented 2 months ago

Ingmar passed on the data files for the two UTM zones he was using to test the deduplication method, in both Parquet and GeoPackage formats. These have been uploaded to: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/

I will test the deduplication with these files.
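
For reference, loading one of the Parquet samples would look something like this (a sketch; only the directory comes from the path above, and the file name is a placeholder):

import geopandas as gpd

# Directory from the comment above; the file name is hypothetical
gdf = gpd.read_parquet(
    '/var/data/submission/pdg/nitze_lake_change/'
    'data_sample_parquet_20240219/lakes_sample.parquet'
)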