PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

deduplicate_neighbors crashing on lakes #36

Closed. initze closed this issue 5 months ago

initze commented 10 months ago

Hi Juliet and Robyn, I started testing deduplication on my lakes and it runs into an error.

Setup gdf

# Add properties (if they don't already exist)
gdf1['source_file'] = lakes05.name
gdf2['source_file'] = lakes06.name

gdf1['Date'] = '2020-01-01'
gdf2['Date'] = '2020-01-01'

# Combine the two files
gdf = pd.concat([gdf1.iloc[:10000], gdf2.iloc[:10000]])

gdf['Area'] = gdf.area
gdf['Centroid_x'] = gdf.centroid.x
gdf['Centroid_y'] = gdf.centroid.y

Run deduplication

# Deduplicate
output = deduplicate_neighbors(
    gdf,
    split_by='source_file',
    prop_area='Area_start_ha',
    prop_centroid_x='Centroid_x',
    prop_centroid_y='Centroid_y',
    keep_rules=['Area_start_ha', 'larger'], # 'Area_start_ha' is my area field
    overlap_tolerance=0.1,
    overlap_both=True,
    centroid_tolerance=5,
    distance_crs='EPSG:3413',
    return_intersections=True
)

Error Message

KeyError                                  Traceback (most recent call last)
Cell In [22], line 2
      1 # Deduplicate
----> 2 output = deduplicate_neighbors(
      3     gdf,
      4     split_by='source_file',
      5     prop_area='Area_start_ha',
      6     prop_centroid_x='Centroid_x',
      7     prop_centroid_y='Centroid_y',
      8     keep_rules=['Area_start_ha', 'larger'],
      9     overlap_tolerance=0.1,
     10     overlap_both=True,
     11     centroid_tolerance=5,
     12     distance_crs='EPSG:3413',
     13     return_intersections=True
     14 )

File Deduplicator.py:381, in deduplicate_neighbors(gdf, split_by, prop_area, prop_centroid_x, prop_centroid_y, keep_rules, overlap_tolerance, overlap_both, centroid_tolerance, distance_crs, return_intersections, prop_duplicated)
    379 if return_intersections:
    380     to_return['intersections'] = duplicates.copy()
--> 381     to_return['intersections'].drop(
    382         [prop_int_id, prop_id + '_1', prop_id + '_2'],
    383         axis=1, inplace=True
    384     )

File pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File pandas/core/frame.py:4957, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4809 @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"])
   4810 def drop(
   4811     self,
   (...)
   4818     errors: str = "raise",
   4819 ):
   4820     """
   4821     Drop specified labels from rows or columns.
   (...)
   4955             weight  1.0     0.8
   4956     """
-> 4957     return super().drop(
   4958         labels=labels,
   4959         axis=axis,
   4960         index=index,
   4961         columns=columns,
   4962         level=level,
   4963         inplace=inplace,
   4964         errors=errors,
   4965     )

File pandas/core/generic.py:4267, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4265 for axis, labels in axes.items():
   4266     if labels is not None:
-> 4267         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4269 if inplace:
   4270     self._update_inplace(obj)

File pandas/core/generic.py:4311, in NDFrame._drop_axis(self, labels, axis, level, errors, consolidate, only_slice)
   4309         new_axis = axis.drop(labels, level=level, errors=errors)
   4310     else:
-> 4311         new_axis = axis.drop(labels, errors=errors)
   4312     indexer = axis.get_indexer(new_axis)
   4314 # Case for non-unique axis
   4315 else:

File pandas/core/indexes/base.py:6661, in Index.drop(self, labels, errors)
   6659 if mask.any():
   6660     if errors != "ignore":
-> 6661         raise KeyError(f"{list(labels[mask])} not found in axis")
   6662     indexer = indexer[~mask]
   6663 return self.delete(indexer)

KeyError: "['intersect_id_e11f9db9897c43769e2a1b4405d83461', 'temp_id_e11f9db9897c43769e2a1b4405d83461_1', 'temp_id_e11f9db9897c43769e2a1b4405d83461_2'] not found in axis"
julietcohen commented 10 months ago

Hi Ingmar, thank you for providing the code and the error message! In our releases of the viz-staging package, returning intersections is not currently functional. That's why the default is set to False and documented here, but I should have gone a step further and added a warning when a user sets that argument to True. I apologize for not making it clearer that this functionality was removed.

The option to return intersections was originally built in when Robyn developed this deduplication method. Since then, I have changed the way we deduplicate a little; one of those edits changed the deduplication output to a labeled geodataframe rather than a dictionary. I'll open a new issue for re-integrating this functionality, since it is clearly useful.
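
For illustration, here is a minimal sketch (not from the thread) of running the deduplication with return_intersections left as False and inspecting the labeled output. The [property, rule] form of keep_rules and the prop_duplicated name 'staging_duplicated' follow the working script later in this thread; the 'Area', 'Centroid_x', and 'Centroid_y' columns are the ones created in Ingmar's setup above, so substitute your own fields as needed.

```
# Minimal sketch, not from the original thread: run the deduplication without
# intersections and inspect the labeled GeoDataFrame that is returned.
from pdgstaging.Deduplicator import deduplicate_neighbors

# gdf is the concatenated GeoDataFrame from the setup above
labeled = deduplicate_neighbors(
    gdf,
    split_by='source_file',
    prop_area='Area',                      # area column created in the setup
    prop_centroid_x='Centroid_x',
    prop_centroid_y='Centroid_y',
    keep_rules=[['Area', 'larger']],       # [property, rule] pairs
    overlap_tolerance=0.1,
    overlap_both=True,
    centroid_tolerance=5,
    distance_crs='EPSG:3413',
    return_intersections=False,            # intersections output is not currently functional
    prop_duplicated='staging_duplicated'   # name of the boolean flag column
)

# The output is a single labeled GeoDataFrame rather than a dictionary
print(type(labeled))
print(labeled['staging_duplicated'].value_counts())
```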

Are you able to check if this deduplication fits the needs of your dataset without the intersections geodataframe?

julietcohen commented 9 months ago

Ingmar passed on the data files for the 2 UTM zones he was using for testing the deduplication method, in parquet format as well as geopackage format. These have been uploaded to: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/

I will test the deduplication with these files.

julietcohen commented 5 months ago

This script may work. I used some of Ingmar's suggested parameters, and some defaults from the deduplication method itself.

test_dedup.py

```
# Check Neighbor dedup for Ingmar
# Issue: https://github.com/PermafrostDiscoveryGateway/viz-staging/issues/36
# access his files stored at `/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/`

from pathlib import Path
import os

# visual checks & vector data wrangling
import geopandas as gpd
import pandas as pd

# staging
import pdgstaging
from pdgstaging.Deduplicator import deduplicate_neighbors
from pdgstaging import ConfigManager, TilePathManager, TMSGrid

# rasterization & web-tiling
import pdgraster
from pdgraster import RasterTiler

# logging
from datetime import datetime
import logging
import logging.handlers
from pdgstaging import logging_config

# import 2 adjacent files
base_dir = '/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/'
dir = Path(base_dir)
ext = '*.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes, use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in dir.glob('**/' + ext)]

print("Reading in files...")
gdf_1 = gpd.read_file(input[0]) # 32618
gdf_2 = gpd.read_file(input[1]) # 32617

print("Cleaning files for NA values.")
# remove rows with NA values from the attribute of interest
gdf_1.dropna(subset = ['ChangeRateNet_myr-1'], inplace = True)
gdf_2.dropna(subset = ['ChangeRateNet_myr-1'], inplace = True)

# add source_file attribute
gdf_1["source_file"] = '32618_river'
gdf_2["source_file"] = '32617_river'
files = [gdf_1, gdf_2]

print("Concatenating data.")
# concatenate files into one GDF
gdf_combined = gpd.GeoDataFrame(pd.concat(files, ignore_index = True))

print("Deduplicating.")
gdf_dups_flagged = deduplicate_neighbors(
    gdf = gdf_combined,
    split_by = 'source_file',
    prop_area = 'Area_start_ha',
    keep_rules=['Area_start_ha', 'larger'],
    overlap_tolerance=0.1,
    overlap_both=True,
    centroid_tolerance=5,
    return_intersections=False
)

print("Saving deduplicated data to a single file.")
gdf_dups_flagged.to_file("/home/jcohen/check_dedup_forIN/clean/32617-32618_dups_flagged.gpkg", driver = "GPKG")

print("Script complete.")
```
julietcohen commented 5 months ago

@initze @kaylahardie @tcnichol I am happy to report that I wrote a script that flags duplicates across 2 adjacent geospatial files outside of the visualization workflow. The output is a geopackage with the same geometries as the (concatenated) input files, plus a boolean attribute staging_duplicated, where True means the row is a duplicate lake and False means the lake should be retained in the data.

Note that the script reads in two adjacent UTM zones (parquet files) that Ingmar passed on last week for testing the deduplication method. One important pre-processing step included here is adding a new attribute called source_file before executing the deduplication. This is relevant because our deduplication labeling only executes if the input data contains polygons from 2 different source files.
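
As a quick sketch of that requirement (not part of the script below), the precondition can be checked before calling the deduplicator; gdf_1 and gdf_2 stand for the two GeoDataFrames read from the parquet files:

```
# Sketch of the pre-processing requirement: the duplicate labeling only runs
# when the concatenated data contains polygons from at least 2 source files.
import geopandas as gpd
import pandas as pd

gdf_1["source_file"] = "32617_river"  # any label identifying the first file
gdf_2["source_file"] = "32618_river"  # any label identifying the second file

gdf_combined = gpd.GeoDataFrame(pd.concat([gdf_1, gdf_2], ignore_index=True))
assert gdf_combined["source_file"].nunique() >= 2, "need polygons from 2+ source files"
```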

You may test this approach yourself with different parameters for deduplicate_neighbors; the full script is below:

test_dedup.py

```
# Check Neighbor deduplication for Ingmar
# Issue: https://github.com/PermafrostDiscoveryGateway/viz-staging/issues/36

# visual checks & vector data wrangling
import geopandas as gpd
import pandas as pd

# staging
import pdgstaging
from pdgstaging.Deduplicator import deduplicate_neighbors
from pdgstaging import ConfigManager, TilePathManager, TMSGrid

# logging
from datetime import datetime
import logging
import logging.handlers
from pdgstaging import logging_config

gdf_1 = gpd.read_parquet("/var/data/submission/pdg/nitze_lake_change/identify_dups_sample_20240605/32617_river.parquet")
gdf_2 = gpd.read_parquet("/var/data/submission/pdg/nitze_lake_change/identify_dups_sample_20240605/32618_river.parquet")

print("Cleaning files for NA values.")
# remove rows with NA values from the attribute of interest
gdf_1.dropna(subset = ['ChangeRateNet_myr-1'], inplace = True)
gdf_2.dropna(subset = ['ChangeRateNet_myr-1'], inplace = True)

# add source_file attribute
gdf_1["source_file"] = '32617_river'
gdf_2["source_file"] = '32618_river'
files = [gdf_1, gdf_2]

print("Concatenating data.")
gdf_combined = gpd.GeoDataFrame(pd.concat(files, ignore_index = True))

print(f"CRS of input data: {gdf_combined.crs}\n Starting duplicate flagging.")
gdf_dups_flagged = deduplicate_neighbors(
    gdf = gdf_combined,
    split_by = 'source_file',
    prop_area = None,
    prop_centroid_x = None,
    prop_centroid_y = None,
    keep_rules = [["Perimeter_meter", "larger"]],
    overlap_tolerance = 0.1,
    overlap_both = False,
    centroid_tolerance = None,
    distance_crs = 'EPSG:3857',
    return_intersections = False,
    prop_duplicated = 'staging_duplicated'
)

sum_dups = gdf_dups_flagged['staging_duplicated'].sum()
print(f"Number of duplicate lakes: {sum_dups}")

print("Saving deduplicated data to a single file.")
gdf_dups_flagged.to_file("/home/jcohen/check_dedup_forIN/clean/32617-32618_dups_flagged.gpkg", driver = "GPKG")

print("Script complete.")
```

This issue can be re-opened if this approach does not work for Ingmar or Todd. With this output, you can take the ID values of the duplicate lakes and deduplicate the lake change data before it is passed on to me as input to the visualization workflow (a rough sketch of this filtering step is included below). Please let me know here or on Slack if you have questions.
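
For reference, a rough sketch of that downstream filtering step, assuming the flagged geopackage written by the script above; the 'id' column is a hypothetical placeholder for whatever identifier the lake change data actually uses:

```
# Sketch only: drop the flagged duplicates (and collect their IDs) before the
# data is passed on to the visualization workflow.
# 'id' is a hypothetical identifier column; substitute the real ID field.
import geopandas as gpd

flagged = gpd.read_file("/home/jcohen/check_dedup_forIN/clean/32617-32618_dups_flagged.gpkg")

dup_ids = flagged.loc[flagged["staging_duplicated"], "id"].tolist()   # IDs of duplicate lakes
deduped = flagged[~flagged["staging_duplicated"]].copy()              # lakes to retain

deduped.to_file("32617-32618_deduplicated.gpkg", driver="GPKG")
```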