bcgov / FIT_opendatadownloader

Monitor open data, report when changes are detected
Apache License 2.0

add change detection code #5

Closed smnorris closed 2 months ago

smnorris commented 2 months ago

Add a general tool to generate diffs between two datasets (geodataframes) based on a common key.

For downloaded data, the key is taken from the layer config's primary_key value, but note that many datasets will not have a simple integer/text/uuid primary key.

So, in order of priority, run the comparison based on:

  1. provided primary key
  2. when pk is provided but not unique, issue a warning, then continue with geometry added to the pk
  3. when pk is not provided, use geometry as the pk
  4. if pk is still not unique when geometry is included, bail (complete duplicates will already be filtered by #6, so duplicates here would indicate weird overlaps in the data that should be investigated)
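
A rough sketch of that priority order, to make the fallbacks concrete. To keep it dependency-free, features are plain dicts with geometry held as a WKT string rather than rows of a geodataframe. `resolve_key`, `geom_field` and the record layout are hypothetical names for illustration, not the tool's actual API.

```python
import warnings
from collections import Counter


def resolve_key(records, primary_key=None, geom_field="geometry"):
    """Return the list of fields to compare on, following the priority above.

    records: list of dicts, one per feature (assumed stand-in for
    geodataframe rows); geometry is a WKT string so it is hashable.
    """

    def is_unique(fields):
        counts = Counter(tuple(r[f] for f in fields) for r in records)
        return all(n == 1 for n in counts.values())

    if primary_key:
        key = list(primary_key)
        if is_unique(key):
            return key  # 1. provided pk is enough
        warnings.warn(f"primary key {key} is not unique, adding geometry to the key")
        key = key + [geom_field]  # 2. pk plus geometry
    else:
        key = [geom_field]  # 3. no pk supplied, geometry alone

    if not is_unique(key):
        # 4. still duplicated, stop so the data can be investigated
        raise ValueError(f"key {key} is not unique even with geometry included")
    return key


features = [
    {"id": 1, "geometry": "POINT (0 0)"},
    {"id": 1, "geometry": "POINT (1 1)"},
]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(resolve_key(features, ["id"]))
```

With a real geodataframe the same checks could presumably be done with pandas' `duplicated`, after converting the geometry column to something hashable such as WKB.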

When using geoms as the pk, try reducing precision slightly via https://geopandas.org/en/latest/docs/reference/api/geopandas.GeoSeries.set_precision.html#geopandas-geoseries-set-precision
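
To illustrate why a small precision reduction helps, here is a plain-Python stand-in that snaps coordinates to a grid before using them as a key. It is not the geopandas call itself (that would be `GeoSeries.set_precision` with a grid size); `snap_key` and the 0.01 grid size are assumptions chosen for the example.

```python
def snap_key(coords, grid_size=0.01):
    """Build a hashable key from (x, y) pairs snapped to a grid.

    Each coordinate is replaced by the integer index of its nearest grid
    line, so tiny floating point differences between two downloads of the
    same feature collapse to the same key.
    """
    return tuple((round(x / grid_size), round(y / grid_size)) for x, y in coords)


# The same line as it might appear in two successive downloads,
# differing only by sub-grid noise.
previous = [(1.00001, 2.00002), (3.0, 4.0)]
current = [(1.00002, 2.00001), (3.0, 4.0)]

print(previous == current)                      # raw coordinates differ
print(snap_key(previous) == snap_key(current))  # snapped keys match
```

The grid size needs to be small relative to real edits, otherwise genuine geometry changes would be hidden along with the noise.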

The reason for using geom as a pk when comparing datasets is to detect modifications to attributes. Of course, this means that where no other pk is used, a modification to a geom will be reported as a deletion plus an addition (we can't detect modifications to geoms if there is no pk).
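
A minimal sketch of the comparison itself, showing both cases when geometry is the only key. As before, this uses plain dicts with WKT geometry strings instead of geodataframes, and `diff_records` is a hypothetical helper, not the code that was added for this issue.

```python
def diff_records(old, new, key):
    """Classify records as additions, deletions or modifications by key.

    old, new: lists of dicts (assumed stand-in for geodataframe rows).
    key: list of field names that identify a record.
    """

    def index(records):
        return {tuple(r[f] for f in key): r for r in records}

    before, after = index(old), index(new)
    additions = [after[k] for k in after if k not in before]
    deletions = [before[k] for k in before if k not in after]
    modifications = [
        (before[k], after[k]) for k in before if k in after and before[k] != after[k]
    ]
    return additions, deletions, modifications


old = [
    {"name": "A", "geometry": "POINT (0 0)"},
    {"name": "B", "geometry": "POINT (1 1)"},
]
new = [
    {"name": "A2", "geometry": "POINT (0 0)"},  # attribute changed, geom unchanged
    {"name": "B", "geometry": "POINT (1 2)"},   # geom changed, attributes unchanged
]

additions, deletions, modifications = diff_records(old, new, ["geometry"])
print(additions)      # the moved "B" shows up as new
print(deletions)      # the original "B" shows up as removed
print(modifications)  # the rename of "A" is detected as a modification
```

Here the rename is caught as a modification because the geometry still matches, while the moved point cannot be paired up and is reported as one deletion and one addition.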