Closed initze closed 5 months ago
Hi Ingmar, thank you for providing code and the error message! In our releases of the viz-staging package, returning intersections is not currently functional. That's why I have the default set to False and documented it here, but clearly I should have taken the extra step to introduce a warning for the user if they set the value of that argument to True. I apologize for not making it more clear that this functionality was removed.
The option to return intersections was originally built in when Robyn developed this deduplication method. Since then, I changed the way we deduplicate a little. One of the edits I made was changing the deduplication output to be a labeled geodataframe, rather than a dictionary. I'll open a new issue for re-integrating this functionality, since clearly it is indeed useful.
Are you able to check if this deduplication fits the needs of your dataset without the intersections geodataframe?
Ingmar passed on the data files for the 2 UTM zones he was using for testing the deduplication method, in parquet format as well as geopackage format. These have been uploaded to: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/
I will test the deduplication with these files.
This script may work. I used some of Ingmar's suggested parameters, and some defaults from the deduplication method itself.
@initze @kaylahardie @tcnichol I am happy to report that I wrote a script that flags duplicates in 2 adjacent geospatial files outside of the visualization workflow. This script's output file is a geopackage that has the same geometries as the input files (concatenated), but the output file also has a boolean attribute staging_duplicated, where True marks a row as a duplicate lake and False marks a lake we should retain in the data.
Note that the script reads in two adjacent UTM zones (parquet files) that Ingmar passed on last week for testing the deduplication method. One important pre-processing step included here is adding a new attribute called source_file before executing the deduplication. This is relevant because our deduplication labeling only executes if the input data contains polygons from 2 different source files.
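A minimal sketch of that pre-processing step, for anyone adapting it. It uses plain pandas with tiny in-memory tables so it runs on its own; in practice the frames would be GeoDataFrames (for example read with geopandas.read_parquet), which accept the same column assignment and pd.concat call. The file names and the id and Area_ha columns are made up for illustration and are not taken from Ingmar's data.

```python
import pandas as pd


def label_and_concat(frames_by_file):
    """Tag each table with the file it came from, then stack them.

    frames_by_file maps a (hypothetical) file name to its table.
    """
    labeled = []
    for file_name, frame in frames_by_file.items():
        frame = frame.copy()               # leave the caller's table untouched
        frame["source_file"] = file_name   # attribute the deduplication splits on
        labeled.append(frame)
    return pd.concat(labeled, ignore_index=True)


# Stand-ins for the two adjacent UTM zones (hypothetical names and values).
zone_a = pd.DataFrame({"id": [1, 2], "Area_ha": [3.5, 1.2]})
zone_b = pd.DataFrame({"id": [3], "Area_ha": [0.8]})

combined = label_and_concat({"zone_a.parquet": zone_a, "zone_b.parquet": zone_b})

# Labeling only makes sense when there are 2 distinct sources.
n_sources = combined["source_file"].nunique()
```

Checking n_sources before running the deduplication is a cheap guard against passing in data from a single file, in which case no labeling would happen.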
You may test this approach yourself with different parameters for deduplicate_neighbors, but remember the following:
- return_intersections must be False for the current version of the viz-staging package
- keep_rules
- prop_area needs to be None because the area values in your data attributes may not be in the same units as the CRS

This issue can be re-opened if this approach does not work for Ingmar or Todd. With the output from this approach, you can take the ID values for the duplicate lakes and deduplicate the lake change data before passing the data on to me to use as input into the visualization workflow. Please let me know here or on Slack if you have questions.
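For the last step, taking the IDs of the flagged lakes and dropping them from the lake change data, a rough sketch could look like the following. Only the staging_duplicated attribute comes from the script's output described above; the lake_id and net_change columns and all values are placeholders, so swap in whatever ID attribute your data actually uses.

```python
import pandas as pd

# Stand-in for the labeled output of the deduplication script.
# "staging_duplicated" is the real flag; "lake_id" is a hypothetical ID column.
labeled = pd.DataFrame(
    {
        "lake_id": [101, 102, 103, 104],
        "staging_duplicated": [False, True, False, True],
    }
)

# IDs of the rows flagged as duplicate lakes.
duplicate_ids = set(labeled.loc[labeled["staging_duplicated"], "lake_id"])

# Stand-in for the lake change data that shares the same IDs.
lake_change = pd.DataFrame(
    {
        "lake_id": [101, 102, 103, 104],
        "net_change": [0.4, 0.4, -1.1, -1.1],
    }
)

# Keep only the lakes that were not flagged as duplicates.
deduplicated = lake_change[~lake_change["lake_id"].isin(duplicate_ids)].reset_index(drop=True)
```

Filtering by ID rather than by row position keeps this safe even if the lake change table is sorted differently from the labeled output.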
Hi Juliet and Robyn, I started testing deduplication on my lakes, and it runs into an error.
Setup gdf
Run deduplication
Error Message