Closed · julietcohen closed this issue 1 year ago
`clip_gdf()` returns a dict of 2 values: `keep` and `removed`. `clip_gdf()` is called within `deduplicate_by_footprint()`, which is defined in viz-staging's `Deduplicator.py` script.

`deduplicate_by_footprint()` (defined here) is the step in which polygons are flagged as duplicates (True or False) with the function `label_duplicates()` (defined here).

`combine_and_deduplicate()` runs `deduplicate_by_footprint()` if the config is set to dedup at that step, which is checked when we call `get_deduplication_method()`.
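For reference, a minimal sketch of the shape of `clip_gdf()`'s return value and how a caller splits it. The values here are hypothetical stand-ins (plain lists); the real values are GeoDataFrames.

```python
# Toy stand-in for clip_gdf()'s return value: a dict with the rows to
# keep and the rows that were clipped away.
clip_result = {
    "keep": ["poly_1", "poly_2"],  # polygons inside the footprint
    "removed": ["poly_3"],         # polygons clipped away at the edge
}

# How a caller splits the two pieces:
kept = clip_result["keep"]
removed = clip_result["removed"]
```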
`get_deduplication_method()` is defined here and called 3 times in `TileStager.py`: when we call `add_properties()`, when we `save_tiles()`, and when we `combine_and_deduplicate()`. But we do not deduplicate IWP during staging (we do during the raster step).
`get_deduplication_method()` within `add_properties()` is certainly not an issue; this call is just present to ensure that the value of the deduplication method in the config is not None. If it is not None, then the property 'duplicated' is added to each staged file and every row is set to False, because we have not yet checked whether each row (polygon) is a duplicate. This is one of Robyn's additions to viz-staging on 10/28.

`get_deduplication_method()` within `save_tiles()` just checks that the deduplication method is not None, and then moves forward with executing `combine_and_deduplicate()` if it is not None. During staging for IWP, deduplication does not occur.

`get_deduplication_method()` is called in `combine_and_deduplicate()`: if the method is None we just return the same gdf, but there is no else branch; if it is not None, `staged_deduplicated` is set to True.
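The guard described above can be sketched like this. The names and the config access are hypothetical simplifications, not the actual viz-staging code:

```python
def combine_and_deduplicate(gdf, config):
    """Sketch of the guard: deduplication only runs when a
    deduplication method is configured."""
    method = config.get("deduplication_method")  # stands in for get_deduplication_method()
    if method is None:
        # no method configured: return the same gdf; there is no else branch
        return gdf
    # ...deduplication would happen here...
    config["staged_deduplicated"] = True  # flag that staging-time dedup ran
    return gdf
```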
As noted above, the function `add_properties()` adds the property 'duplicated' to each gdf (which is written out as 'staging_duplicated') and assigns every row False. The intent is that later, when duplicates are identified, the values in those rows are flipped to True.
Plan of attack: stage one Russia shp file (from the same subset of data used in the last test run on Delta) and inspect the property 'staging_duplicated' in all rows of all the resulting staged files. If deduplication is working properly, we would expect some True values, especially since plotting this shapefile over its footprint shows that some polygons "hang" over the edge. That said, True values are likely but not certain: the polygons that visibly hang over the edge in the plot below would be removed by clipping, not by being labeled as duplicates.
The `any()` function checks each value in an iterable and returns True if any value is True, and False if all values are False.
Staging the first shp file in the Russia shp file subset resulted in 715 staged gpkg files. Collect all these filepaths into a list, open each as a gdf in geopandas, and check whether the column of interest (`staging_duplicated`) contains any True values.
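The check can be sketched as below. The geopandas read is deferred into a helper, and the directory path and helper names are hypothetical:

```python
from pathlib import Path

def any_true(values):
    """any() returns True if at least one value is True, and False if
    all values are False (or the iterable is empty)."""
    return any(values)

def files_with_duplicates(paths, col="staging_duplicated"):
    """Return staged .gpkg files whose duplicate-flag column contains
    at least one True. Requires geopandas (imported lazily)."""
    import geopandas as gpd
    return [p for p in paths if any_true(gpd.read_file(p)[col])]

# Hypothetical usage against a staged output directory:
# paths = sorted(Path("staged").rglob("*.gpkg"))
# print(len(files_with_duplicates(paths)), "of", len(paths), "files have a True value")
```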
Output:
There are 715 staged files resulting from the one input shp file.
Is the sum of the two lists 715? True
The number of files with at least 1 True value detected is 0.
0 files means that no polygons were identified as duplicates in any of the 715 staged files.
The plot thickens!
Change the config to `"deduplicate_at": ["staging"]` instead of `"raster"` to determine whether this results in duplicates being identified in any rows of the staged files.
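For reference, the relevant fragment of the config for this experiment might look like the following. Only the two keys discussed in this thread are shown; the full config used for the Delta runs has many more options:

```json
{
  "deduplicate_at": ["staging"],
  "deduplicate_clip_to_footprint": true
}
```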
Output:
There are 715 staged files resulting from the one input shp file.
Is the sum of the two lists 715? True
The number of files with at least 1 True value detected is 0.
0 files means that no polygons were identified as duplicates in any of the 715 staged files.
Same as last time.
The part of `combine_and_deduplicate()` that removes the rows labeled True works as expected. Looking at the code that labels the rows as True before we execute `combine_and_deduplicate()`:
The function `label_duplicates()` includes an input parameter `deduplicate_output`, which in our case is the output of the function `deduplicate_by_footprint()`, because we are using the footprint approach rather than the neighbor approach. `deduplicate_output` should be a dictionary, since the first step this function executes is separating the contents of `deduplicate_output` into `not_duplicates` and `duplicates` like so:

```python
not_duplicates = deduplicate_output['keep']
duplicates = deduplicate_output['removed']
```
`deduplicate_by_footprint()` has a parameter `label` that is set to True by default. This parameter is documented as:

> label : bool, optional
> Set to True (default) to return the input GDF with the polygons identified as duplicates labeled as such. The column name that will be used to flag duplicates is set with the `prop_duplicated` option.
`deduplicate_by_footprint()` executes the function `label_duplicates()` if `label` is True.
As mentioned, an input parameter of `label_duplicates()` is `to_return`, which is an output of `deduplicate_by_footprint()`, defined here just before it executes `label_duplicates()`. `to_return` is indeed a dictionary that contains 'keep', 'removed', and 'intersections'. The output of `label_duplicates()` overrides the input `to_return` with a new `to_return`. `label_duplicates()` calls the 'keep' values `not_duplicates` and the 'removed' values `duplicates`.
`clip_gdf()` is executed before labeling within `deduplicate_by_footprint()` if there is more than one group in the gdf, meaning more than one input shp filename (see the dedup config here for the meaning of `split_by`), but `clip_gdf()` is executed after labeling if only 1 group was identified. Since for this debugging I am using just 1 input shp file, the labeling is occurring here, because "if there is only 1 file, there is nothing to deduplicate"... and is the rest of `deduplicate_by_footprint()` then not executed? (need to double check that!)
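The branch under suspicion might be sketched like this. This is a toy simplification (hypothetical names, lists instead of GeoDataFrames), not the real `Deduplicator.py` code:

```python
def deduplicate_by_footprint(gdf, groups, label=True, clip_to_footprint=False):
    """Toy sketch of the suspected control flow: with only one group
    (one input shp file), labeling happens but the clipping path may
    never be reached."""
    if len(groups) < 2:
        # "if there is only 1 file, there is nothing to deduplicate"
        labeled = [(poly, False) for poly in gdf] if label else [(poly, None) for poly in gdf]
        # question from the thread: is the rest of the function
        # (including clip_gdf()) skipped entirely on this path?
        return {"keep": labeled, "removed": []}
    # with 2+ groups, clip_gdf() runs *before* labeling:
    # gdf = clip_gdf(gdf)["keep"]
    return {"keep": [(poly, False) for poly in gdf], "removed": []}
```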
The parameter `clip_to_footprint` is set to False by default in this function, and I wonder whether setting `"deduplicate_clip_to_footprint": True` in the config overrides this default.
Inserted logging statements into `Deduplicator.py`'s `clip_gdf()` to indicate when clipping occurs.

The input file has 50955 polygons. This was derived by reading the one input shp file as a gdf and checking its length, and it is confirmed by the logging statement that checks the length after the gdf is filtered for polygon geoms only.
All output staged files' rows summed up is 49335 (number of polygons).

After the config change, all output staged files' rows summed up is 49335 again. Also inserted logging statements into `deduplicate_by_footprint()`. Same outputs. I am confused why the logging statements that I inserted into `Deduplicator.py` are not generated in `log.log`. Yet there are still fewer polygons after staging than in the initial input shp file.
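One possible explanation for the missing log lines is that the logger used inside `Deduplicator.py` has no handler attached, or its level filters the messages out. A minimal pattern (the logger name here is a guess, not the package's actual logger):

```python
import logging

log = logging.getLogger("pdgstaging.Deduplicator")  # hypothetical logger name

def clip_gdf_with_logging(gdf):
    """Example of the kind of debug line inserted; if the logger has no
    handler or its effective level is above INFO, the message never
    reaches log.log, which would explain the missing statements."""
    log.info("clip_gdf called with %d rows", len(gdf))
    return gdf

# A handler and level must be configured once, e.g.:
# logging.basicConfig(filename="log.log", level=logging.INFO)
```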
The relationship between `label_duplicates()`, the number of input shp files within one gdf, and the function `clip_gdf()` is the source of the missing clipping-to-footprint/deduplication. Notes on how these interact are in this thread above, here. Essentially, the gdf is not clipped to footprint if it contains polygons from only 1 input shp file ("group"). This is the case every time we save a file that doesn't already exist.
Robyn suggested that the same problem is also present here: we only run the `combine_and_deduplicate` method if we are adding polygons into an existing tile. She suggested that we improve this approach by clipping the entire file to the footprint within the TileStager before we save the tiles here.
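Robyn's suggestion could look roughly like this inside the TileStager. The function and parameter names are hypothetical; `geopandas.clip(gdf, mask)` is a real geopandas API, but the stager/footprint wiring here is assumed:

```python
def stage_with_clipping(gdf, footprint, save_tiles, clip=None):
    """Sketch of the proposed fix: clip the entire file to its footprint
    *before* tiles are saved, so clipping no longer depends on a tile
    file already existing."""
    if clip is None:
        import geopandas as gpd  # deferred import; requires geopandas
        clip = gpd.clip
    clipped = clip(gdf, footprint)  # drop/trim geometry outside the footprint
    save_tiles(clipped)             # only clipped polygons reach the tiles
    return clipped
```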
As outlined in this comment in issue #19, it seems that setting `"deduplicate_clip_to_footprint": True` in the config did not result in the desired visualization in the web tiles; the resulting web tiles still showed overlap at the edges. The config used for this IWP test run is here.