Closed · julietcohen closed this issue 1 year ago
`clip_gdf()` returns a dict of 2 values: `keep` and `removed`. `clip_gdf()` is called within `deduplicate_by_footprint()`, which is defined in viz-staging's `Deduplicator.py` script.

`deduplicate_by_footprint()` (defined here) is the step in which polygons are flagged as duplicates (True or False) with the function `label_duplicates()` (defined here).

`combine_and_deduplicate()` runs `deduplicate_by_footprint()` if the config is set to dedup at that step, which is checked when we call `get_deduplication_method()`.
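For reference, a minimal sketch of the shape of `clip_gdf()`'s return value and how a caller splits it. The values here are hypothetical stand-ins (plain lists); the real values are GeoDataFrames.

```python
# Toy stand-in for clip_gdf()'s return value: a dict with the rows to
# keep and the rows that were clipped away.
clip_result = {
    "keep": ["poly_1", "poly_2"],  # polygons inside the footprint
    "removed": ["poly_3"],         # polygons clipped away at the edge
}

# How a caller splits the two pieces:
kept = clip_result["keep"]
removed = clip_result["removed"]
```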
`get_deduplication_method()` is defined here and called 3 times in `TileStager.py`: when we call `add_properties()`, when we `save_tiles()`, and when we `combine_and_deduplicate()`. But we do not deduplicate IWP during staging (we do during the raster step).
`get_deduplication_method()` within `add_properties()` is certainly not an issue; this call is just present to ensure that the value of the deduplication method in the config is not None. If it is not None, then the property 'duplicated' is added to each staged file and every row is set to False, because we have not yet checked whether each row (polygon) is a duplicate. This is one of Robyn's additions to viz-staging on 10/28.

`get_deduplication_method()` within `save_tiles()` just checks that the deduplication method is not None, and then moves forward with executing `combine_and_deduplicate()` if it is not None. During staging for IWP, deduplication does not occur.

`get_deduplication_method()` is called in `combine_and_deduplicate()`: if the method is None we just return the same gdf, but there is no else branch; if it is not None, `staged_deduplicated` is set to True.
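The guard described above can be sketched like this. The names and the config access are hypothetical simplifications, not the actual viz-staging code:

```python
def combine_and_deduplicate(gdf, config):
    """Sketch of the guard: deduplication only runs when a
    deduplication method is configured."""
    method = config.get("deduplication_method")  # stands in for get_deduplication_method()
    if method is None:
        # no method configured: return the same gdf; there is no else branch
        return gdf
    # ...deduplication would happen here...
    config["staged_deduplicated"] = True  # flag that staging-time dedup ran
    return gdf
```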
As noted above, the function `add_properties()` adds the property 'duplicated' to each gdf (which is written out as 'staging_duplicated') and assigns every row False. The intent is that later, when duplicates are identified, the values in those rows are flipped to True.
Plan of attack: stage one Russia shp file (from the same subset of data used in the last test run on Delta) and inspect the property 'staging_duplicated' in all rows of all the resulting staged files. If deduplication is working properly, we would expect some True values, especially since plotting this shapefile over its footprint shows that some polygons "hang" over the edge. That said, True values are likely but not certain: the polygons that visibly hang over the edge in the plot below would be removed by clipping, not by being labeled as duplicates.
The `any()` function checks each value in an iterable and returns True if any value is True, and False if all values are False.
Staging the first shp file in the Russia shp file subset resulted in 715 staged gpkg files. Collect all these filepaths into a list, open each as a gdf in geopandas, and check whether the column of interest (`staging_duplicated`) contains any True values.
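The check can be sketched as below. The geopandas read is deferred into a helper, and the directory path and helper names are hypothetical:

```python
from pathlib import Path

def any_true(values):
    """any() returns True if at least one value is True, and False if
    all values are False (or the iterable is empty)."""
    return any(values)

def files_with_duplicates(paths, col="staging_duplicated"):
    """Return staged .gpkg files whose duplicate-flag column contains
    at least one True. Requires geopandas (imported lazily)."""
    import geopandas as gpd
    return [p for p in paths if any_true(gpd.read_file(p)[col])]

# Hypothetical usage against a staged output directory:
# paths = sorted(Path("staged").rglob("*.gpkg"))
# print(len(files_with_duplicates(paths)), "of", len(paths), "files have a True value")
```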
Output:
There are 715 staged files resulting from the one input shp file.
Is the sum of the two lists 715? True
The number of files with at least 1 True value detected is 0.
0 files means that no polygons were identified as duplicates in any of the 715 staged files.
The plot thickens!
Change the config to `"deduplicate_at": ["staging"]` instead of `"raster"` to determine whether this results in duplicates being identified in any rows of the staged files.
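For reference, the relevant fragment of the config for this experiment might look like the following. Only the two keys discussed in this thread are shown; the full config used for the Delta runs has many more options:

```json
{
  "deduplicate_at": ["staging"],
  "deduplicate_clip_to_footprint": true
}
```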
Output:
There are 715 staged files resulting from the one input shp file.
Is the sum of the two lists 715? True
The number of files with at least 1 True value detected is 0.
0 files means that no polygons were identified as duplicates in any of the 715 staged files.
Same as last time.
The part of `combine_and_deduplicate()` that removes the rows labeled True works as expected. Looking at the code that labels the rows as True before we execute `combine_and_deduplicate()`:
The function `label_duplicates()` includes an input parameter `deduplicate_output`, which in our case is the output of the function `deduplicate_by_footprint()`, because we are using the footprint approach rather than the neighbor approach. `deduplicate_output` should be a dictionary, since the first step this function executes is separating the contents of `deduplicate_output` into `not_duplicates` and `duplicates` like so:

```python
not_duplicates = deduplicate_output['keep']
duplicates = deduplicate_output['removed']
```
`deduplicate_by_footprint()` has a parameter `label` that is set to True by default. This parameter is documented as:

> label : bool, optional
> Set to True (default) to return the input GDF with the polygons identified as duplicates labeled as such. The column name that will be used to flag duplicates is set with the `prop_duplicated` option.
`deduplicate_by_footprint()` executes the function `label_duplicates()` if `label` is True.
As mentioned, an input parameter of `label_duplicates()` is `to_return`, which is an output of `deduplicate_by_footprint()`, defined here just before it executes `label_duplicates()`. `to_return` is indeed a dictionary that contains 'keep', 'removed', and 'intersections'. The output of `label_duplicates()` overrides the input `to_return` with a new `to_return`. `label_duplicates()` calls the 'keep' values `not_duplicates` and the 'removed' values `duplicates`.
`clip_gdf()` is executed before labeling within `deduplicate_by_footprint()` if there is more than one group in the gdf, meaning more than one input shp filename (see the dedup config here for the meaning of `split_by`), but `clip_gdf()` is executed after labeling if only 1 group was identified. Since for this debugging I am using just 1 input shp file, the labeling is occurring here, because "if there is only 1 file, there is nothing to deduplicate"... and is the rest of `deduplicate_by_footprint()` then not executed? (need to double check that!)
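The branch under suspicion might be sketched like this. This is a toy simplification (hypothetical names, lists instead of GeoDataFrames), not the real `Deduplicator.py` code:

```python
def deduplicate_by_footprint(gdf, groups, label=True, clip_to_footprint=False):
    """Toy sketch of the suspected control flow: with only one group
    (one input shp file), labeling happens but the clipping path may
    never be reached."""
    if len(groups) < 2:
        # "if there is only 1 file, there is nothing to deduplicate"
        labeled = [(poly, False) for poly in gdf] if label else [(poly, None) for poly in gdf]
        # question from the thread: is the rest of the function
        # (including clip_gdf()) skipped entirely on this path?
        return {"keep": labeled, "removed": []}
    # with 2+ groups, clip_gdf() runs *before* labeling:
    # gdf = clip_gdf(gdf)["keep"]
    return {"keep": [(poly, False) for poly in gdf], "removed": []}
```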
The parameter `clip_to_footprint` is set to False by default in this function, and I wonder whether setting `"deduplicate_clip_to_footprint": True` in the config overrides this default.
Inserted logging statements into `Deduplicator.py`'s `clip_gdf()` to indicate when clipping occurs.

The input file has 50955 polygons. This was derived by reading the one input shp file as a gdf and checking its length, and it is confirmed by the logging statement that checks the length after the gdf is filtered for polygon geoms only.
All output staged files' rows summed up is 49335 (number of polygons).

After the config change, all output staged files' rows summed up is 49335 again. Also inserted logging statements into `deduplicate_by_footprint()`. Same outputs. I am confused why the logging statements that I inserted into `Deduplicator.py` are not generated in `log.log`. Yet there are still fewer polygons after staging than in the initial input shp file.
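One possible explanation for the missing log lines is that the logger used inside `Deduplicator.py` has no handler attached, or its level filters the messages out. A minimal pattern (the logger name here is a guess, not the package's actual logger):

```python
import logging

log = logging.getLogger("pdgstaging.Deduplicator")  # hypothetical logger name

def clip_gdf_with_logging(gdf):
    """Example of the kind of debug line inserted; if the logger has no
    handler or its effective level is above INFO, the message never
    reaches log.log, which would explain the missing statements."""
    log.info("clip_gdf called with %d rows", len(gdf))
    return gdf

# A handler and level must be configured once, e.g.:
# logging.basicConfig(filename="log.log", level=logging.INFO)
```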
The relationship between `label_duplicates()`, the number of input shp files within one gdf, and the function `clip_gdf()` is the source of the missing clipping-to-footprint/deduplication. Notes on how these interact are in this thread above, here. Essentially, the gdf is not clipped to footprint if it contains polygons from only 1 input shp file ("group"). This is the case every time we save a file that doesn't already exist.
Robyn suggested that the same problem is also present here: we only run the `combine_and_deduplicate` method if we are adding polygons into an existing tile. She suggested that we improve this approach by clipping the entire file to the footprint within the TileStager before we save the tiles here.
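Robyn's suggestion could look roughly like this inside the TileStager. The function and parameter names are hypothetical; `geopandas.clip(gdf, mask)` is a real geopandas API, but the stager/footprint wiring here is assumed:

```python
def stage_with_clipping(gdf, footprint, save_tiles, clip=None):
    """Sketch of the proposed fix: clip the entire file to its footprint
    *before* tiles are saved, so clipping no longer depends on a tile
    file already existing."""
    if clip is None:
        import geopandas as gpd  # deferred import; requires geopandas
        clip = gpd.clip
    clipped = clip(gdf, footprint)  # drop/trim geometry outside the footprint
    save_tiles(clipped)             # only clipped polygons reach the tiles
    return clipped
```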
As outlined in this comment in issue #19, it seems that setting `"deduplicate_clip_to_footprint": True` in the config did not result in the desired visualization in the web tiles; the resulting web tiles still showed overlap at the edges. The config used for this IWP test run is here.