PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

Clipping to footprint in config still results in overlap in staged tiles #17

Closed julietcohen closed 1 year ago

julietcohen commented 1 year ago

As outlined in this comment in issue #19, setting `"deduplicate_clip_to_footprint": True` in the config did not produce the desired visualization in the web tiles: the resulting web tiles still showed overlap at the edges. The config used for this IWP test run is here.

julietcohen commented 1 year ago

Overview of Clipping to Footprint

julietcohen commented 1 year ago

Check if any staged tiles' property staged_deduplicated is set to True

As noted above, the function add_properties() adds the property 'duplicated' to each gdf (which is converted to 'staging_duplicated') and initializes every row to False. The intent is that later on, when duplicates are identified, the values in those rows are flipped to True.
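A minimal sketch of that behavior, using a plain DataFrame in place of the staged gdf (this is illustrative, not the library's actual code; the function name here is invented):

```python
import pandas as pd

def add_duplicated_flag(df: pd.DataFrame) -> pd.DataFrame:
    # every polygon starts out flagged as "not a duplicate"
    df = df.copy()
    df["staging_duplicated"] = False
    return df

polygons = pd.DataFrame({"id": [1, 2, 3]})
polygons = add_duplicated_flag(polygons)
# later, when deduplication identifies a duplicate, its row flips to True:
polygons.loc[polygons["id"] == 2, "staging_duplicated"] = True
print(polygons["staging_duplicated"].tolist())  # [False, True, False]
```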

Plan of attack: stage one Russia shp file (from the same subset of data used in the last test run on Delta) and inspect the property 'staging_duplicated' in all the rows of all the resulting staged files. If deduplication is working properly, we would expect some True values. However, this is likely but not certain: while it is obvious in the plot below that some polygons "hang" over the edge of the footprint, those polygons would be removed by clipping, not by being labeled as duplicates.

*(image: the shapefile's polygons plotted over its footprint)*
Config with clip to footprint True:

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": True,
    "dir_output": "/home/jcohen/iwp_workflow_testing/output_russia_subset/",  # base dir of all output, needs to change every run with definition of output_subdir
    "dir_input": "/home/jcohen/iwp_workflow_testing/input/russia_subset/",  # base dir of all .shp files to be staged
    "ext_input": ".shp",
    "ext_footprints": ".shp",
    "dir_footprints": "/home/jcohen/iwp_workflow_testing/footprints/russia_subset/",  # the footprints start on /scratch before we transfer them to /tmp
    "dir_staged": "/home/jcohen/iwp_workflow_testing/output_russia_subset/staged/",
    "dir_geotiff": "/home/jcohen/iwp_workflow_testing/output_russia_subset/geotiff/",
    "dir_web_tiles": "/home/jcohen/iwp_workflow_testing/output_russia_subset/web_tiles/",  # we do not use /tmp for webtile step, it writes directly to /scratch
    "filename_staging_summary": "/home/jcohen/iwp_workflow_testing/output_russia_subset/staging_summary.csv",
    "filename_rasterization_events": "/home/jcohen/iwp_workflow_testing/output_russia_subset/raster_events.csv",
    "filename_rasters_summary": "/home/jcohen/iwp_workflow_testing/output_russia_subset/raster_summary.csv",
    "filename_config": "/home/jcohen/iwp_workflow_testing/output_russia_subset/config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 15],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "iwp_coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": ["#66339952", "#ffcc00"],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Date", "larger"]],
    "deduplicate_method": "footprints",
}
```
julietcohen commented 1 year ago

Execute check on one shp file's staged files

The any() function checks each value in an iterable and returns True if any value is True, and False if all values are False.
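For example:

```python
# any() returns True as soon as it sees a truthy value:
print(any([False, False, True]))  # True
print(any([False, False]))        # False
print(any([]))                    # False (an empty column also yields False)
```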

Staging the first shp file in the Russia shp file subset resulted in 715 staged gpkg files. Collect all these filepaths into a list, open each as a gdf in geopandas, and check whether the column of interest (staging_duplicated) contains any True values.

check_if_any_true_dups.py:

```python
# Check if any staged files have True for the staging_duplicated property
# when the config is set to clip to footprint

# general imports for viz workflow
import pdgstaging
from pathlib import Path
import geopandas as gpd
import config

input_shp = "/home/jcohen/iwp_workflow_testing/input/russia_subset/WV02_20100720235408_1030010006ABA200_10JUL20235408-M1BS-500152191020_01_P002_u16rf3413_pansh.shp"

config = config.IWP_CONFIG
stager = pdgstaging.TileStager(config = config, check_footprints = False)

# generate the staged files
stager.stage(input_shp)

# check if the staged files have any True values in their 'staging_duplicated' column
input_dir = Path('/home/jcohen/iwp_workflow_testing/output_russia_subset/staged')
# convert each PosixPath to a plain string filepath
gpkg_files = [str(file) for file in sorted(input_dir.rglob('*.gpkg'))]

print(f"There are {len(gpkg_files)} staged files resulting from the one input shp file.")

true_props = []
false_props = []
# the lengths of true_props and false_props should sum to 715
# because that is the total number of staged files,
# and any() returns 1 output per file
for file in gpkg_files:
    gdf = gpd.read_file(file)
    if any(gdf['staging_duplicated']):
        # a True value is present in at least one row of this gdf's
        # 'staging_duplicated' column
        true_props.append("True value detected")
    else:
        # every row of this gdf's 'staging_duplicated' column is False
        false_props.append("all False values detected")

print(f"Is the sum of the two lists is 715? {(len(true_props) + len(false_props)) == 715}")
print(f"The number of files with at least 1 True value detected is {len(true_props)}.\n 0 files means that 0/715 polygons were identified as duplicates.")
```

Output:

```
There are 715 staged files resulting from the one input shp file.
Is the sum of the two lists is 715? True
The number of files with at least 1 True value detected is 0.
 0 files means that 0/715 polygons were identified as duplicates.
```

The plot thickens!

julietcohen commented 1 year ago

Execute check on same shp file's staged files when config is set to deduplicate at staging instead of raster

Change the config to `"deduplicate_at": ["staging"]` instead of `"raster"` to determine whether this results in duplicates being identified in any rows of the staged files.

Config with clip to footprint True, and dedup at staging:

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": True,
    "dir_output": "/home/jcohen/iwp_workflow_testing/output_russia_subset/",  # base dir of all output, needs to change every run with definition of output_subdir
    "dir_input": "/home/jcohen/iwp_workflow_testing/input/russia_subset/",  # base dir of all .shp files to be staged
    "ext_input": ".shp",
    "ext_footprints": ".shp",
    "dir_footprints": "/home/jcohen/iwp_workflow_testing/footprints/russia_subset/",  # the footprints start on /scratch before we transfer them to /tmp
    "dir_staged": "/home/jcohen/iwp_workflow_testing/output_russia_subset/staged/",
    "dir_geotiff": "/home/jcohen/iwp_workflow_testing/output_russia_subset/geotiff/",
    "dir_web_tiles": "/home/jcohen/iwp_workflow_testing/output_russia_subset/web_tiles/",  # we do not use /tmp for webtile step, it writes directly to /scratch
    "filename_staging_summary": "/home/jcohen/iwp_workflow_testing/output_russia_subset/staging_summary.csv",
    "filename_rasterization_events": "/home/jcohen/iwp_workflow_testing/output_russia_subset/raster_events.csv",
    "filename_rasters_summary": "/home/jcohen/iwp_workflow_testing/output_russia_subset/raster_summary.csv",
    "filename_config": "/home/jcohen/iwp_workflow_testing/output_russia_subset/config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 15],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "iwp_coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": ["#66339952", "#ffcc00"],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
        },
    ],
    "deduplicate_at": ["staging"],
    "deduplicate_keep_rules": [["Date", "larger"]],
    "deduplicate_method": "footprints",
}
```
check_of_any_true_dups.py:

```python
# Check if any staged files have True for the staging_duplicated property
# when the config is set to clip to footprint

# general imports for viz workflow
import pdgstaging
from pathlib import Path
import geopandas as gpd
import config_dedup_staging

input_shp = "/home/jcohen/iwp_workflow_testing/input/russia_subset/WV02_20100720235408_1030010006ABA200_10JUL20235408-M1BS-500152191020_01_P002_u16rf3413_pansh.shp"

config = config_dedup_staging.IWP_CONFIG
stager = pdgstaging.TileStager(config = config, check_footprints = False)

# generate the staged files
stager.stage(input_shp)

# check if the staged files have any True values in their 'staging_duplicated' column
input_dir = Path('/home/jcohen/iwp_workflow_testing/output_russia_subset/staged')
# convert each PosixPath to a plain string filepath
gpkg_files = [str(file) for file in sorted(input_dir.rglob('*.gpkg'))]

print(f"There are {len(gpkg_files)} staged files resulting from the one input shp file.")

true_props = []
false_props = []
# the lengths of true_props and false_props should sum to 715
# because that is the total number of staged files,
# and any() returns 1 output per file
for file in gpkg_files:
    gdf = gpd.read_file(file)
    if any(gdf['staging_duplicated']):
        # a True value is present in at least one row of this gdf's
        # 'staging_duplicated' column
        true_props.append("True value detected")
    else:
        # every row of this gdf's 'staging_duplicated' column is False
        false_props.append("all False values detected")

print(f"Is the sum of the two lists is 715? {(len(true_props) + len(false_props)) == 715}")
print(f"The number of files with at least 1 True value detected is {len(true_props)}.\n 0 files means that 0/715 polygons were identified as duplicates.")
```

Output:

```
There are 715 staged files resulting from the one input shp file.
Is the sum of the two lists is 715? True
The number of files with at least 1 True value detected is 0.
 0 files means that 0/715 polygons were identified as duplicates.
```

Same as last time.

julietcohen commented 1 year ago

Code review for labeling duplicates as True

The part of combine_and_deduplicate() that removes the rows that are labeled as True works as expected.
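For reference, that removal step can be sketched as follows, using a plain DataFrame in place of the staged gdf (illustrative only; the function name here is invented and the real logic lives in combine_and_deduplicate()):

```python
import pandas as pd

def drop_labeled_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    # keep only the rows whose duplicate flag is still False
    return df[~df["staging_duplicated"]].copy()

df = pd.DataFrame({"id": [1, 2, 3], "staging_duplicated": [False, True, False]})
print(drop_labeled_duplicates(df)["id"].tolist())  # [1, 3]
```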

Looking at the code that labels the rows as True before we execute combine_and_deduplicate():

julietcohen commented 1 year ago

Comparing logging output when config set to dedup at staging versus raster

The input file has 50955 polygons. This was determined by reading the one input shp file as a gdf and checking its length, and is confirmed by the logging statement that checks the length after the gdf is filtered for polygon geometries only.

Run 1: dedup at raster

config.py:

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": True,
    "dir_output": "/home/jcohen/iwp_workflow_testing/",
    "dir_input": "/home/jcohen/iwp_workflow_testing/input/russia_subset/",
    "ext_input": ".shp",
    "ext_footprints": ".shp",
    "dir_footprints": "/home/jcohen/iwp_workflow_testing/footprints/russia_subset/",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 15],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "iwp_coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": ["#66339952", "#ffcc00"],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Date", "larger"]],
    "deduplicate_method": "footprints",
}
```

Summing the rows of all output staged files gives 49335 polygons.
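The counting logic can be sketched with plain DataFrames standing in for the staged gpkg files (with the real files, each frame would come from gpd.read_file()):

```python
import pandas as pd

def total_polygons(frames) -> int:
    # sum the row counts across all staged (Geo)DataFrames
    return sum(len(df) for df in frames)

# illustrative stand-ins for staged files with 3 and 2 polygons
frames = [pd.DataFrame({"x": range(3)}), pd.DataFrame({"x": range(2)})]
print(total_polygons(frames))  # 5
```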

log.log:

```
2023-03-10 14:16:47,386 [INFO] root: Checking for footprint files...
2023-03-10 14:16:47,387 [INFO] root: Found 1 matching footprints. 0 missing.
2023-03-10 14:16:52,806 [INFO] root: Checking for footprint files...
2023-03-10 14:16:52,807 [INFO] root: Found 1 matching footprints. 0 missing.
2023-03-10 14:17:09,612 [INFO] root: Staging file /home/jcohen/iwp_workflow_testing/input/russia_subset/WV02_20100720235408_1030010006ABA200_10JUL20235408-M1BS-500152191020_01_P002_u16rf3413_pansh.shp
2023-03-10 14:17:09,656 [INFO] root: Length of gdf (input file with geometry type filtered for polygons) is 50955
```

Run 2: dedup at staging

config.py:

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": True,
    "dir_output": "/home/jcohen/iwp_workflow_testing/",
    "dir_input": "/home/jcohen/iwp_workflow_testing/input/russia_subset/",
    "ext_input": ".shp",
    "ext_footprints": ".shp",
    "dir_footprints": "/home/jcohen/iwp_workflow_testing/footprints/russia_subset/",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 15],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "iwp_coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": ["#66339952", "#ffcc00"],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
        },
    ],
    "deduplicate_at": ["staging"],
    "deduplicate_keep_rules": [["Date", "larger"]],
    "deduplicate_method": "footprints",
}
```

Summing the rows of all output staged files again gives 49335.

log.log:

```
2023-03-10 14:23:57,305 [INFO] root: Checking for footprint files...
2023-03-10 14:23:57,306 [INFO] root: Found 1 matching footprints. 0 missing.
2023-03-10 14:24:26,624 [INFO] root: Checking for footprint files...
2023-03-10 14:24:26,624 [INFO] root: Found 1 matching footprints. 0 missing.
2023-03-10 14:24:42,806 [INFO] root: Staging file /home/jcohen/iwp_workflow_testing/input/russia_subset/WV02_20100720235408_1030010006ABA200_10JUL20235408-M1BS-500152191020_01_P002_u16rf3413_pansh.shp
2023-03-10 14:24:42,853 [INFO] root: Length of gdf (input file with geometry type filtered for polygons) is 50955
```

Run 3: dedup at staging and change default clip_to_footprint to true in deduplicate_by_footprint()

Same outputs. I am confused why the logging statements that I inserted into Deduplicator.py are not generated in log.log. Yet there are still fewer polygons after staging than in the initial input shp file.

julietcohen commented 1 year ago

Bug Identified

The relationship between label_duplicates(), the number of input shp files that are within one gdf, and the function clip_gdf() is the source of the missing clipping to footprint/deduplication. Notes on how these interact are in this thread above here.

Essentially, the gdf is not clipped to the footprint if it contains polygons from only 1 input shp file ("group"), which is the case every time we save a staged tile file that doesn't already exist.
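The shape of the problem, sketched as illustrative pseudologic (this is not the actual viz-staging code; the names and structure here are invented for illustration):

```python
def clip_to_footprint(polygons):
    # stand-in for the real footprint clip: drop polygons flagged as
    # falling outside the footprint
    return [p for p in polygons if not p.endswith("_outside")]

def combine_groups(groups):
    """Sketch of the reported control flow: clipping only happens when
    polygons from more than one input file ("group") share a tile."""
    if len(groups) == 1:
        # BUG (as described above): a tile written from a single group
        # is saved as-is, so overhanging polygons are never clipped
        return groups[0]
    merged = [p for group in groups for p in group]
    return clip_to_footprint(merged)

print(combine_groups([["a", "b_outside"]]))    # ['a', 'b_outside'] -- unclipped!
print(combine_groups([["a"], ["b_outside"]]))  # ['a'] -- clipped
```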

Robyn suggested that the same problem is also present here; we only run the combine_and_deduplicate method if we are adding polygons into an existing tile.

She suggested that we improve this approach by clipping the entire file to the footprint within the TileStager before we save the tiles here.
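A minimal sketch of that suggested fix, using plain coordinate tuples in place of GeoDataFrames (with geopandas, the clip step would be gpd.clip(gdf, footprint); all names here are illustrative):

```python
def clip(points, footprint):
    # stand-in for a real footprint clip: keep only points inside the
    # footprint's bounding box (xmin, ymin, xmax, ymax)
    xmin, ymin, xmax, ymax = footprint
    return [(x, y) for (x, y) in points if xmin <= x <= xmax and ymin <= y <= ymax]

def stage(points, footprint, save_tile):
    # clip the entire input to the footprint BEFORE tiling/saving,
    # so single-group tiles can no longer escape the clip
    save_tile(clip(points, footprint))

saved = []
stage([(0.5, 0.5), (2.0, 2.0)], (0, 0, 1, 1), saved.extend)
print(saved)  # [(0.5, 0.5)]
```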