PermafrostDiscoveryGateway / viz-raster

PDG visualization pipeline for raster processing
Apache License 2.0

Rasterization fails for relatively few staged files inconsistently #7

Open julietcohen opened 1 year ago

julietcohen commented 1 year ago

The rasterization step logs errors for a relatively small number of staged files compared to the total number of staged files. According to log.log, this occurs for both the maximum z-level and the parent z-levels. For the lake size change data sample, the GeoPackage files that errored during a certain parsl run are not obviously corrupt, and when rasterization is applied to them again (outside of the workflow, but still with parallelization) the files are successfully rasterized. See this documented here. This was also seen in the ray workflow with the IWP dataset, for which Robyn noted:

IWP run update: Of 10,805,019 staged tiles, I managed to rasterize and transfer (to scratch) 10,741,400, which means 63,619 tiles (~0.5%) got lost along the way. This could be that some files didn’t transfer to scratch before the 24 hour job limit ran out, or it could be some other problem. I did see some warnings that implied that some geopackage files were corrupt. Since it’s overall a small percent that are missing, I am going to continue with the next steps so that we can visualize what we already have. We can always go back and compare the list of geotiff tiles to staged tiles to see which ones we are missing and try to rasterize just those ones.

It would be helpful to do more runs with various datasets to determine whether the failing files are ever consistent, and whether the errors are random (and can therefore be resolved by simply rasterizing these few staged files again with the same approach), whether the files are actually corrupt, or whether the problem is that the files do not transfer to the scratch directory, as Robyn suggested.
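As a starting point for that check, here is a minimal sketch (assuming staged .gpkg files and output GeoTIFFs share the same relative tile paths under their respective directories; the directory names are hypothetical) that lists staged tiles that never produced a GeoTIFF:

```python
# Sketch: find staged tiles with no corresponding GeoTIFF.
# Assumes staged .gpkg and output .tif files share the same relative
# tile path under their respective directories; dir names are hypothetical.
from pathlib import Path

staged_dir = Path("OUTPUT_STAGING_TILES")
geotiff_dir = Path("OUTPUT_GEOTIFFS")

# Relative tile paths (without extensions) present in each directory
staged = {p.relative_to(staged_dir).with_suffix("") for p in staged_dir.rglob("*.gpkg")}
rasterized = {p.relative_to(geotiff_dir).with_suffix("") for p in geotiff_dir.rglob("*.tif")}

missing = sorted(str(t) for t in staged - rasterized)
print(f"{len(missing)} staged tiles were never rasterized")
for t in missing[:10]:
    print(t)
```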

julietcohen commented 1 year ago

Made branch feature-raster-retry (from the develop branch) to test a modification to RasterTiler.py that integrates 3 tries total for staged files that fail to rasterize. The retries simply apply the same rasterization approach, and log statements are added for when retries are applied. The syntax and execution for this are shown in the following example:

```python
# define function to exemplify retry approach
def test_func(x, y):
    for attempt in range(2):
        try:
            z = x / y
            print(f'Success: {x} / {y} = {z}')
            return z
        except Exception as e:
            print(f'Error dividing {x} by {y} so trying again.')
        else:
            # runs only if the try block raised no exception
            break
    else:
        # for-else: runs only if the loop completed without a break,
        # i.e. every attempt raised an exception
        print(f'Error dividing {x} by {y}, ran out of retries.')
        return None
```

Apply the function with expected input types that won't error:

test_func(x = 9, y = 3)

Output:

Success: 9 / 3 = 3.0
3.0

Apply the function with unexpected input types that will error:

test_func(x = 'elephant', y = 'cat')

Output:

Error dividing elephant by cat so trying again.
Error dividing elephant by cat so trying again.
Error dividing elephant by cat, ran out of retries.

This retry feature is applied to the rasterize_vector() function.

rasterize_vector() with retry:

```python
def rasterize_vector(self, path, overwrite=True):
    """
    Given a path to an output file from the viz-staging step, create a
    GeoTIFF and save it to the configured dir_geotiff directory. By
    default, if the output geotiff already exists, it will be
    overwritten. To change this behaviour, set overwrite to False.
    During this process, the min and max values (and other summary
    stats) of the data arrays that comprise the GeoTIFFs for each band
    will be tracked.

    Parameters
    ----------
    path : str
        Path to the staged vector file to rasterize.
    overwrite : bool
        Optional, defaults to True. If set to False, then if there is
        an existing GeoTiff tile at the output path created,
        rasterization will be skipped.

    Returns
    -------
    morecantile.Tile or None
        The tile that was rasterized or None if there was an error.
    """
    for attempt in range(2):
        try:
            # Get information about the tile from the path
            tile = self.tiles.tile_from_path(path)
            out_path = self.tiles.path_from_tile(tile, 'geotiff')

            if os.path.isfile(out_path) and not overwrite:
                logger.info(f'Skip rasterizing {path} for tile {tile}.'
                            ' Tile already exists.')
                return None

            bounds = self.tiles.get_bounding_box(tile)

            # Track and log the event
            id = self.__start_tracking('geotiffs_from_vectors')
            logger.info(f'Rasterizing {path} for tile {tile} to {out_path}.')

            gdf = gpd.read_file(path)

            # Check if deduplication should be performed first
            dedup_here = self.config.deduplicate_at('raster')
            dedup_method = self.config.get_deduplication_method()
            if dedup_here and dedup_method is not None:
                prop_duplicated = self.config.polygon_prop('duplicated')
                if prop_duplicated in gdf.columns:
                    gdf = gdf[~gdf[prop_duplicated]]

            # Get properties to pass to the rasterizer
            raster_opts = self.config.get_raster_config()

            # Rasterize
            raster = Raster.from_vector(
                vector=gdf, bounds=bounds, **raster_opts)
            raster.write(out_path)

            # Track and log the end of the event
            message = f'Rasterization for tile {tile} complete.'
            self.__end_tracking(id, raster=raster, tile=tile, message=message)
            logger.info(
                f'Complete rasterization of tile {tile} to {out_path}.')

            return tile

        except Exception as e:
            logger.info(f'Error rasterizing {path} for tile {tile} so trying again.')
        else:
            break
    else:
        message = f'Error rasterizing {path} for tile {tile}, ran out of retries.'
        self.__end_tracking(id, tile=tile, message=message)
        # note that the error = e argument is removed from __end_tracking(),
        # because e is no longer locally defined due to the break just before
        # find way to maintain e in the error message for a better workflow
        return None
```

We apply this with the virtual environment rasterRetry, using the local develop branch of viz-staging, the local feature-raster-retry branch of viz-raster, sqlite3, and parsl.

see installed packages:

```
Package             Version    Editable project location
------------------- ---------- -------------------------
affine              2.3.1
asttokens           2.0.5
attrs               22.2.0
backcall            0.2.0
bcrypt              4.0.1
certifi             2022.12.7
cffi                1.15.1
charset-normalizer  2.1.1
click               8.1.3
click-plugins       1.1.1
cligj               0.7.2
coloraide           0.18.1
colormaps           0.3
contourpy           1.0.6
cryptography        38.0.4
cycler              0.11.0
debugpy             1.5.1
decorator           5.1.1
dill                0.3.6
entrypoints         0.4
executing           0.8.3
filelock            3.8.2
Fiona               1.8.22
fonttools           4.38.0
geopandas           0.12.2
globus-sdk          3.15.1
idna                3.4
ipykernel           6.15.2
ipython             8.7.0
jedi                0.18.1
jupyter_client      7.4.7
jupyter_core        4.11.2
kiwisolver          1.4.4
matplotlib          3.6.2
matplotlib-inline   0.1.6
morecantile         3.2.5
munch               2.5.0
nest-asyncio        1.5.5
numpy               1.24.0
packaging           22.0
pandas              1.5.2
paramiko            2.12.0
parsl               2022.12.19
parso               0.8.3
pdgraster           0.1.0      /home/jcohen/viz-raster
pdgstaging          0.1.0      /home/jcohen/viz-staging
pexpect             4.8.0
pickleshare         0.7.5
Pillow              9.3.0
pip                 22.3.1
prompt-toolkit      3.0.20
psutil              5.9.4
ptyprocess          0.7.0
pure-eval           0.2.2
pycparser           2.21
pydantic            1.10.2
Pygments            2.11.2
PyJWT               2.6.0
PyNaCl              1.5.0
pyparsing           3.0.9
pyproj              3.4.1
python-dateutil     2.8.2
pytz                2022.7
pyzmq               24.0.1
rasterio            1.3.4
requests            2.28.1
Rtree               0.9.7
setproctitle        1.3.2
setuptools          65.5.0
shapely             2.0.0
six                 1.16.0
snuggs              1.4.7
stack-data          0.2.0
tblib               1.7.0
tornado             6.2
traitlets           5.7.1
typeguard           2.13.3
typing_extensions   4.4.0
urllib3             1.26.13
wcwidth             0.2.5
wheel               0.37.1
```
see config:

```json
{
  "dir_input": "/home/jcohen/viz-workflow/raster-retry/data_subsample",
  "dir_geotiff": "raster-retry/OUTPUT_GEOTIFFS",
  "dir_web_tiles": "raster-retry/OUTPUT_WEBTILE",
  "dir_staged": "raster-retry/OUTPUT_STAGING_TILES",
  "filename_staging_summary": "raster-retry/staging_summary.csv",
  "filename_rasterization_events": "raster-retry/raster_events.csv",
  "filename_rasters_summary": "raster-retry/raster_summary.csv",
  "version": "DATE",
  "ext_input": ".gpkg",
  "simplify_tolerance": 0.0001,
  "tms_id": "WGS1984Quad",
  "z_range": [0, 11],
  "statistics": [
    {
      "name": "polygon_count",
      "weight_by": "count",
      "property": "centroids_per_pixel",
      "aggregation_method": "sum",
      "resampling_method": "sum",
      "val_range": [0, null],
      "nodata_val": 0,
      "nodata_color": "#ffffff00",
      "palette": ["#d9c43f", "#d93fce"]
    },
    {
      "name": "coverage",
      "weight_by": "area",
      "property": "area_per_pixel_area",
      "aggregation_method": "sum",
      "resampling_method": "average",
      "val_range": [0, 1],
      "nodata_val": 0,
      "nodata_color": "#ffffff00",
      "palette": ["#d9c43f", "#d93fce"]
    }
  ],
  "deduplicate_at": ["staging"],
  "deduplicate_method": "neighbor",
  "deduplicate_keep_rules": [["staging_filename", "larger"]],
  "deduplicate_overlap_tolerance": 0.1,
  "deduplicate_overlap_both": false,
  "deduplicate_centroid_tolerance": null
}
```

The data sample is 3000 randomly sampled polygons from the lake size change dataset (1000 from each UTM zone). To ensure the retry code is actually exercised, we purposefully corrupt one of the GeoPackage files after it is staged by using sqlite3 to delete one of the mandatory tables. See this in the script here.
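For reference, a minimal sketch of that corruption step (this is not the referenced script; the file path is a hypothetical example, and gpkg_contents is used as the mandatory table to drop):

```python
# Sketch: corrupt a staged GeoPackage by dropping one of its mandatory tables.
# The path below is a hypothetical example, not the file used in the run.
import sqlite3

gpkg_path = "raster-retry/OUTPUT_STAGING_TILES/example_staged_tile.gpkg"

conn = sqlite3.connect(gpkg_path)
# gpkg_contents is required by the GeoPackage spec, so removing it makes
# readers such as GDAL/Fiona error when opening this file.
conn.execute("DROP TABLE gpkg_contents")
conn.commit()
conn.close()
```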

julietcohen commented 1 year ago

Executed a run with the same environment and config as above, but inserted the variable e into the log statement that is printed for the initial tries. e cannot be printed in the final error message passed to __end_tracking() because Python unbinds the exception variable when the except block ends, so e is no longer defined in the for loop's else clause.

rasterize_vector() with retry:

```python
def rasterize_vector(self, path, overwrite=True):
    """
    Given a path to an output file from the viz-staging step, create a
    GeoTIFF and save it to the configured dir_geotiff directory. By
    default, if the output geotiff already exists, it will be
    overwritten. To change this behaviour, set overwrite to False.
    During this process, the min and max values (and other summary
    stats) of the data arrays that comprise the GeoTIFFs for each band
    will be tracked.

    Parameters
    ----------
    path : str
        Path to the staged vector file to rasterize.
    overwrite : bool
        Optional, defaults to True. If set to False, then if there is
        an existing GeoTiff tile at the output path created,
        rasterization will be skipped.

    Returns
    -------
    morecantile.Tile or None
        The tile that was rasterized or None if there was an error.
    """
    for attempt in range(2):
        try:
            # Get information about the tile from the path
            tile = self.tiles.tile_from_path(path)
            out_path = self.tiles.path_from_tile(tile, 'geotiff')

            if os.path.isfile(out_path) and not overwrite:
                logger.info(f'Skip rasterizing {path} for tile {tile}.'
                            ' Tile already exists.')
                return None

            bounds = self.tiles.get_bounding_box(tile)

            # Track and log the event
            id = self.__start_tracking('geotiffs_from_vectors')
            logger.info(f'Rasterizing {path} for tile {tile} to {out_path}.')

            gdf = gpd.read_file(path)

            # Check if deduplication should be performed first
            dedup_here = self.config.deduplicate_at('raster')
            dedup_method = self.config.get_deduplication_method()
            if dedup_here and dedup_method is not None:
                prop_duplicated = self.config.polygon_prop('duplicated')
                if prop_duplicated in gdf.columns:
                    gdf = gdf[~gdf[prop_duplicated]]

            # Get properties to pass to the rasterizer
            raster_opts = self.config.get_raster_config()

            # Rasterize
            raster = Raster.from_vector(
                vector=gdf, bounds=bounds, **raster_opts)
            raster.write(out_path)

            # Track and log the end of the event
            message = f'Rasterization for tile {tile} complete.'
            self.__end_tracking(id, raster=raster, tile=tile, message=message)
            logger.info(
                f'Complete rasterization of tile {tile} to {out_path}.')

            return tile

        except Exception as e:
            logger.info(f'Error rasterizing {path} for tile {tile} due to error {e} so trying again.')
        else:
            break
    else:
        message = f'Error rasterizing {path} for tile {tile}, ran out of retries.'
        self.__end_tracking(id, tile=tile, message=message)
        return None
```

I'm curious whether there is a benefit to printing the explicit e in the __end_tracking() message rather than a general log statement. If there is, this change to rasterize_vector() should be adjusted further to enable printing e after the final try. If printing e in the initial tries' log statements is as helpful as printing it in the final __end_tracking() message, then I believe this code is good to go!
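If it does turn out to be worth surfacing e after the final try, one option is to copy the exception to a name that survives the except block. A minimal standalone sketch in the style of the test_func example above (divide_with_retry and last_error are hypothetical names, not code on the branch):

```python
# Sketch: keep the last exception available after the retry loop by copying
# it to a name that is not unbound when the except block ends.
def divide_with_retry(x, y, retries=2):
    last_error = None  # survives after each except block, unlike e
    for attempt in range(retries):
        try:
            return x / y
        except Exception as e:
            last_error = e  # copy before Python unbinds e
            print(f'Error dividing {x} by {y} so trying again.')
    # reached only if every attempt failed
    print(f'Error dividing {x} by {y}, ran out of retries: {last_error}')
    return None

divide_with_retry('elephant', 'cat')
```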

For this run, 3000 polygons were staged, then we corrupted 1 staged file before rasterization started:

| files in staged dir | files in geotiff dir | files in web tiles dir |
| --- | --- | --- |
| 2964 | 6496 | 12992 |

No errors were reported in log.log besides those from the two initial tries on the purposefully corrupted file and the error for the final try:

[screenshot of the log.log error messages]
For the run with the original develop branch RasterTiler.py script (without retries), with the same starting 3000 polygons and none corrupted:

| files in staged dir | files in geotiff dir | files in web tiles dir |
| --- | --- | --- |
| 2964 | 6498 | 12996 |

No errors were reported in that log.log at all. This means that corrupting just that one staged file resulted in 2 fewer GeoTIFFs and 4 fewer web tiles (the web tile counts are twice the GeoTIFF counts because the config defines two statistics, so each missing GeoTIFF removes two web tiles).

julietcohen commented 1 year ago

Just tested this with everything the same except the number of retries reduced to 1. I figure this is better because, with large datasets such as IWP, even with only ~0.5% of the files initially failing, reducing the number of retries from 2 to 1 will save time and Delta credits. We don't want to waste resources retrying the same file an unnecessary number of times.

I would ideally like to test this workflow on a larger amount of data now that the syntax and the number of tries have been shown to work. I could use all the lake change files Ingmar uploaded to the PDG Google Drive. I am not sure if Robyn already processed them, but I believe she has not. It would be interesting to run all those files on the current develop branch (without the retry) and see how many files fail to rasterize, then process the same files with this branch and see how many more were rasterized. That would allow for a deeper check of this update without using any credits on Delta.
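A minimal sketch of that comparison, assuming both runs write GeoTIFFs with the same relative tile paths (the run directory names below are hypothetical):

```python
# Sketch: compare which tiles were rasterized by the develop run vs. the
# retry-branch run. Directory names are hypothetical placeholders.
from pathlib import Path

def rasterized_tiles(geotiff_dir):
    root = Path(geotiff_dir)
    return {p.relative_to(root) for p in root.rglob("*.tif")}

develop_run = rasterized_tiles("run_develop/OUTPUT_GEOTIFFS")
retry_run = rasterized_tiles("run_retry_branch/OUTPUT_GEOTIFFS")

print(f"{len(retry_run - develop_run)} tiles only rasterized with the retry branch")
print(f"{len(develop_run - retry_run)} tiles only rasterized with the develop branch")
```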