PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

Automate adding an evenly-spaced numerical code for categorical datasets #48

Open julietcohen opened 6 months ago

julietcohen commented 6 months ago

Some datasets submitted to the PDG are categorical data rather than numeric continuous data. These categorical codes may come in the form of strings, such as in the permafrost and ground ice dataset that has 4 categories of permafrost coverage, each described with a letter, and they are ordered (least to most permafrost coverage). Other categorical codes come as numbers, but the number does not represent magnitude or order. Instead, each number represents a string, such as in the SACHI_v2 infrastructure dataset that has 7 different infrastructure types.

For ordered categorical datasets like the permafrost and ground ice coverage dataset, we want to use a palette that has 4 distinct shades of 1 or 2 colors, like light blue to dark blue, to show the rank of each category. But unlike a continuous dataset, there should only be 4 shades, nothing in between.

For unordered categorical datasets like the infrastructure dataset, we want to use a palette with a distinct color for each possible value, such as red, blue, green, orange, gray, etc., rather than a scale of shades of one color or a diverging palette of a couple of colors, because we want the categories to appear unrelated to each other.

In order to web tile and assign a palette in the first place, the values in the raster cells must be numbers, not strings. So for a dataset like permafrost and ground ice, we can assign 1 to the category with the least coverage, 2 to the category with the second-least coverage, 3 to the category with the second-most coverage, and 4 to the category with the most coverage.

For the infrastructure dataset, the categorical values are already numbers, but they are unevenly spaced (11, 12, 13, 20, 30, 40, 50) rather than evenly spaced (1, 2, 3, 4, 5, 6, 7). As a result, even if we assign a palette with 7 distinct colors to this stat, the web tiling step fails to assign one color to each category. To get one color per category, it seems that we need to translate the unevenly spaced code into an evenly spaced one: in the vector stage, we create a new attribute that codes every 11 as 1, 12 as 2, 13 as 3, 20 as 4, 30 as 5, etc. Then when we rasterize, we should make 2 bands per raster: one for the original uneven code given in the dataset, so that the numbers in the raster cells match the metadata provided by the researcher, and one for the even code, which exists only so we can web tile it and put it on the portal.
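The recoding described above can be sketched in plain Python. This is only an illustration, not code from viz-staging: the helper name `build_even_code` and the sample polygon values are hypothetical, and it assumes the full set of codes is known up front.

```python
def build_even_code(codes):
    """Map each distinct code to its 1-based rank among the sorted distinct codes."""
    return {code: rank for rank, code in enumerate(sorted(set(codes)), start=1)}


# Infrastructure codes as listed in the paragraph above
uneven_codes = [11, 12, 13, 20, 30, 40, 50]
lookup = build_even_code(uneven_codes)

# Hypothetical attribute values for a handful of polygons
polygon_codes = [20, 11, 50, 13]
even_codes = [lookup[code] for code in polygon_codes]

print(lookup)
print(even_codes)
```

With a GeoDataFrame, the same lookup could presumably be applied to a column with pandas' `Series.map(lookup)`, keeping the original column alongside the new one so that both can be rasterized as separate bands.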

I tried this with the infrastructure dataset and got good results. See the issue comment here. But this dataset takes a while to process, so testing my theory can be done with fewer polygons.

Reproducible Example

To test my theory with a smaller data sample, I used a small sample of IWP polygons (6,000 total polygons, all near each other on Wrangel Island) and assigned 2 new attributes to the input data for the viz workflow: code_even and code_uneven.

So overall, the value range of code_even is [1,3] and the value range of code_uneven is [16,40]. Note that the values have the same distribution in both attributes, which isolates this approach from the other palette issue described in issue #35, which concerns datasets with skewed distributions of the attributes we want to visualize.
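For anyone reproducing this, the attribute assignment could look roughly like the sketch below. It uses plain dictionaries rather than a GeoDataFrame to stay self-contained. The function name `assign_test_codes` is hypothetical, and the group size of 2,000 and the middle uneven value of 20 are inferred from the results described further down, not from the original script.

```python
def assign_test_codes(n_polygons=6000, group_size=2000):
    """Return one dict per polygon holding an even and an uneven categorical code."""
    even_values = [1, 2, 3]       # evenly spaced codes
    uneven_values = [16, 20, 40]  # unevenly spaced codes (assumed values)
    rows = []
    for i in range(n_polygons):
        # Polygons 0-1999 fall in group 0, 2000-3999 in group 1, the rest in group 2
        group = min(i // group_size, len(even_values) - 1)
        rows.append({
            "code_even": even_values[group],
            "code_uneven": uneven_values[group],
        })
    return rows


rows = assign_test_codes()
print(rows[0], rows[2000], rows[5999])
```

Each value appears the same number of times in both attributes, so any difference in the web tiles comes from the spacing of the codes alone.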

I ran the viz workflow with 2 stats, one for each code. Each stat has the same palette: red, blue, and green.

viz workflow

```
# filepaths
from pathlib import Path
import os

# visual checks & vector data wrangling
import geopandas as gpd

# staging
import pdgstaging
from pdgstaging import TileStager

# rasterization & web-tiling
import pdgraster
from pdgraster import RasterTiler

# logging
from datetime import datetime
import logging
import logging.handlers
from pdgstaging import logging_config

config = {
  "deduplicate_clip_to_footprint": False,
  "dir_input": "/home/jcohen/test_categorical_webtiling/data/",
  "ext_input": ".gpkg",
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/",
  "dir_web_tiles": "web_tiles/",
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    12
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "code_uneven",
      "weight_by": "area",
      "property": "code_uneven",
      "aggregation_method": "max",
      "resampling_method": "nearest",
      "val_range": [
        16,
        40
      ],
      "palette": [
        "#ff1f1f",
        "#4b1fff",
        "#1fff2a"
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
    {
      "name": "code_even",
      "weight_by": "area",
      "property": "code_even",
      "aggregation_method": "max",
      "resampling_method": "nearest",
      "val_range": [
        1,
        3
      ],
      "palette": [
        "#ff1f1f",
        "#4b1fff",
        "#1fff2a"
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
  ],
  "deduplicate_at": [None],
  "deduplicate_method": None
}

stager = TileStager(config)
stager.stage_all()
RasterTiler(config).rasterize_all()
```

See the difference in the output web tiles:

code_even

The first 2,000 polygons are red, the next 2,000 are blue, and the last 2,000 are green. There are no colors in between.

[Image: web tiles for code_even]

code_uneven

The first 2,000 polygons are red, the next 2,000 are not blue, but rather pink (because 20 is closer to 16 than it is to 40), and the last 2,000 polygons are green.

[Image: web tiles for code_uneven]
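The pink result is what one would expect if the palette is stretched linearly across `val_range`. The sketch below is a standalone illustration of that assumption, not the actual pdgraster coloring code; the helper names `hex_to_rgb`, `rgb_to_hex`, and `color_for` are hypothetical.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an iterable of three numeric channels to '#rrggbb'."""
    return "#" + "".join(f"{round(c):02x}" for c in rgb)


def color_for(value, val_range, palette):
    """Linearly interpolate a color for value, with palette stops spread evenly over val_range."""
    lo, hi = val_range
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)  # position in [0, 1]
    scaled = t * (len(palette) - 1)                   # position in palette-stop units
    i = min(int(scaled), len(palette) - 2)            # index of the lower stop
    frac = scaled - i                                 # distance toward the upper stop
    a, b = hex_to_rgb(palette[i]), hex_to_rgb(palette[i + 1])
    return rgb_to_hex(ac + (bc - ac) * frac for ac, bc in zip(a, b))


palette = ["#ff1f1f", "#4b1fff", "#1fff2a"]  # red, blue, green

even_colors = [color_for(v, (1, 3), palette) for v in (1, 2, 3)]
uneven_colors = [color_for(v, (16, 40), palette) for v in (16, 20, 40)]
print(even_colors)
print(uneven_colors)
```

Under this assumption, the even codes 1, 2, and 3 land exactly on the three palette stops, while 20 sits one sixth of the way along [16, 40], which is a third of the way from red to blue and therefore comes out as a pinkish blend.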
katmatson commented 6 months ago

As a general approach, how does it sound to add to the config the ability to specify that a column is categorical and what the proper ordering of the values is?

For the permafrost and ground ice coverage, for example, I'm thinking that would look something like:

categorical_value: [
  {
    prop: 'EXTENT',
    values: ['I', 'S', 'D', 'C']
  },
]

viz-staging would then use this to create a new property, perhaps named something like EXTENT_normalized, where a vector with EXTENT 'I' has EXTENT_normalized set to 1, EXTENT 'S' gets EXTENT_normalized 2, etc.
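A rough sketch of that step, using plain dictionaries in place of vector features. Nothing here exists in viz-staging today: the function name `add_normalized_property` and the `_normalized` suffix are just the hypothetical names from this proposal.

```python
def add_normalized_property(features, prop, ordered_values):
    """Add '<prop>_normalized' to each feature: the 1-based position of its value in ordered_values."""
    rank = {value: i for i, value in enumerate(ordered_values, start=1)}
    out_name = f"{prop}_normalized"
    for feature in features:
        # A value missing from ordered_values raises KeyError rather than being silently coded
        feature[out_name] = rank[feature[prop]]
    return features


# Hypothetical features carrying the EXTENT property
features = [{"EXTENT": "I"}, {"EXTENT": "C"}, {"EXTENT": "S"}, {"EXTENT": "D"}]
add_normalized_property(features, "EXTENT", ["I", "S", "D", "C"])
print(features)
```

Failing on an unlisted value is a deliberate choice in this sketch, since a category missing from the config would otherwise end up with no color.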

For properties with categorical numerical values, are the possible values known before running the viz-staging pipelines? If so, that one config would be sufficient; if not, there'd need to be something added to keep track of all seen values for that property and then add the normalized values only after going through all of the vectors the first time.
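If the values are not known ahead of time, the two-pass idea could look something like the sketch below. It is only an outline under simplifying assumptions: batches of features are plain lists of dictionaries that can be iterated twice, the function names are hypothetical, and the sorted order of the seen values is assumed to be an acceptable ordering.

```python
def collect_seen_values(batches, prop):
    """First pass: gather every distinct value of prop across all batches."""
    seen = set()
    for batch in batches:
        for feature in batch:
            seen.add(feature[prop])
    return seen


def normalize_in_two_passes(batches, prop):
    """Second pass: write '<prop>_normalized' using ranks built from the first pass."""
    seen = collect_seen_values(batches, prop)
    lookup = {value: i for i, value in enumerate(sorted(seen), start=1)}
    for batch in batches:
        for feature in batch:
            feature[f"{prop}_normalized"] = lookup[feature[prop]]
    return lookup


# Hypothetical batches standing in for separately processed input files
batches = [
    [{"code": 20}, {"code": 11}],
    [{"code": 50}, {"code": 11}, {"code": 13}],
]
lookup = normalize_in_two_passes(batches, "code")
print(lookup)
print(batches)
```

The cost is that no normalized value can be written until every input has been read once, which matters if staging is parallelized across files.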