Automate multiple palette assignment to each statistic

julietcohen commented 8 months ago

During the rasterization step of the viz workflow, each statistic defined in the config generates one tileset of geotiffs. Importantly, each statistic contains a specification for one palette, which can be either a list of hex codes or the name of a palette that contains multiple colors. During the subsequent viz workflow step, web tiling, each palette is assigned to its respective tileset of geotiffs, generating one tileset of web tiles for each geotiff tileset.

Because the portal currently displays PNGs rather than geotiffs, we cannot generate layers in whatever palette a user may specify. In other words, we cannot generate PNG tilesets on the fly.

Doug Hungarter from Google.org suggested that users be able to choose between several predefined palettes for each data layer. For example, a user should be able to choose between green, yellow, or purple ice-wedge polygons while the ice-wedge polygons layer is toggled on, potentially layered on top of other layers. He suggested that we automate the web tiling with the following changes:

Use Fabio Crameri’s work on Scientific Color Maps (SCL, Reference: fabiocrameri.ch/colourmaps/) as the source for the Imagery Viewer’s colorization definitions.
Create a preset order of palettes, with unique definitions for each Data Type:
1. categorical
2. discrete
3. continuous
The first layer the user turns on is rendered using the first palette in the mode that corresponds to the layer’s Data Type. The second layer gets the second palette, etc.
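As a rough illustration of the preset-order idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `PRESET_PALETTES` table, the `PaletteAssigner` class, the specific palette names (placeholders in the style of the Scientific Colour Maps naming), and the choice to count layers separately per Data Type and wrap around when the presets run out.

```python
from collections import defaultdict

# Hypothetical preset order per data type. The names are placeholders;
# the real selection and order would still need to be decided.
PRESET_PALETTES = {
    "categorical": ["batlowS", "hawaiiS", "lajollaS"],
    "discrete": ["batlow10", "oslo10", "bamako10"],
    "continuous": ["batlow", "oslo", "bamako"],
}


class PaletteAssigner:
    """Hand out the next preset palette for each layer that is turned on."""

    def __init__(self, presets=None):
        self.presets = presets if presets is not None else PRESET_PALETTES
        # Assumption: layers are counted separately for each data type.
        self.counts = defaultdict(int)

    def assign(self, data_type):
        if data_type not in self.presets:
            raise ValueError(f"unknown data type: {data_type!r}")
        palettes = self.presets[data_type]
        # Assumption: wrap around once every preset has been used.
        index = self.counts[data_type] % len(palettes)
        self.counts[data_type] += 1
        return palettes[index]


assigner = PaletteAssigner()
first = assigner.assign("continuous")    # first preset in the continuous list
second = assigner.assign("continuous")   # second preset in the continuous list
other = assigner.assign("categorical")   # first preset in the categorical list
```

Whether the counter should be global or per Data Type is not settled by the suggestion itself, so that choice would need to be confirmed.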

We can and should differentiate between categorical and discrete color palettes:

An example of a discrete dataset is permafrost coverage, which has 4 ordered categories ranging from "isolated patches" (minimum coverage) to "continuous" (maximum coverage)
An example of an unordered categorical dataset is the infrastructure layer

Integrating Doug's suggested changes would require:

modifying the config in viz-staging to accept multiple palettes for each statistic
adding multiple palettes as defaults for each statistic, and those defaults would depend on which of the 3 categories of data the statistic falls within (categorical, discrete, continuous)
adding functionality for the web tiling step in viz-raster to iterate through each palette for each statistic
figuring out the best way for these tilesets to be written and labeled in hierarchical directories (nested within the web_tiles dir)
- currently, each statistic is a subdir within web_tiles, so if each statistic also has multiple versions (one tileset per palette), then those would need to be nested as subdirs within the statistic name within web_tiles, so it would look something like: web_tiles/permanent_water/palette_1/WGS1984Quad/12/..., web_tiles/permanent_water/palette_2/WGS1984Quad/12/... and so on
In order to create a statistic for a band during rasterization, we cannot yet ingest strings as the categorical or discrete attribute values. The pixel values must be numeric. So before staging, one step we currently take for categorical or discrete string attributes is to assign a number to each unique category as a new attribute, and that numerical attribute is what we do statistics on. So in order to automate this new approach to web tiling, we would need to integrate a step during staging to recognize when a string attribute is input for a statistic, and make it numeric.
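The string-to-numeric step described in the last item could look something like the sketch below. It is plain Python rather than viz-staging code: the function name `encode_categories`, its parameters, and the choice to start codes at 1 (so that 0 stays free for the current default no data value) are all assumptions. The two intermediate permafrost classes in the usage example are illustrative and not taken from the actual dataset.

```python
def encode_categories(values, ordered_categories=None, first_code=1):
    """Map string categories to integer codes.

    values: the attribute values for one column (strings, or None for missing).
    ordered_categories: explicit order for discrete (ordered) data; when
        omitted, categories are sorted alphabetically, which is only
        appropriate for unordered categorical data.
    first_code: the code given to the first category (assumed 1 here so
        that 0 is not used for a real category).
    Returns (codes, mapping); values not in the mapping are encoded as None.
    """
    values = list(values)
    if ordered_categories is None:
        ordered_categories = sorted({v for v in values if v is not None})
    mapping = {
        category: first_code + i
        for i, category in enumerate(ordered_categories)
    }
    codes = [mapping.get(v) for v in values]
    return codes, mapping


# Discrete (ordered) example: the order must be supplied explicitly.
extent_order = ["isolated patches", "sporadic", "discontinuous", "continuous"]
extent_codes, extent_map = encode_categories(
    ["continuous", "isolated patches", "sporadic"], extent_order
)

# Categorical (unordered) example: alphabetical order is arbitrary but stable.
infra_codes, infra_map = encode_categories(["road", "building", "road"])
```

Keeping the returned mapping alongside the tiles would be needed to build a legend later, since the pixel values alone no longer carry the category names.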

This issue is related to viz-staging issue#25

julietcohen commented 8 months ago

Another related in-progress ticket is issue#16. For each statistic, the no data value must be defined, and it may vary per statistic. The default is currently set to be 0, but in plenty of datasets 0 is a real value.

A high-priority goal of the visualization workflow is to be able to differentiate between no data and no detections. For a layer, users should be able to tell whether a region (tile) that lacks polygons is empty because the region was run through a detection model and no feature was detected, or because the region was never input into the model.

We have discussed achieving this in various ways. Some examples:

including a layer for footprints of the detection files alongside the detections layer to communicate that the regions that fall within the footprints are the only regions surveyed
including a layer for the opposite: footprints for regions that were not surveyed, clarifying that the regions outside those boundaries were surveyed
- I think which of these 2 approaches makes more sense could be partially determined by the number of datasets that cover the vast majority of the Arctic with a few small gaps, versus the number of datasets that have only a few spotty regions across the extent of the Arctic

If there is existing documentation about this that I haven't linked to, anyone feel free to note it.

Notes:

Highest resolution tiles are only written for the extent of the region covered. So if we are processing a small dataset sample like 2 IWP files, we might only produce <100 staged tiles at z-level 15. If we process more input files, or input files that cover more area, the number of tiles increases.
The statistics in the config include a setting for the no data color, which defaults to transparent. A different hex code can be specified.
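To make the interaction between the no data value and the no data color concrete, here is a small standalone sketch. It does not use viz-raster; `hex_to_rgba`, `colorize`, and the `TRANSPARENT` constant are hypothetical names, and the coloring is simple equal-width binning rather than whatever interpolation the real web tiling performs. The point is only that the no data check has to use the per-statistic value, so that a real 0 is still colored when no data is something else.

```python
import math

TRANSPARENT = (255, 255, 255, 0)  # assumed stand-in for a transparent default


def hex_to_rgba(hex_code):
    """Convert '#rrggbb' or '#rrggbbaa' to an (r, g, b, a) tuple."""
    digits = hex_code.lstrip("#")
    if len(digits) == 6:
        digits += "ff"  # fully opaque when no alpha is given
    if len(digits) != 8:
        raise ValueError(f"unsupported hex code: {hex_code!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4, 6))


def colorize(values, palette, val_range, nodata_value, nodata_color=TRANSPARENT):
    """Assign an RGBA color to each pixel value, treating no data separately."""
    colors = [hex_to_rgba(c) for c in palette]
    low, high = val_range
    out = []
    for v in values:
        is_nan = isinstance(v, float) and math.isnan(v)
        if v is None or is_nan or v == nodata_value:
            out.append(nodata_color)
            continue
        # Clamp to [0, 1], then pick one of len(colors) equal-width bins.
        frac = 0.0 if high == low else (v - low) / (high - low)
        frac = min(max(frac, 0.0), 1.0)
        index = min(int(frac * len(colors)), len(colors) - 1)
        out.append(colors[index])
    return out


palette = ["#000000", "#ffffff"]
# With no data set to -9999, a pixel value of 0 is real and gets a color.
with_sentinel = colorize([0, 10, -9999], palette, (0, 10), nodata_value=-9999)
# With no data left at 0, the same pixel becomes transparent.
with_zero = colorize([0, 10, -9999], palette, (0, 10), nodata_value=0)
```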

julietcohen commented 8 months ago

In the package, web tiling is executed when WebImage() is run within webtile_from_geotiffs(). webtile_from_geotiffs() iterates through each stat in the config, so this loop may be a good place to also iterate through each palette in each stat (a nested loop). The palette argument for WebImage() defines the default palette, and currently only 1 is defined.
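A sketch of that nested loop, written as standalone Python so it does not depend on the real webtile_from_geotiffs() or WebImage() signatures. The helper names `normalize_palettes` and `plan_web_tile_jobs`, and the idea of letting the existing `palette` setting hold either one palette or a list of palettes, are assumptions. The output directories follow the web_tiles/<statistic>/palette_<n> layout proposed earlier in this issue.

```python
from pathlib import PurePosixPath


def normalize_palettes(palette_setting):
    """Return a list of palettes from a config value.

    A single palette is either a name ("oslo") or a list of hex codes.
    Assumption: a list whose items all start with '#' is ONE hex palette;
    any other list is treated as a list of palettes.
    """
    if isinstance(palette_setting, str):
        return [palette_setting]
    items = list(palette_setting)
    if items and all(isinstance(p, str) and p.startswith("#") for p in items):
        return [items]
    return items


def plan_web_tile_jobs(statistics, web_tiles_dir="web_tiles"):
    """List one web-tiling job per (statistic, palette) pair."""
    jobs = []
    for stat in statistics:  # outer loop: one pass per statistic
        palettes = normalize_palettes(stat["palette"])
        for n, palette in enumerate(palettes, start=1):  # inner loop
            out_dir = PurePosixPath(web_tiles_dir) / stat["name"] / f"palette_{n}"
            jobs.append(
                {"statistic": stat["name"], "palette": palette, "out_dir": str(out_dir)}
            )
    return jobs


stats = [
    {"name": "permanent_water", "palette": ["oslo", ["#d9c43f", "#cc3053"]]},
    {"name": "iwp_coverage", "palette": "batlow"},
]
jobs = plan_web_tile_jobs(stats)
# jobs[0]["out_dir"] is "web_tiles/permanent_water/palette_1"
```

In the real function, each job would correspond to one WebImage() call with that job's palette. Naming the subdirectories palette_1, palette_2, and so on keeps paths short, but using the palette name instead might be easier for the portal to reference.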

julietcohen commented 8 months ago

As noted earlier, before iterating through each palette in each stat, we need to determine whether the stat is continuous, categorical, or discrete. This may need to be an entirely new option in the statistic part of the config. Perhaps we could integrate a default for this by writing a function that determines the type of the column entered for the stat with gdf[column_name].dtype, behaving differently if a string is detected than if numbers are detected. The output options are described here.
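One possible shape for that default is sketched below. To keep it self-contained it inspects plain Python values instead of calling gdf[column_name].dtype, so it is a stand-in for the dtype check rather than the check itself. The function name `infer_data_type`, the `explicit_type` override, and the `max_discrete_levels` threshold are all hypothetical. Note that a string column alone cannot reveal whether its categories are ordered, which is one reason an explicit config option would likely still be needed.

```python
DATA_TYPES = ("categorical", "discrete", "continuous")


def infer_data_type(values, explicit_type=None, max_discrete_levels=10):
    """Guess whether a column is categorical, discrete, or continuous.

    explicit_type: value from a (hypothetical) config option; always wins.
    max_discrete_levels: assumed cutoff for calling whole-number data discrete.
    """
    if explicit_type is not None:
        if explicit_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {explicit_type!r}")
        return explicit_type

    non_null = [v for v in values if v is not None]
    if not non_null:
        raise ValueError("cannot infer a data type from an empty column")

    # Any string means the column needs a categorical-style palette.
    if any(isinstance(v, str) for v in non_null):
        return "categorical"

    # A small set of whole numbers is treated as discrete classes.
    unique = set(non_null)
    all_whole = all(float(v).is_integer() for v in unique)
    if all_whole and len(unique) <= max_discrete_levels:
        return "discrete"
    return "continuous"


infra_type = infer_data_type(["road", "building", "road"])
extent_type = infer_data_type([1, 2, 3, 4, 2, 1])
coverage_type = infer_data_type([0.12, 0.5, 0.73])
ordered_strings = infer_data_type(["low", "high"], explicit_type="discrete")
```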

PermafrostDiscoveryGateway / viz-raster

Automate multiple palette assignment to each statistic #24