PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

Automate setting range in config to values that best represent the palette #35

Open julietcohen opened 4 months ago

julietcohen commented 4 months ago

In many datasets, we want to visualize an attribute of the data rather than one of the custom statistics (polygon count or coverage) that we use for the ice-wedge polygon data. An example is the lake area time series dataset, where we visualize the areas of permanent water and seasonal water as separate layers. These areas are in hectares and are pulled straight from the vector data for each polygon. Each attribute has a minimum of 0 and a maximum that is an integer that varies per year, say 120,025. The data contain outliers and are not normally distributed, so the range in the config for these stats cannot simply be set to [0, 120025]. (If the user does not set the range, the min and max for each z-level are calculated by default from the raster_summary.csv.)

If we let the range max be the max value of the attribute, the palette in the web tiles does not represent all values of the data clearly on the portal. Too many polygons get the color at the lower end of the palette (like light blue), and few polygons in the tileset get the color for the largest values in the range (like dark blue). For users to best understand the data, all colors of the palette should be represented. By excluding outliers from the range set in the config (meaning we set the max to a value lower than the max value in the attribute), we better represent the middle values in the palette too. Any values that fall beyond the max in the config are given the same color as the max in the range, so this is just like winsorizing.
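The effect described above can be illustrated with a minimal sketch. The `palette_bin` helper, the 5-color palette, and the sample values are all made up for illustration and are not part of viz-staging; the sketch only assumes that values are scaled linearly across the range and that anything beyond the max is clamped to the max.

```python
import numpy as np

def palette_bin(values, val_range, n_colors=5):
    """Hypothetical helper: map values to palette color indices 0..n_colors-1.

    Values outside val_range are clamped to its ends first, so anything
    above the max gets the same color as the max (like winsorizing).
    """
    lo, hi = val_range
    clipped = np.clip(np.asarray(values, dtype=float), lo, hi)
    scaled = (clipped - lo) / (hi - lo)  # linear scaling to 0..1
    return np.minimum((scaled * n_colors).astype(int), n_colors - 1)

# Made-up, heavily skewed attribute values with a single large outlier
areas = [1, 2, 3, 5, 8, 13, 20, 35, 45, 60, 120025]

full = palette_bin(areas, (0, 120025))  # range max = attribute max
trimmed = palette_bin(areas, (0, 60))   # range max lowered to exclude the outlier

print("colors used with full range:   ", np.unique(full))
print("colors used with trimmed range:", np.unique(trimmed))
```

With the full range, every value except the outlier lands in the lowest color; with the trimmed range, the same values spread across the whole palette and the outlier shares the top color.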

The best value for the max in the range should be determined mathematically, for example with a percentile, or with a more complicated approach tailored to the dataset. Using a percentile was explored for the lake area time series data here. The percentile should be adjusted depending on the distribution of each particular attribute. Doing this by hand can be time consuming, so it would be best to integrate an approach into viz-staging that sets the range to values that exclude outliers.
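As a rough sketch of the percentile idea, assuming the attribute values are available as a flat array: the function name `percentile_val_range`, its parameters, and the 95th-percentile default are hypothetical choices for illustration, not existing viz-staging code.

```python
import numpy as np

def percentile_val_range(values, upper_pct=95.0, lower=0.0):
    """Hypothetical helper: suggest a [min, max] range that excludes high outliers.

    The max is the `upper_pct` percentile of the finite values; the min is
    fixed at `lower` (0 for the area attributes discussed above).
    """
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # ignore NaN / inf (e.g. no-data cells)
    if values.size == 0:
        raise ValueError("no finite values to compute a range from")
    upper = float(np.percentile(values, upper_pct))
    return [lower, upper]

# Made-up attribute values: 1..100 plus one extreme outlier
sample = list(range(1, 101)) + [120025]
suggested = percentile_val_range(sample, upper_pct=95.0)
print(suggested)
```

The percentile would still need tuning per attribute, as noted above, so `upper_pct` is left as a parameter rather than hard-coded.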

julietcohen commented 3 weeks ago

For clarity: when I refer to the val_range in the config, I am referring to this.