Figure out how to handle categorical data

dionhaefner commented 6 years ago

Challenges:

values must be mapped to colors consistently
stretching does not make sense
legend must be able to return categories
whether a dataset is categorical or not must be known at ingestion time
or is there a way to provide most of this while keeping terracotta agnostic of categories?
how big of a use case is categorical data in the real world™️?

j08lue commented 6 years ago

Another thing:

reprojection method must be nearest-neighbor to preserve the categories

mads-gras commented 6 years ago

this will be needed rather soon - the first version of terracotta supported this.

I can provide a case we can work on ;)

dionhaefner commented 6 years ago

Actually quite a challenging problem. This will require another lengthy API planning session. Looking forward to it 😉

j08lue commented 6 years ago

> This will require another lengthy API planning session. Looking forward to it 😉

Yes, you and @mrpgraae go into conclave and send up some smoke when you have figured it out...

mrpgraae commented 6 years ago

The Terracotta Council of Elders concludes... ☁️ ☁️ ☁️

There is no good way to implement categorical datasets in a way that Terracotta is agnostic about them. We will have to implement special cases and features for categorical datasets.

Split `/legend` into `/legend` and `/colormap`

The current /legend should be renamed to /colormap, since that is more descriptive of what the call actually returns. A call to /legend/{keys} shall henceforth return the names of the categories in a categorical dataset and their associated hex color strings. Calling /legend on a non-categorical dataset returns an empty dict.

New parameter for `driver.insert`

Add a new parameter called categories which should be a list of Category named tuples (could be dataclasses in the future). The Category named tuple has 3 attributes:

value: number-type of the raster value that represents the category
color: 3-tuple of 0..255 RGB values
name: str defining the name of the category

A new column Categories will be added to the database. The value will be a VARCHAR containing a JSON encoding of the categories. For non-categorical datasets, this column will be null. The presence of this defines whether or not a dataset is categorical.
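A minimal sketch of what the proposed Category tuple and its JSON column encoding could look like. The helper names `encode_categories` and `decode_categories` are hypothetical and not part of Terracotta; only the three attributes and the "null means non-categorical" rule come from the proposal above.

```python
import json
from typing import List, NamedTuple, Optional, Tuple, Union


class Category(NamedTuple):
    value: Union[int, float]      # raster value that represents the category
    color: Tuple[int, int, int]   # RGB, each channel in 0..255
    name: str                     # name of the category


def encode_categories(categories: Optional[List[Category]]) -> Optional[str]:
    """Serialize categories to the JSON string stored in the database column."""
    if categories is None:
        return None  # a NULL column marks a non-categorical dataset
    return json.dumps([c._asdict() for c in categories])


def decode_categories(column: Optional[str]) -> Optional[List[Category]]:
    """Inverse of encode_categories; JSON lists are turned back into tuples."""
    if column is None:
        return None
    return [
        Category(value=d["value"], color=tuple(d["color"]), name=d["name"])
        for d in json.loads(column)
    ]


categories = [
    Category(value=1, color=(0, 0, 255), name="water"),
    Category(value=2, color=(0, 128, 0), name="forest"),
]
stored = encode_categories(categories)
restored = decode_categories(stored)
```

A dataset would then count as categorical exactly when the stored column value is not null.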

The ugly part

We will need to add branches in the low-level functions to handle the categorical case:

Resampling for reprojection should be nearest neighbor
Stretch parameters should be ignored
Color mapping must be done according to user-specified color values
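To illustrate the last point, here is a rough sketch of an explicit color mapping that bypasses stretching entirely. It uses plain nested lists instead of raster arrays to stay dependency-free, and the function name `apply_explicit_colormap` is made up for this example. Rendering unmapped pixel values as fully transparent is an assumption, not something decided above.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]

# Assumption: pixel values without a user-specified color become transparent
TRANSPARENT: RGBA = (0, 0, 0, 0)


def apply_explicit_colormap(
    tile: List[List[int]], colors: Dict[int, RGB]
) -> List[List[RGBA]]:
    """Look up each pixel value directly; no stretching or interpolation."""
    return [
        [(*colors[v], 255) if v in colors else TRANSPARENT for v in row]
        for row in tile
    ]


tile = [[1, 2], [2, 9]]
colors = {1: (0, 0, 255), 2: (0, 128, 0)}
image = apply_explicit_colormap(tile, colors)
```

Because every pixel is looked up by exact value, any resampling that blends neighboring values (bilinear, cubic) would produce values with no matching category, which is why nearest neighbor is required.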

Bonus features

When terracotta optimize-rasters is used to cloud-optimize a raster, we should set a GeoTIFF tag specifying what resampling method was used for the overviews. We can then warn the user if they are trying to add a dataset as categorical, when they used something other than nearest as resampling method.

We could allow users to not specify colors for the categories and then auto-generate a nice color cycle for them. This could be done with something like an np.linspace index into the Viridis colormap.
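A dependency-free sketch of the auto-generated color cycle. Instead of indexing the full Viridis lookup table with np.linspace, it interpolates linearly between five evenly spaced Viridis anchor colors, so the result only approximates the real colormap. The function name `auto_colors` is hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]

# Five evenly spaced anchor colors of Viridis (an approximation of the full map)
VIRIDIS_ANCHORS: List[RGB] = [
    (68, 1, 84),     # #440154
    (59, 82, 139),   # #3b528b
    (33, 145, 140),  # #21918c
    (94, 201, 98),   # #5ec962
    (253, 231, 37),  # #fde725
]


def auto_colors(n: int) -> List[RGB]:
    """Return n colors spread evenly across the anchor colors."""
    if n < 1:
        return []
    if n == 1:
        return [VIRIDIS_ANCHORS[0]]
    last = len(VIRIDIS_ANCHORS) - 1
    colors: List[RGB] = []
    for i in range(n):
        pos = i / (n - 1) * last      # same spacing as np.linspace(0, last, n)
        lo = min(int(pos), last - 1)  # index of the lower anchor
        frac = pos - lo               # position between lower and upper anchor
        a, b = VIRIDIS_ANCHORS[lo], VIRIDIS_ANCHORS[lo + 1]
        colors.append(tuple(round(ca + (cb - ca) * frac) for ca, cb in zip(a, b)))
    return colors


palette = auto_colors(7)  # one color per category for a 7-class dataset
```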

dionhaefner commented 6 years ago

Thoughts:

We shouldn't call the legend API endpoint colormap, since we already have colormaps.
categories might be a better name than legend, since it makes it clearer that it doesn't make sense to pass non-categorical data.
Do we even need a separate endpoint? We could return it with metadata, which shouldn't be much of an issue unless there are thousands of categories in an image.
Not convinced about the insert API anymore. Since it doesn't make sense to specify colors for some pixel values, it could just accept two arguments (categories and colormap, for instance, where colormap can either be the name of one that is built-in, or an actual mapping)

dionhaefner commented 6 years ago

Category-agnostic Terracotta

Recipe to create categorical datasets:

Create keys [type, sensor, date, band]
Ingest categorical data with type=categorical, and other data with type=index or type=reflectance or whatever
During ingestion, add category mapping to extra_metadata, in the form of {category: pixel_value}
In the frontend, get all categorical datasets via /datasets?type=categorical
Get categories via /metadata (includes ingested extra_metadata)
Get imagery via /singleband/categorical/S2/20180820/classification/{z}/{x}/{y}.png?colormap={pixel_value: color, ...} (supplying mapping like this suppresses stretching and uses nearest resampling)
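The frontend side of this recipe could be sketched roughly as follows. The server address, the tile coordinates, and the helper `categorical_tile_url` are placeholders for illustration, and encoding the mapping as a JSON object in the `colormap` query parameter is an assumption about how `{pixel_value: color, ...}` would be serialized.

```python
import json
from typing import Dict, Sequence
from urllib.parse import quote, urlencode


def categorical_tile_url(
    base_url: str, keys: Sequence[str], z: int, x: int, y: int,
    colormap: Dict[int, str],
) -> str:
    """Build a /singleband tile URL carrying an explicit color mapping."""
    key_path = "/".join(quote(str(k), safe="") for k in keys)
    # Assumption: the mapping is passed as a JSON object {pixel_value: color}
    mapping = json.dumps({str(pixel): color for pixel, color in colormap.items()})
    query = urlencode({"colormap": mapping})
    return f"{base_url}/singleband/{key_path}/{z}/{x}/{y}.png?{query}"


# Category mapping as stored in extra_metadata during ingestion
categories = {"water": 1, "forest": 2}
# Colors are chosen entirely in the frontend
palette = {"water": "#0000ff", "forest": "#008000"}
colormap = {pixel: palette[name] for name, pixel in categories.items()}

url = categorical_tile_url(
    "http://localhost:5000",  # placeholder server address
    ["categorical", "S2", "20180820", "classification"],
    10, 548, 380,             # placeholder tile coordinates
    colormap,
)
```

In a real frontend, `categories` would come from the /metadata response for the dataset rather than being hard-coded.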

Pros

Only new backend code would be support for explicit color mappings in /singleband?colormap=...
We get a search feature for categorical data for free (via /datasets)
All color choices are done in the frontend
Perfectly clear from the first query which datasets are categorical, and which color mapping behavior is used
No new endpoints
Follows terracotta philosophy ("there are only keys")

Cons

People can still do /rgb and /singleband without manual colormap and receive mild to extreme garbage
We can't warn if categorical data gets ingested with non-nearest resampling
No default color mappings for categorical data; users have to specify one (but we could recommend ColorBrewer or similar, or users could extract colors from a matplotlib colormap via /legend)

Whether we should go for this or not depends on how explicit we want to be in supporting categorical data. Is it a niche use case or a core feature? Can we afford to annoy the users a little with this somewhat hacky recipe?

dionhaefner commented 6 years ago

Implemented. We'll see how this recipe works in practice. If it proves to be too cumbersome, we can still introduce explicit support for categories by supplying them directly to driver.insert.
