alan-turing-institute / deepsensor

A Python package for tackling diverse environmental prediction tasks with NPs.
https://alan-turing-institute.github.io/deepsensor/
MIT License

FR: Configure the normalised min/max values when using `"min_max"` with `DataProcessor` #122

Closed: tom-andersson closed this issue 4 months ago

tom-andersson commented 4 months ago

TL;DR: The DataProcessor hard-codes min-max normalisation to [-1, 1], but we now have a use case for mapping to [0, new_max], so we need a new feature that makes this configurable by the user. Motivation below:

DeepSensor now supports a ConvCNP with a Bernoulli-Gamma likelihood (thanks @wesselb for implementing this upstream in neuralprocesses!), see https://github.com/alan-turing-institute/deepsensor/issues/95. This is useful for modelling lower-bounded variables which can take values exactly equal to the lower bound, e.g. precipitation which has many values of exactly zero (no rain).

The current Bernoulli-Gamma implementation in neuralprocesses hard-codes the delta/'spike' component of the Bernoulli-Gamma mixture to take a value of zero (which is reasonable). However, when normalising data with DeepSensor's DataProcessor, using data_processor(data, method="min_max") hard-codes the new min and max to -1 and +1, respectively. This is incompatible with ConvNP(..., likelihood="bernoulli-gamma"), because any target value below 0 results in a NaN loss (and NaN weights). So we now need a way to configure the new min/max of the normalised data.
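
For illustration (a standalone sketch, not DeepSensor code, with made-up precipitation values): the current [-1, 1] mapping sends the raw minimum of the data straight to -1, outside the [0, inf) support of the Gamma component:

```python
# Standalone illustration: min-max scaling to [-1, 1] sends the raw minimum
# (zero precipitation, "no rain") to -1, which the Bernoulli-Gamma likelihood
# cannot evaluate, hence the NaN loss.
raw_min, raw_max = 0.0, 50.0   # e.g. precipitation in mm (illustrative values)
x = 0.0                        # a "no rain" observation
x_norm = (x - raw_min) / (raw_max - raw_min) * 2 - 1
print(x_norm)                  # -1.0
```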

I would envisage this being configurable through __call__ so that it can be different for each variable, e.g. normalised_data = data_processor(raw_data, method="min_max", new_min=0, new_max=1), which would be passed from __call__ through map to map_array, with new logic for "min_max" there.
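
A minimal sketch of what that map_array logic might look like (new_min/new_max are the proposed parameter names, not existing API):

```python
# Hypothetical "min_max" branch with configurable targets; the defaults keep
# the current [-1, 1] behaviour.
def min_max_map(x, data_min, data_max, new_min=-1.0, new_max=1.0, unnorm=False):
    """Linearly map x from [data_min, data_max] to [new_min, new_max], or back."""
    if unnorm:
        return (x - new_min) / (new_max - new_min) * (data_max - data_min) + data_min
    return (x - data_min) / (data_max - data_min) * (new_max - new_min) + new_min
```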

A few extra bits of housekeeping needed:

  1. Set the default values to -1 and +1 for backwards compatibility. We should also consider how this will break users' existing DataProcessor configs (which will not contain these parameter values).

  2. Store the new values in the DataProcessor config, e.g.:

    'elevation': {'method': 'min_max', 'params': {'max': 4504.4375, 'min': -185.125, 'new_min': 0, 'new_max': 1}},
  3. Unit test this in tests/test_data_processor.py, in particular adding this new normalisation feature to the unit test that asserts normalised data is the same after saving and loading a DataProcessor (see the sketch after this list).
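
A rough sketch of that save/load round-trip test, assuming the proposed new_min/new_max kwargs; the toy data and the exact folder-based save/load calls here are illustrative assumptions, not a definitive implementation:

```python
import numpy as np
import xarray as xr
from deepsensor.data import DataProcessor

# Toy gridded variable with a lower bound of zero (e.g. precipitation).
da = xr.DataArray(
    np.random.rand(10, 10) * 100,
    dims=["x1", "x2"],
    coords={"x1": np.linspace(0, 1, 10), "x2": np.linspace(0, 1, 10)},
    name="precip",
)

dp = DataProcessor()
da_norm = dp(da, method="min_max", new_min=0, new_max=1)  # proposed signature

dp.save("tmp_dp_config")                    # persist config (assumed folder-based save)
dp_loaded = DataProcessor("tmp_dp_config")  # reload from the saved config
da_norm_loaded = dp_loaded(da, method="min_max")  # params come from the config

assert np.allclose(da_norm.values, da_norm_loaded.values)
```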

tom-andersson commented 4 months ago

cc @wesselb, an alternative would be to make the location of the spike in Bernoulli-Gamma configurable in neuralprocesses, and expose this through construct_convgnp. We could then keep the [-1, +1] hard-coding in DeepSensor and set the spike location to -1. I'm not sure which approach would be simpler!

wesselb commented 4 months ago

The Bernoulli-Gamma distribution handles unbounded data. Instead of normalising to a bounded interval, perhaps it would be best to normalise to the unbounded interval [0, inf)? I'm thinking something like x -> (x - min(x)) / scale(x) where scale(x) = std(x) or scale(x) = median(x).
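
A quick sketch of that mapping, with the choice of scale left open:

```python
import numpy as np

# x -> (x - min(x)) / scale(x), so normalised values live in [0, inf)
# rather than [-1, 1]; scale(x) could be std(x) or median(x).
def to_lower_bounded(x, scale="std"):
    shifted = x - np.min(x)
    denom = np.std(x) if scale == "std" else np.median(x)
    return shifted / denom
```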

tom-andersson commented 4 months ago

Good shout @wesselb! We could then add this as a new DataProcessor normalisation method (for example, 'min_median', or perhaps more explicitly 'positive_semidefinite').

This would avoid having to worry about backwards compatibility of the 'min_max' method (which we can keep as normalising to [-1, +1]).

tom-andersson commented 4 months ago

Specific list of steps to implement this feature:

  1. Update list of valid methods: https://github.com/alan-turing-institute/deepsensor/blob/fbbbd9e64f1564af5e95bf243b661abd36263cf6/deepsensor/data/processor.py#L100
  2. Compute the relevant parameters of the data (min/median or min/stddev): https://github.com/alan-turing-institute/deepsensor/blob/main/deepsensor/data/processor.py#L292-L295
  3. Add the normalisation / unnormalisation: https://github.com/alan-turing-institute/deepsensor/blob/main/deepsensor/data/processor.py#L501-L527
  4. Add the new methods to the call docstring: https://github.com/alan-turing-institute/deepsensor/blob/main/deepsensor/data/processor.py#L612

The test_data_processor already loops over the valid_methods of the DataProcessor, so we shouldn't need to explicitly add a unit test: https://github.com/alan-turing-institute/deepsensor/blob/main/tests/test_data_processor.py#L66
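
A rough sketch of steps 2 and 3 above, using a hypothetical 'min_std' method name (the final method name is still to be decided):

```python
import numpy as np

def compute_params(data, method):
    """Step 2: compute the per-variable parameters stored in the config."""
    if method == "min_std":
        return {"min": float(np.min(data)), "std": float(np.std(data))}
    raise ValueError(f"Unknown method: {method}")

def map_values(data, params, method, unnorm=False):
    """Step 3: apply (or invert) the normalisation using the stored params."""
    if method == "min_std":
        if unnorm:
            return data * params["std"] + params["min"]
        return (data - params["min"]) / params["std"]
    raise ValueError(f"Unknown method: {method}")
```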

tom-andersson commented 4 months ago

Closed by: https://github.com/alan-turing-institute/deepsensor/commit/88a98182d07edcaf0ac490ea75378f15b1c45dfb

Thanks @wesselb for the idea!