alan-turing-institute / deepsensor

A Python package for tackling diverse environmental prediction tasks with NPs.
https://alan-turing-institute.github.io/deepsensor/
MIT License

TaskLoader fails when declaring multiple `target_delta_t` #129

Closed acocac closed 1 month ago

acocac commented 2 months ago

I've started experimenting with a forecasting setup in DeepSensor (see an MWE in Colab). The example below shows how I define a TaskLoader for predicting air temperature over the next two days (lead times):

task_loader = TaskLoader(
    context=[era5_ds["air"]] * 3,
    context_delta_t=[-1, -2, 0],
    target=[era5_ds["air"], era5_ds["air"]],
    target_delta_t=[1, 2],
    time_freq="D",  # daily frequency (the default)
)

Then I reuse the training procedure from the DeepSensor tutorials. However, training stops with an error when computing the RMSE for the validation tasks.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-d73321d37ac8> in <cell line: 16>()
     18     batch_losses = trainer(train_tasks)
     19     losses.append(np.mean(batch_losses))
---> 20     val_rmses.append(compute_val_rmse(model, val_tasks))
     21     if val_rmses[-1] < val_rmse_best:
     22         val_rmse_best = val_rmses[-1]

1 frames
/usr/local/lib/python3.10/dist-packages/deepsensor/data/processor.py in map_array(self, data, var_ID, method, unnorm, add_offset)
    516             c = -c / m
    517             m = 1 / m
--> 518         data = data * m
    519         if add_offset:
    520             data = data + c

TypeError: can't multiply sequence by non-int of type 'float'

My guess is that map_array needs some changes to handle multiple targets. I suggest checking the type of data at the line linked below: if it is a list, apply the multiplication to each element (each one an np.array in this case).

https://github.com/alan-turing-institute/deepsensor/blob/6de4ddb566db64d35a93fa5e8e1ab4f4327bb8e4/deepsensor/data/processor.py#L518
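
To make the suggestion concrete, here is a minimal sketch of the failure and of the per-element idea. It does not use DeepSensor: scale_each is a hypothetical helper, and plain floats stand in for the per-lead-time arrays. The same duck-typed code would work with np.array elements.

```python
def scale_each(data, m):
    """Hypothetical helper (not part of DeepSensor).

    Multiplies a single array-like by m, or each element of a
    list/tuple of array-likes by m.
    """
    if isinstance(data, (list, tuple)):
        return [x * m for x in data]
    return data * m


# One entry per target set (lead time); floats stand in for arrays here.
per_lead_time = [1.0, 2.0]

# Multiplying the list itself by a float reproduces the TypeError.
try:
    per_lead_time * 0.5
except TypeError as exc:
    message = str(exc)

# Multiplying element by element works.
scaled = scale_each(per_lead_time, 0.5)
```

Whether map_array should accept lists at all is a separate design question; this only illustrates where the error comes from.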

tom-andersson commented 1 month ago

Hey @acocac, thanks for raising this and for the MWE. So you have two target sets for the two lead times, and you want to compute unnormalised RMSE in Kelvin for the first lead time. The model.predict interface is the intended way to get unnormalised predictions for computing unnormalised metrics. I've recently improved DeepSensor's forecasting functionality in deepsensor v0.4, which fixes model.predict forecast outputs; see https://github.com/alan-turing-institute/deepsensor/issues/130 and https://github.com/alan-turing-institute/deepsensor/pull/132.

However, in the MWE, you are not using the data_processor in the right way. .map_array is intended for a single array, not a list, and with two target sets model.mean(task) returns one array per target set, hence the TypeError. I suggest we keep the interface as-is. As a workaround, if you want to keep your current approach:

# Don't do this:
# mean = data_processor.map_array(model.mean(task), target_var_ID, unnorm=True)
# true = data_processor.map_array(task["Y_t"][0], target_var_ID, unnorm=True)
# Do this:
lead_time_idx = 0
mean = model.mean(task)[lead_time_idx]
true = task["Y_t"][lead_time_idx]
error = np.abs(mean - true)
error_unnormalised = data_processor.map_array(error, target_var_ID, unnorm=True, add_offset=False)

But I'd suggest updating DeepSensor and using model.predict :-)
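
On the add_offset=False in the workaround: the traceback shows map_array applying a linear map (data * m, then optionally + c). For a difference of two normalised values the offset cancels, so only the scale should be applied. A small sketch of that arithmetic, with a stand-in unnormalise function and made-up values for m and c (these are assumptions for illustration, not DeepSensor's actual normalisation parameters, and m is assumed positive):

```python
import math


def unnormalise(value, m, c, add_offset=True):
    """Stand-in for a linear unnormalisation: value * m, optionally + c."""
    out = value * m
    if add_offset:
        out = out + c
    return out


m, c = 8.0, 280.0                    # hypothetical scale and offset
pred_norm, true_norm = 0.5, 0.25     # hypothetical normalised values

# Unnormalise both values first, then take the absolute error.
pred = unnormalise(pred_norm, m, c)
true = unnormalise(true_norm, m, c)
error_direct = abs(pred - true)

# Take the error in normalised space, then apply the scale only.
error_norm = abs(pred_norm - true_norm)
error_scaled = unnormalise(error_norm, m, c, add_offset=False)

# Applying the offset to an error shifts it by c, which is wrong.
error_wrong = unnormalise(error_norm, m, c, add_offset=True)

assert math.isclose(error_direct, error_scaled)
```

Both routes give the same error in physical units, while leaving the offset on inflates the error by c.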