Bug Report: Cannot Append to Gridded Data in Active Learning Algorithm

DaniJonesOcean commented 1 month ago

Explanation

The GriddedDataError: Cannot append to gridded data error indicates that the greedy algorithm's current implementation cannot append new observations to a gridded context set. The likely reason is that appending is only supported for non-gridded (off-grid) context sets, where new context points can be added one at a time; a gridded set has no slot for an extra point at an arbitrary location.
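To make the distinction concrete, here is a minimal, self-contained sketch of the failure mode. It is not deepsensor's code: `append_obs_sketch` is a hypothetical helper, the exception class is a local stand-in for the one named in the traceback, and the storage layout (a gridded context set held as a tuple of coordinate arrays, an off-grid set as a `(2, N)` array) is an assumption made for illustration.

```python
import copy

import numpy as np


class GriddedDataError(Exception):
    """Local stand-in for the exception named in the traceback."""


def append_obs_sketch(task, x_new, y_new, context_set_idx=0):
    """Hypothetical helper: return a copy of `task` with one observation appended."""
    x_c = task["X_c"][context_set_idx]
    # Assumption: a gridded context set is a tuple of 1D coordinate arrays (x1, x2).
    if isinstance(x_c, tuple):
        raise GriddedDataError("Cannot append to gridded data")
    new_task = copy.deepcopy(task)
    # Off-grid context: coordinates are (2, N), values are (n_vars, N).
    new_task["X_c"][context_set_idx] = np.concatenate(
        [x_c, x_new.reshape(2, 1)], axis=-1
    )
    new_task["Y_c"][context_set_idx] = np.concatenate(
        [task["Y_c"][context_set_idx], y_new.reshape(-1, 1)], axis=-1
    )
    return new_task


off_grid = {"X_c": [np.zeros((2, 5))], "Y_c": [np.zeros((1, 5))]}
gridded = {
    "X_c": [(np.linspace(0, 1, 4), np.linspace(0, 1, 6))],
    "Y_c": [np.zeros((1, 4, 6))],
}

appended = append_obs_sketch(off_grid, np.array([0.5, 0.5]), np.array([1.0]))
print(appended["X_c"][0].shape)  # (2, 6)

try:
    append_obs_sketch(gridded, np.array([0.5, 0.5]), np.array([1.0]))
except GriddedDataError as err:
    print(err)  # Cannot append to gridded data
```

Under these assumptions, the off-grid task simply grows by one column, while the gridded task is rejected before anything is copied.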

Description

Running deepsensor's active learning GreedyAlgorithm on tasks whose context set is gridded raises a GriddedDataError as soon as the algorithm tries to append a new observation. This suggests that the current implementation does not support modifying gridded data structures dynamically, which limits its usefulness in active learning scenarios involving gridded data.

Steps to Reproduce

Set Up Environment:

import logging
logging.captureWarnings(True)

import deepsensor.torch
from deepsensor.model import ConvNP
from deepsensor.data import DataProcessor, TaskLoader
from deepsensor.train import set_gpu_default_device
from deepsensor.train import Trainer
from deepsensor.active_learning import GreedyAlgorithm
from deepsensor.active_learning.acquisition_fns import Stddev
import pandas as pd
import xarray as xr
import numpy as np
from tqdm import notebook

# Load datasets
dat14 = '/nfs/turbo/seas-dannes/SST-sensor-placement-input/GLSEA3_NETCDF/GLSEA3_2014.nc'
dat15 = '/nfs/turbo/seas-dannes/SST-sensor-placement-input/GLSEA3_NETCDF/GLSEA3_2015.nc'
dat16 = '/nfs/turbo/seas-dannes/SST-sensor-placement-input/GLSEA3_NETCDF/GLSEA3_2016.nc'

dat = xr.open_mfdataset([dat14, dat15, dat16],
                        concat_dim='time',
                        combine='nested',
                        chunks={'lat': 'auto', 'lon': 'auto'})

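# Process data
# Replace missing SST values (NaN) with a constant fill value
mdat = dat.where(dat.sst.notnull(), -0.009)
# Daily climatology over all years, then anomalies relative to it
climatology = mdat.groupby('time.dayofyear').mean('time')
anomalies = mdat.groupby('time.dayofyear') - climatology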

data_processor = DataProcessor(x1_name="lat", x2_name="lon")
anom_ds = data_processor(anomalies)

task_loader = TaskLoader(
    context=anom_ds,
    target=anom_ds,
)

# Validation tasks: one per day of 2016, using the full grid as context and target
val_tasks = []
for date in pd.date_range('2016-01-01T12:00:00.000000000', '2016-12-31T12:00:00.000000000'):
    task = task_loader(date, context_sampling="all", target_sampling="all")
    val_tasks.append(task)

model = ConvNP(data_processor, task_loader)

def compute_val_rmse(model, val_tasks):
    errors = []
    target_var_ID = task_loader.target_var_IDs[0][0]  # assume 1st target set and 1D
    for task in np.random.choice(val_tasks, 50, replace=False):
        mean = data_processor.map_array(model.mean(task), target_var_ID, unnorm=True)
        true = data_processor.map_array(task["Y_t"][0], target_var_ID, unnorm=True)
        errors.extend(np.abs(mean - true))
    return np.sqrt(np.mean(np.concatenate(errors) ** 2))

def gen_tasks(dates, progress=True):
    tasks = []
    for date in notebook.tqdm(dates, disable=not progress):
        task = task_loader(date, context_sampling=["all"], target_sampling="all")
        tasks.append(task)
    return tasks

set_gpu_default_device()
losses = []
val_rmses = []
train_range = pd.date_range('2015-01-02T12:00:00.000000000', '2015-12-31T12:00:00.000000000')
val_range = pd.date_range('2016-01-01T12:00:00.000000000', '2016-12-31T12:00:00.000000000')
val_rmse_best = np.inf
trainer = Trainer(model, lr=5e-5)
for epoch in range(5):
    # train_range is already a DatetimeIndex, so take every 5th date directly
    # (train_range[1] is the second date, not the end of the range)
    train_tasks = gen_tasks(train_range[::5], progress=False)
    batch_losses = trainer(train_tasks)
    losses.append(np.mean(batch_losses))
    val_rmses.append(compute_val_rmse(model, val_tasks))
    if val_rmses[-1] < val_rmse_best:
        val_rmse_best = val_rmses[-1]

Active Learning Algorithm:

alg = GreedyAlgorithm(
    model,
    X_s = anomalies,
    X_t = anomalies,
    context_set_idx=0,
    target_set_idx=0,
    N_new_context=3,
    progress_bar=True,
)

acquisition_fn = Stddev(model)

val_dates = pd.date_range('2016-01-01T12:00:00.000000000', '2016-12-31T12:00:00.000000000')[::5]
placement_dates = val_dates
placement_tasks = task_loader(placement_dates, context_sampling="all")

# Trigger the error
X_new_df, acquisition_fn_ds = alg(acquisition_fn, placement_tasks)

# Error: GriddedDataError: Cannot append to gridded data
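
One thing that may be worth trying: the placement tasks above are built with context_sampling="all", which keeps the context set on the full grid. The sketch below instead requests a fixed number of randomly sampled context points, which should give an off-grid context set that can be appended to. `make_offgrid_placement_tasks` is a hypothetical wrapper, and the default of 100 points is arbitrary. It assumes the TaskLoader call accepts an integer `context_sampling` and a `seed_override` argument, which should be checked against the installed deepsensor version.

```python
def make_offgrid_placement_tasks(task_loader, dates, n_context=100, seed=0):
    """Hypothetical wrapper: build placement tasks with off-grid context.

    Assumes `task_loader` is a deepsensor TaskLoader whose `context_sampling`
    argument accepts an integer number of points to sample at random, and
    that `seed_override` fixes the random sample for reproducibility.
    """
    return task_loader(dates, context_sampling=n_context, seed_override=seed)
```

With the objects defined above, the call would be placement_tasks = make_offgrid_placement_tasks(task_loader, placement_dates). This changes what the model sees as context, so it sidesteps the error rather than adding gridded support.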

Error Message

33%|███▎      | 74/222 [01:35<03:11,  1.29s/it]

---------------------------------------------------------------------------
GriddedDataError                          Traceback (most recent call last)
...
File "~/deepsensor_env_gpu/lib/python3.10/site-packages/deepsensor/data/task.py", line 377, in append_obs_to_task
    raise GriddedDataError("Cannot append to gridded data")

Expected Behavior

The active learning algorithm should support appending new observations to gridded data or handle it in a way that's compatible with gridded datasets.
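
One compatible way to handle this, sketched here under stated assumptions rather than as a proposed patch, is to flatten a gridded set into off-grid form so that new points can be concatenated onto it. `gridded_to_offgrid` is a hypothetical helper. It assumes the grid is given as a tuple of 1D coordinate arrays and the values as an array of shape (n_vars, n1, n2), and it skips cells with missing values.

```python
import numpy as np


def gridded_to_offgrid(x_grid, y_grid):
    """Hypothetical helper: flatten a gridded set into off-grid arrays.

    x_grid: tuple (x1, x2) of 1D coordinate arrays with lengths n1 and n2.
    y_grid: array of shape (n_vars, n1, n2).
    Returns X with shape (2, N) and Y with shape (n_vars, N), skipping
    cells where any variable is NaN.
    """
    x1, x2 = x_grid
    # "ij" indexing keeps the (n1, n2) layout so coordinates line up with y_grid
    xx1, xx2 = np.meshgrid(np.ravel(x1), np.ravel(x2), indexing="ij")
    X = np.stack([xx1.ravel(), xx2.ravel()], axis=0)
    Y = y_grid.reshape(y_grid.shape[0], -1)
    keep = ~np.isnan(Y).any(axis=0)
    return X[:, keep], Y[:, keep]


x1 = np.linspace(0.0, 1.0, 3)
x2 = np.linspace(0.0, 1.0, 4)
y = np.arange(12, dtype=float).reshape(1, 3, 4)
y[0, 0, 0] = np.nan  # one missing cell

X, Y = gridded_to_offgrid((x1, x2), y)
print(X.shape, Y.shape)  # (2, 11) (1, 11)
```

Flattening a full-resolution grid produces one context point per cell, so memory use grows with the grid size; subsampling the grid first may be necessary in practice.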

Additional Context

Support for gridded data is essential for many geospatial applications, and providing robust handling of such data structures will enhance the flexibility and applicability of the active learning modules.

DaniJonesOcean commented 1 month ago

@eredding02 I can run this notebook on Google Colab with a T4 GPU. It executes without any code errors, but it does run out of RAM (the instance has 15 GB of GPU RAM and 12.7 GB of CPU/system RAM). Perhaps it could be run with additional GPUs on U-M HPC?

eredding02 commented 1 month ago

@DaniJonesOcean I still get the same GriddedDataError with that linked notebook.

DaniJonesOcean commented 1 month ago

@eredding02 Ah, okay. I wonder if it's ultimately a memory issue.

A suggestion: In the meantime, perhaps after the individual lake training, you could use the ERA5 sample data and attached notebook to work on the active learning and acquisition function issues:

https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/issues/24

https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/issues/25

That might help us get our heads around this issue a bit better.

DaniJonesOcean commented 1 month ago

DJ: Next try running this on U-M HPC (more memory)

DaniJonesOcean commented 1 month ago

Will close for now to refocus, but can open up again if needed
