CIGLR-ai-lab / GreatLakes-TempSensors

Collaborative repository for optimizing the placement of temperature sensors in the Great Lakes using the DeepSensor machine learning framework. Aiming to enhance the quantitative understanding of surface temperature variability for better environmental monitoring and decision-making.
MIT License

Bug Report: Cannot Append to Gridded Data in Active Learning Algorithm #28

Closed: DaniJonesOcean closed 1 month ago

DaniJonesOcean commented 1 month ago

Explanation

The GriddedDataError: Cannot append to gridded data error indicates that the greedy algorithm cannot append new observations to a context set that is stored in gridded form. Appending appears to be supported only for non-gridded (off-grid) context sets, where each new context point can be added as an extra coordinate-value pair; a regular grid has no slot for an arbitrary extra point.
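
To make the distinction concrete, here is a minimal NumPy-only sketch of the failure mode. It is not DeepSensor's code: the append_obs helper, the locally defined GriddedDataError class, and the data layouts (a tuple of coordinate axes for gridded data, a (2, N) coordinate array for off-grid data) are assumptions made for illustration.

```python
import numpy as np


class GriddedDataError(Exception):
    """Local stand-in for the exception named in the traceback (hypothetical)."""


def append_obs(X_c, Y_c, X_new, Y_new):
    """Append observations to a context set (hypothetical helper).

    Assumed layouts:
      off-grid: X_c is a (2, N) array of coordinates, Y_c is (C, N)
      gridded:  X_c is a tuple (x1, x2) of 1D axes, Y_c is (C, n1, n2)
    """
    if isinstance(X_c, tuple):
        # A regular grid cannot hold one extra point at an arbitrary location.
        raise GriddedDataError("Cannot append to gridded data")
    X_out = np.concatenate([X_c, X_new], axis=-1)
    Y_out = np.concatenate([Y_c, Y_new], axis=-1)
    return X_out, Y_out


# New observation: one point at (0.9, 0.4) with value 0.3.
X_new = np.array([[0.9], [0.4]])
Y_new = np.array([[0.3]])

# Off-grid context set: appending simply adds a column.
X_off = np.array([[0.1, 0.5], [0.2, 0.8]])
Y_off = np.array([[1.0, -0.5]])
X_out, Y_out = append_obs(X_off, Y_off, X_new, Y_new)

# Gridded context set: the same call fails.
X_grid = (np.linspace(0, 1, 4), np.linspace(0, 1, 5))
Y_grid = np.zeros((1, 4, 5))
try:
    append_obs(X_grid, Y_grid, X_new, Y_new)
    msg = None
except GriddedDataError as err:
    msg = str(err)
```

If this picture is right, the tasks passed to the greedy algorithm need an off-grid context set for the append step to succeed.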



Description

Running the GreedyAlgorithm from deepsensor.active_learning on tasks whose context set is gridded raises a GriddedDataError as soon as the algorithm tries to append a new observation. The current implementation therefore does not appear to support modifying gridded context sets dynamically, which limits the algorithm's usefulness for active learning with gridded data.

Steps to Reproduce

  1. Set Up Environment:

    import logging
    logging.captureWarnings(True)
    
    import deepsensor.torch
    from deepsensor.model import ConvNP
    from deepsensor.data import DataProcessor, TaskLoader
    from deepsensor.train import set_gpu_default_device
    from deepsensor.train import Trainer
    from deepsensor.active_learning import GreedyAlgorithm
    from deepsensor.active_learning.acquisition_fns import Stddev
    import pandas as pd
    import xarray as xr
    import numpy as np
    from tqdm import notebook
    
    # Load datasets
    dat15 ='/nfs/turbo/seas-dannes/SST-sensor-placement-input/GLSEA3_NETCDF/GLSEA3_2015.nc'
    dat14 ='/nfs/turbo/seas-dannes/SST-sensor-placement-input/GLSEA3_NETCDF/GLSEA3_2014.nc'
    dat16 ='/nfs/turbo/seas-dannes/SST-sensor-placement-input/GLSEA3_NETCDF/GLSEA3_2016.nc'
    
    dat = xr.open_mfdataset([dat14, dat15, dat16],
                            concat_dim='time',
                            combine='nested',
                            chunks={'lat': 'auto', 'lon': 'auto'})
    
    # Process data
    # Replace missing SST cells (NaN) with a constant placeholder value
    mdat = dat.where(dat.sst.notnull(), -0.009)
    climatology = mdat.groupby('time.dayofyear').mean('time')
    anomalies = mdat.groupby('time.dayofyear') - climatology
    
    data_processor = DataProcessor(x1_name="lat", x2_name="lon")
    anom_ds = data_processor(anomalies)
    
    task_loader = TaskLoader(
        context = anom_ds,
        target = anom_ds
    )
    
    val_tasks = []
    for date in pd.date_range('2016-01-01T12:00:00.000000000', '2016-12-31T12:00:00.000000000'):
        task = task_loader(date, context_sampling="all", target_sampling="all")
        val_tasks.append(task)
    
    model = ConvNP(data_processor, task_loader)
    
    def compute_val_rmse(model, val_tasks):
        errors = []
        target_var_ID = task_loader.target_var_IDs[0][0]  # assume 1st target set and 1D
        for task in np.random.choice(val_tasks, 50, replace = False):
            mean = data_processor.map_array(model.mean(task), target_var_ID, unnorm=True)
            true = data_processor.map_array(task["Y_t"][0], target_var_ID, unnorm=True)
            errors.extend(np.abs(mean - true))
        return np.sqrt(np.mean(np.concatenate(errors) ** 2))
    
    def gen_tasks(dates, progress=True):
        tasks = []
        for date in notebook.tqdm(dates, disable=not progress):
            task = task_loader(date, context_sampling=["all"], target_sampling="all")
            tasks.append(task)
        return tasks
    
    set_gpu_default_device()
    losses = []
    val_rmses = []
    train_range = pd.date_range('2015-01-02T12:00:00.000000000', '2015-12-31T12:00:00.000000000')
    val_range = pd.date_range('2016-01-01T12:00:00.000000000', '2016-12-31T12:00:00.000000000')
    val_rmse_best = np.inf
    trainer = Trainer(model, lr=5e-5)
    for epoch in range(5):
        # Use every 5th training date (train_range[1] is only the second date, not the end of the range)
        train_tasks = gen_tasks(train_range[::5], progress=False)
        batch_losses = trainer(train_tasks)
        losses.append(np.mean(batch_losses))
        val_rmses.append(compute_val_rmse(model, val_tasks))  
        if val_rmses[-1] < val_rmse_best:
            val_rmse_best = val_rmses[-1]

  2. Active Learning Algorithm:

    alg = GreedyAlgorithm(
        model,
        X_s = anomalies,
        X_t = anomalies,
        context_set_idx=0,
        target_set_idx=0,
        N_new_context=3,
        progress_bar=True,
    )
    
    acquisition_fn = Stddev(model)
    
    val_dates = pd.date_range('2016-01-01T12:00:00.000000000', '2016-12-31T12:00:00.000000000')[::5]
    placement_dates = val_dates
    placement_tasks = task_loader(placement_dates, context_sampling="all")
    
    # Trigger the error
    X_new_df, acquisition_fn_ds = alg(acquisition_fn, placement_tasks)
    
    # Error: GriddedDataError: Cannot append to gridded data

Error Message

33%|███▎      | 74/222 [01:35<03:11,  1.29s/it]

---------------------------------------------------------------------------
GriddedDataError                          Traceback (most recent call last)
...
File "~/deepsensor_env_gpu/lib/python3.10/site-packages/deepsensor/data/task.py", line 377, in append_obs_to_task
    raise GriddedDataError("Cannot append to gridded data")

Expected Behavior

The active learning algorithm should support appending new observations to gridded data or handle it in a way that's compatible with gridded datasets.

Suggested Solution

  1. Handle Gridded Data: Modify the append_obs_to_task method or the relevant parts of the GreedyAlgorithm to support appending observations to gridded data.

  2. Documentation Improvement: Update the documentation to clearly state any limitations regarding the type of data structures supported by the active learning algorithms, along with possible workarounds for gridded data.
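
One possible workaround, sketched below with NumPy only, is to flatten a gridded context set into a list of off-grid points before placement. The grid_to_offgrid helper and the array layouts are hypothetical, and whether GreedyAlgorithm accepts tasks modified this way has not been tested here. Dropping NaN cells is an extra assumption; the reproduction code above fills them with a constant instead.

```python
import numpy as np


def grid_to_offgrid(x1, x2, Y):
    """Flatten gridded data into off-grid points (hypothetical helper).

    x1: (n1,) axis, x2: (n2,) axis, Y: (C, n1, n2) values.
    Returns X with shape (2, N) and Y with shape (C, N), where cells
    containing NaN in any channel are dropped.
    """
    X1, X2 = np.meshgrid(x1, x2, indexing="ij")      # each (n1, n2)
    X = np.stack([X1.ravel(), X2.ravel()], axis=0)   # (2, n1 * n2)
    Y_flat = Y.reshape(Y.shape[0], -1)               # (C, n1 * n2), same cell order as X
    keep = ~np.isnan(Y_flat).any(axis=0)             # mask of fully observed cells
    return X[:, keep], Y_flat[:, keep]


# Small example: a 3 x 4 grid with one missing cell.
x1 = np.linspace(0.0, 1.0, 3)
x2 = np.linspace(0.0, 1.0, 4)
Y = np.arange(12, dtype=float).reshape(1, 3, 4)
Y[0, 1, 2] = np.nan

X_pts, Y_pts = grid_to_offgrid(x1, x2, Y)
print(X_pts.shape, Y_pts.shape)
```

For a full-resolution lake grid this produces a very large number of context points, so subsampling the flattened points may be needed to keep memory use manageable.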

Additional Context

Support for gridded data is essential for many geospatial applications, and providing robust handling of such data structures will enhance the flexibility and applicability of the active learning modules.

DaniJonesOcean commented 1 month ago

@eredding02 I can run this notebook on Google Colab with a T4 GPU. It executes without any code errors, but it does run out of RAM (the instance has 15 GB of GPU RAM and 12.7 GB of CPU/system RAM). Perhaps it could be run with additional GPUs on U-M HPC?

eredding02 commented 1 month ago

@DaniJonesOcean I still get the same GriddedDataError with that linked notebook.

DaniJonesOcean commented 1 month ago

@eredding02 Ah, okay. I wonder if it's ultimately a memory issue.

A suggestion: In the meantime, perhaps after the individual lake training, you could use the ERA5 sample data and attached notebook to work on the active learning and acquisition function issues:

https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/issues/24

https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/issues/25

That might help us get our heads around this issue a bit better.

DaniJonesOcean commented 1 month ago

DJ: Next try running this on U-M HPC (more memory)

DaniJonesOcean commented 1 month ago

Will close for now to refocus, but can reopen if needed.