alan-turing-institute / deepsensor

A Python package for tackling diverse environmental prediction tasks with NPs.
https://alan-turing-institute.github.io/deepsensor/

`TaskLoader` makes copies of data, leading to duplication in memory #82

Open tom-andersson opened 11 months ago

tom-andersson commented 11 months ago

In many deepsensor modelling scenarios, the user will have the same dataset (xarray or pandas object) on both the context and target side of the `TaskLoader`. In these cases the `TaskLoader` should clearly reuse the same object in memory. However, part of the `TaskLoader`'s processing returns a copy of each data object, so the context and target entries end up pointing to different copies and the data is duplicated in memory. See the code example below.

from deepsensor.data import DataProcessor, TaskLoader

import xarray as xr

# Load raw data
ds_raw = xr.tutorial.open_dataset("air_temperature")

# Normalise data
data_processor = DataProcessor(x1_name="lat", x2_name="lon")
ds = data_processor(ds_raw)

# Pass the same normalised dataset as both context and target
task_loader = TaskLoader(context=ds, target=ds)

print(task_loader.context[0] is task_loader.target[0])
# False -- the two entries are separate copies in memory

One solution is to use a hashmap/dict that is shared between the context and target data. Some thought would be needed on what the keys should be and on how the context and target lists should link to those entries; a rough sketch is shown below.
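As a minimal sketch of the idea (all names here are hypothetical, and keying on object identity / file path is just one option):

class SharedDataStore:
    """Hypothetical shared store mapping a key to a single processed data object."""

    def __init__(self):
        self._store = {}

    def _key(self, raw):
        # In-memory xarray/pandas objects could be keyed by identity;
        # file-path entries could be keyed by the path string itself.
        return raw if isinstance(raw, str) else id(raw)

    def get_or_add(self, raw, process):
        # Process each unique raw input once and reuse the result thereafter
        key = self._key(raw)
        if key not in self._store:
            self._store[key] = process(raw)
        return self._store[key]

Both the context and target lists would then hold the references returned by get_or_add, so passing the same dataset on both sides would yield a single object in memory and task_loader.context[0] is task_loader.target[0] would be True.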

We will need to test this both for xarray/pandas inputs and for the case where the context/target entries are file paths rather than xarray/pandas objects.
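A regression test for the desired behaviour might look roughly like the following (assuming the fix makes both sides share one object in memory; the file-path variant would first save the dataset to disk and pass the path instead):

import xarray as xr

from deepsensor.data import DataProcessor, TaskLoader

def test_taskloader_shares_data_between_context_and_target():
    ds_raw = xr.tutorial.open_dataset("air_temperature")
    data_processor = DataProcessor(x1_name="lat", x2_name="lon")
    ds = data_processor(ds_raw)

    task_loader = TaskLoader(context=ds, target=ds)

    # After the fix, both sides should reference the same processed object
    assert task_loader.context[0] is task_loader.target[0]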