Thanks for raising this @nilsleh - if you are using your local version of deepsensor for this, would you be able to `print(task)` before the `batch_loss = train_step(task)` in the `train_epoch` function? We can then check all the shapes conform to `(n_batches, n_features, *n_obs)`, and that the context `n_features` plus the number of context sets (for the density channels) adds up to 8.
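Something like the following would do it (a minimal sketch; `print_task_shapes` is a hypothetical helper, not part of `deepsensor`, and it assumes `task` supports dict-style access with the keys shown in the reply below):

```python
# Hypothetical helper (not part of deepsensor) for eyeballing that every
# array in a task conforms to (n_batches, n_features, *n_obs).
def print_task_shapes(task):
    for key in ("X_c", "Y_c", "X_t", "Y_t"):
        shapes = []
        for entry in task[key]:
            if isinstance(entry, tuple):
                # Gridded coordinates arrive as an (x1, x2) tuple of arrays
                shapes.append(tuple(a.shape for a in entry))
            else:
                # Off-grid arrays, or wrapper objects without a .shape attr
                shapes.append(getattr(entry, "shape", type(entry)))
        print(f"{key}: {shapes}")
```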
This is for `batch_size=1`:

```
time: 2011-06-01 00:00:00
ops: []
X_c: [(2, 246), ((1, 240), (1, 400)), ((1, 72), (1, 120))]
Y_c: [(1, 246), (1, 240, 400), (3, 72, 120)]
X_t: [((1, 241), (1, 401))]
Y_t: [(1, 241, 401)]
```
And this for `batch_size=4`:

```
time: [Timestamp('2016-02-27 00:00:00'), Timestamp('2015-06-14 00:00:00'), Timestamp('2018-12-05 00:00:00'), Timestamp('2015-03-30 00:00:00')]
ops: ['batch_dim', 'float32', 'numpy_mask', 'nps_mask']
X_c: [(4, 2, 246), ((4, 1, 240), (4, 1, 400)), ((4, 1, 72), (4, 1, 120))]
Y_c: [<neuralprocesses.mask.Masked object at 0x7facf14ddcc0>, <neuralprocesses.mask.Masked object at 0x7facf14df6d0>, <neuralprocesses.mask.Masked object at 0x7facf14ddf60>]
X_t: [((4, 1, 241), (4, 1, 401))]
Y_t: [(4, 1, 241, 401)]
```

The shapes for the masked objects are: (4, 1, 246), (4, 1, 240, 400), (4, 3, 72, 120).
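For anyone following along, those shapes can be read straight off the `Masked` objects. A minimal sketch, assuming `neuralprocesses.mask.Masked` exposes the wrapped observations as `.y` alongside the `.mask` attribute referenced later in this thread:

```python
from neuralprocesses.mask import Masked

for y_c in task["Y_c"]:
    if isinstance(y_c, Masked):
        # .y holds the (padded) observations, .mask flags which are real
        print(y_c.y.shape, y_c.mask.shape)
    else:
        print(y_c.shape)
```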
Thanks @nilsleh! Those shapes all check out. The bug is caused by one of the context sets having more than one dimension. Here's an MWE which produces the same error:
```python
import deepsensor.torch
from deepsensor.data import DataProcessor, TaskLoader
from deepsensor.model import ConvNP
from deepsensor.train import Trainer
import xarray as xr
import pandas as pd
import numpy as np
from tqdm import tqdm

# Load raw data
ds_raw = xr.tutorial.open_dataset("air_temperature")
# Add an extra dim (second data variable)
ds_raw["air2"] = ds_raw["air"].copy()
# Normalise data
data_processor = DataProcessor(x1_name="lat", x2_name="lon")
ds = data_processor(ds_raw)
# Set up task loader
task_loader = TaskLoader(context=ds, target=ds)
# Set up model
model = ConvNP(data_processor, task_loader)
# Generate training tasks with up to 100 grid cells as context and all
# grid cells as targets
train_tasks = []
for date in pd.date_range("2013-01-01", "2014-11-30")[::7]:
    N_context = np.random.randint(0, 100)
    task = task_loader(date, context_sampling=N_context, target_sampling="all")
    train_tasks.append(task)
# Train model
trainer = Trainer(model, lr=5e-5)
for epoch in tqdm(range(10)):
    batch_losses = trainer(train_tasks, batch_size=4)
```

```
...
RuntimeError: Given groups=1, weight of size [64, 3, 5, 5], expected input[4, 4, 48, 80] to have 3 channels, but got 4 channels instead
```
@wesselb, do you remember when we found that calling `nps.merge_contexts` on multi-dimensional context sets resulted in repeated density channels? I'm pretty sure that's what's going on here again, and my hacky solution was to manually override the mask of the merged `nps.Masked` context observation objects like `task["Y_c"][i].mask = task["Y_c"][i].mask[:, 0:1, :]`. Any chance something could be going wrong under the hood in `neuralprocesses`, either in `merge_contexts` or the way the ConvNP encoder uses the `.mask` attr of `nps.Masked` objects?
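Spelled out, that hacky workaround looks something like this (a sketch only; it papers over the repeated density channels rather than fixing them):

```python
from neuralprocesses.mask import Masked

for i, y_c in enumerate(task["Y_c"]):
    if isinstance(y_c, Masked):
        # Keep only the first channel of the mask (preserving the dim),
        # so the encoder sees a single density channel per context set
        task["Y_c"][i].mask = task["Y_c"][i].mask[:, 0:1, :]
```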
N.B. Our training unit test includes batching and is passing, but it only tests a 1D context set. Once we patch this bug we should add a test with an N-D context set.
@tom-andersson Ah, I don't quite recall precisely what that problem was. :( Any chance you could post a small example of the repeated density channel issue here?
Hey @wesselb, I created an MWE in pure `neuralprocesses` and there's no error, so it must be on the DeepSensor side. My hypothesis is that it's something to do with applying a numpy NaN mask after merging the context sets into `nps.Masked` objects: https://github.com/tom-andersson/deepsensor/blob/438295797b57bfc2b206d4e3c9a5079c9ed802bb/deepsensor/data/task.py#L554-L558. I'll dig into this.
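To make the hypothesis concrete, here's a purely illustrative numpy sketch (not the actual `deepsensor` code at that link) of how NaN handling on a merged `Masked` object can go wrong through broadcasting:

```python
import numpy as np

def remove_nans(masked):
    # Illustrative shapes: masked.y is (batch, n_features, *n_obs),
    # masked.mask is (batch, 1, *n_obs) -- one density channel.
    nan_locs = np.isnan(masked.y)
    masked.y = np.where(nan_locs, 0.0, masked.y)
    # Shape pitfall: np.where broadcasts the (batch, 1, ...) mask against
    # nan_locs' (batch, n_features, ...) shape, silently expanding the
    # mask to n_features channels -- i.e. repeated density channels.
    masked.mask = np.where(nan_locs, 0.0, masked.mask)
    return masked
```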
Yeah, found it. It was slightly esoteric, but it was an array shape bug in the way NaNs were being removed from the `nps.Masked` objects that come out of `nps.merge_contexts`. I've added a unit test for batch-wise training with multi-dimensional context sets, and it is now passing.
Fixed in v0.3.5 on PyPI. Thanks for catching this @nilsleh, and thanks @wesselb for helping me realise the bug was on the `deepsensor` side, not the `neuralprocesses` side!
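For anyone curious, a regression test for this could look roughly like the following (a sketch, not the actual test added in v0.3.5; it reuses `task_loader`, `model`, and the two-variable `ds` from the MWE above):

```python
def test_batched_training_with_multidim_context():
    # Context set has two data variables ("air" and "air2"), so this
    # exercises the N-D context path with batching
    dates = pd.date_range("2013-01-01", "2013-02-26")[::7]
    tasks = [
        task_loader(date, context_sampling=50, target_sampling="all")
        for date in dates
    ]
    trainer = Trainer(model, lr=5e-5)
    # Used to raise a RuntimeError in the encoder's first conv layer
    # whenever batch_size > 1
    batch_losses = trainer(tasks, batch_size=4)
    assert np.all(np.isfinite(batch_losses))
```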
Ah, I'm glad to hear that you managed to find the bug, @tom-andersson! :)
Description
I am running the new User Guide training notebook to better understand the details of ConvNP training. I downloaded the Jupyter notebook and by default it runs fine; however, I wanted to run training with a defined batch size and therefore added a `batch_size` argument to the trainer. But if `batch_size > 1` (like 4 in this example), I get a shape error in a `neuralprocesses` conv layer (the `RuntimeError` quoted in the MWE above). I find it quite counterintuitive that changing the batch size argument would have such an effect, because I didn't find another place where I would have to change the batch size in the code or adapt anything. So `batch_size=1` works, but `batch_size>1` leads to an error. Comparing the shape outputs from just before entering the decoder with `batch_size=1` and with `batch_size=4`, it seems like the sampled `z` from the `Dirac` `pz` output of the encoder is causing the difference, but I haven't debugged further why changing the batch size would have that effect.
Reproduction steps
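(Reconstructed from the description above; the original repro was the User Guide training notebook with one argument changed. A sketch of the relevant step, reusing the notebook's `model`, `train_tasks`, and imports:)

```python
# Training loop from the notebook, with a batch_size argument added;
# batch_size=1 runs fine, batch_size>1 raises the RuntimeError above
trainer = Trainer(model, lr=5e-5)
for epoch in tqdm(range(10)):
    batch_losses = trainer(train_tasks, batch_size=4)
```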
Version
0.3.4
OS
Linux