Closed davidwilby closed 4 months ago
@davidwilby Not sure if this helps, but I ran into something similar a while ago, where the target sets had different numbers of targets across tasks while batching, and I adapted the `concat_tasks` function to randomly subsample the targets to a common batch size:
```python
import numpy as np

for target_set_i in range(n_target_sets):
    # Check whether this target set has different numbers of targets across tasks
    n_target_obs = [task["Y_t"][target_set_i].size for task in tasks]
    if not all(n == n_target_obs[0] for n in n_target_obs):
        # For this target set, subsample every task down to the smallest count
        shapes = [task["Y_t"][target_set_i].shape[-1] for task in tasks]
        min_n = min(shapes)
        for task in tasks:
            rand_indices = np.random.choice(
                np.arange(task["Y_t"][target_set_i].shape[-1]),
                size=min_n,
                replace=False,
            )
            task["Y_t"][target_set_i] = task["Y_t"][target_set_i][..., rand_indices]
            task["X_t"][target_set_i] = task["X_t"][target_set_i][..., rand_indices]
```
Not sure if this helps, but I would agree that it would be nice to include some functionality that handles this for batched training, as it can happen quite frequently.
Hi @davidwilby + @MartinSJRogers, thank you for raising this :) This boils down to a few things:

- Missing data in context sets is handled via `neuralprocesses.Masked` objects, which are constructed by the `Task.mask_nans_{numpy,nps}` methods you mentioned.
- There is currently no support in `deepsensor`/`neuralprocesses` for padding the target arrays and then masking the padded values from the loss.

The workaround, as suggested in the error message, is to manually run the model multiple times in a for loop over your 'batch', and then average the losses within your model update. This gives you the smoother loss surface of batch training, but unfortunately it doesn't give you the computational efficiency of running on multiple examples in parallel on a GPU.
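A minimal sketch of that for-loop workaround, with a toy linear model and MSE standing in for the ConvNP and its loss (all names here are illustrative, not deepsensor's API):

```python
import numpy as np

# Hedged sketch: run the model once per task in a Python loop, average the
# per-task losses, and apply a single parameter update. The "model" is a toy
# linear predictor; deepsensor's loss would be the ConvNP NLL instead.
rng = np.random.default_rng(0)
w = np.zeros(2)  # toy model parameters

# Tasks with *different* numbers of targets -- fine, since each runs alone
tasks = [(rng.normal(size=(5, 2)), rng.normal(size=5)),
         (rng.normal(size=(3, 2)), rng.normal(size=3))]

lr = 0.1
grad = np.zeros_like(w)
losses = []
for X, y in tasks:
    resid = X @ w - y
    losses.append((resid ** 2).mean())   # per-task MSE loss
    grad += 2 * X.T @ resid / len(y)     # gradient of that task's loss
mean_loss = np.mean(losses)              # smoother "batch" loss
w -= lr * grad / len(tasks)              # one update from the averaged gradient
```

Averaging gradients across the loop is mathematically equivalent to one update on the mean loss, which is why this recovers the smoother loss surface of batch training.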
@nilsleh's workaround of subsampling to the smallest number of targets is a nice idea, although the model will see fewer target points per batch than it would otherwise, so this is a trade-off between computational efficiency and learning efficiency. If the number of non-missing target points is similar across all `Task`s, which looks to be the case from `[9460279, 10432117, 8255541, 10345501]`, then it might not be a bad shout.
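To put a rough number on that trade-off for the target counts quoted above, subsampling every task down to the minimum retains this fraction of the total target points:

```python
# Fraction of target points retained if every task is subsampled down to the
# smallest task's target count (counts taken from the printout above).
counts = [9460279, 10432117, 8255541, 10345501]
min_n = min(counts)
kept = min_n * len(counts) / sum(counts)
print(f"{kept:.1%} of target points retained")  # roughly 86%
```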
> When I remove the call to `remove_target_nans` here for testing, of course the batch sizes are the same and the `ValueError` above isn't raised, the rest of `concat_tasks` runs and `mask_nans_{numpy,nps}` are called successfully. This, however, results in an error further down the line in which the `Masked` object from `neuralprocesses` is found not to have the `dtype` attribute:
`neuralprocesses` stack traces can be confusing and the `dtype` error isn't clear, but you can't have `neuralprocesses.Masked` objects in the targets. The targets need to be vanilla tensors. Only context data can have `neuralprocesses.Masked` objects, and the missing data will be dealt with under the hood, as mentioned above.
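As a rough illustration of what the context-side masking does (a NumPy sketch of the idea, not the actual `mask_nans_numpy` implementation):

```python
import numpy as np

# Context values with a missing entry, in the (batch, y_dim, n_points) layout
Y_c = np.array([[[1.0, np.nan, 3.0]]])

mask = (~np.isnan(Y_c)).astype(np.float32)      # 1 = observed, 0 = missing
Y_c_filled = np.where(np.isnan(Y_c), 0.0, Y_c)  # fill NaNs so downstream ops stay finite

# A Masked-style object carries (values, mask) together so the model can
# ignore the missing entries under the hood.
```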
Hope this clears things up, and please close if so :)
I have a related question about NaNs in target sets, which is the case for the data I am working with. If I don't modify anything and use the provided `Trainer` code as such:
```python
trainer = Trainer(model, lr=5e-5)
batch_losses = trainer(train_tasks, batch_size=None)
```
A single task looks like this in the `loss_fn` computation after the `modify_task` call here:
```
time: Timestamp/2013-06-15 12:00:00
ops: ['str/batch_dim', 'str/float32', 'str/numpy_mask', 'str/nps_mask', 'str/tensor']
X_c: ['Tensor/torch.float32/torch.Size([1, 2, 256000])', 'Tensor/torch.float32/torch.Size([1, 2, 256000])', 'Tensor/torch.float32/torch.Size([1, 2, 256000])']
Y_c: ['Masked/(y=torch.float32/torch.Size([1, 1, 256000]))/(mask=torch.float32/torch.Size([1, 1, 256000]))', 'Masked/(y=torch.float32/torch.Size([1, 1, 256000]))/(mask=torch.float32/torch.Size([1, 1, 256000]))', 'Tensor/torch.float32/torch.Size([1, 1, 256000])']
X_t: ['Tensor/torch.float32/torch.Size([1, 2, 224000])']
Y_t: ['Masked/(y=torch.float32/torch.Size([1, 1, 224000]))/(mask=torch.float32/torch.Size([1, 1, 224000]))']
```
And I get the error: `AttributeError: 'Masked' object has no attribute 'dtype'`.
If I change the loss function to remove the targets before modifying the task by adding `task.remove_target_nans()` (while keeping `batch_size=None`), or just change the trainer to `batch_size > 2` (because then `remove_target_nans()` is called in `concat_tasks`), a single task looks like this:
```
time: Timestamp/2013-06-15 12:00:00
ops: ['str/target_nans_removed', 'str/batch_dim', 'str/float32', 'str/numpy_mask', 'str/nps_mask', 'str/tensor']
X_c: ['Tensor/torch.float32/torch.Size([1, 2, 256000])', 'Tensor/torch.float32/torch.Size([1, 2, 256000])', 'Tensor/torch.float32/torch.Size([1, 2, 256000])']
Y_c: ['Masked/(y=torch.float32/torch.Size([1, 1, 256000]))/(mask=torch.float32/torch.Size([1, 1, 256000]))', 'Masked/(y=torch.float32/torch.Size([1, 1, 256000]))/(mask=torch.float32/torch.Size([1, 1, 256000]))', 'Tensor/torch.float32/torch.Size([1, 1, 256000])']
X_t: ['Tensor/torch.float32/torch.Size([1, 2, 215843])']
Y_t: ['Tensor/torch.float32/torch.Size([1, 1, 215843])']
```
but then I get the `neuralprocesses` library error: `AssertionError: Expected not a parallel of elements, but got inputs and outputs in parallel.` And I am not sure what I have done wrong.
Hi @nilsleh, the `AttributeError: 'Masked' object has no attribute 'dtype'` is exactly what is described above: essentially, you can't train a model with NaNs in targets. It is unfortunate that the error message is confusing when target NaNs are present. As you say, using `batch_size > 1` means target NaNs are automatically removed within the `concat_tasks` method.
Regarding the `AssertionError: Expected not a parallel of elements, but got inputs and outputs in parallel`, I have never seen that `neuralprocesses` error before. The shape of the `Task` looks fine, but I am missing context for what exact code you call prior to this. Would you be able to produce an MWE in a Colab by generating random data?
Hi @tom-andersson thanks for the reply, I have created a gist with the accompanying data I am using. The data tar file also contains the normalization parameters for the data processor.
EDIT: I was able to resolve it thanks to Wessel, it was a misconfiguration of the data processor and model.
Glad you could solve this @nilsleh. To copy over your solution from the `neuralprocesses` GitHub for future reference:

> I had forgotten to pass in the task_loader as an argument to the ConvNP model, as I was using multiple context sets. That just initializes a model with default parameters, which then results in a mismatch when you try to pass in your "actual" data.
There were no complaints about closing this issue, so closing now.
@tom-andersson et al. I wonder if you can help clear up some difficulty that @MartinSJRogers and I are having with batched training for gridded data with missing values using deepsensor.
As yet I'm unable to work out whether we're doing something incorrectly or whether there are bugs in deepsensor's implementation.
We're working with gridded data with missing values represented as NaNs as specified in the Data Requirements section of the docs.
When setting a `batch_size` during training, `concat_tasks` is called, and in the below snippet the `remove_target_nans()` method is called: https://github.com/alan-turing-institute/deepsensor/blob/aeccc097699c5963fa81b4601cfa11fad5daa41b/deepsensor/data/task.py#L477-L489
This results in a `ValueError` raised later in `concat_tasks`, since there are different numbers of targets in each batch, and as a result we don't get to the calls to `mask_nans_numpy` and `mask_nans_nps` towards the end of `concat_tasks`.

I'm confused by this, since from the message above ("Cannot concatenate tasks that have had NaNs masked. Masking will be applied automatically after concatenation.") and the later call to `mask_nans_{numpy,nps}`, it seems like this should be handled by those methods.

When I remove the call to `remove_target_nans` here for testing, of course the batch sizes are the same and the `ValueError` above isn't raised; the rest of `concat_tasks` runs and `mask_nans_{numpy,nps}` are called successfully. This, however, results in an error further down the line in which the `Masked` object from `neuralprocesses` is found not to have the `dtype` attribute:

Full stack trace
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[12], line 15
     13 trainer = Trainer(model, lr=5e-5)
     14 for epoch in tqdm(range(1)):
---> 15     batch_losses = trainer(train_tasks, tqdm_notebook=True, batch_size=None) # error here due to filesize. I have attempted using batch_size = n,
     16     # but get seperate error asserting that the number of targets in each batch must be the same.
     17     # Todo- work out how to calculate number of targets in each batch, and ensure the batch size allows me to honour this assertion.
     18     losses.append(np.mean(batch_losses))

File _/deepsensor/train/train.py:177, in Trainer.__call__(self, tasks, batch_size, progress_bar, tqdm_notebook)
    170 def __call__(
    171     self,
    172     tasks: List[Task],
    (...)
    175     tqdm_notebook=False,
    176 ) -> List[float]:
--> 177     return train_epoch(
    178         model=self.model,
    179         tasks=tasks,
    180         batch_size=batch_size,
    181         opt=self.opt,
    182         progress_bar=progress_bar,
    183         tqdm_notebook=tqdm_notebook,
    184     )

File _/deepsensor/train/train.py:145, in train_epoch(model, tasks, lr, batch_size, opt, progress_bar, tqdm_notebook)
    143 else:
    144     task = tasks[batch_i]
--> 145 batch_loss = train_step(task)
    146 batch_losses.append(batch_loss)
    148 return batch_losses

File _/deepsensor/train/train.py:116, in train_epoch.
```

Are we doing something incorrectly here? Or are there bugs in the implementation? Happy to add more docs when we've worked out what we're doing, or contribute bug fixes if required!
Lastly, for non-batched training, are `mask_nans_{numpy,nps}` run somewhere else? I notice that they're called in `modify_task`, but I'm not yet sure when this is called.