fastai / fastai

The fastai deep learning library
http://docs.fast.ai
Apache License 2.0

"tensor size match" error when training an image regression model from scratch #3193

Closed mohsen-saki closed 3 years ago

mohsen-saki commented 3 years ago

Issue is occurring with fastai==2.2.5, fastcore==1.3.19, nbdev==1.1.12

Describe the bug When training an image regression model from scratch using RegressionBlock and MSELossFlat, training throws a RuntimeError saying that the sizes of tensors a and b do not match. The error appears to be raised by the loss function (MSELossFlat).


To Reproduce

  1. A DataFrame with two columns: one named image_path and the other named target_value
  2. Create a data block with RegressionBlock
  3. Create a Learner with xresnet18 and MSELossFlat
  4. Start training the model; see this link for the complete notebook code (a minimal sketch of the setup follows below)
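
For reference, a minimal sketch of the setup in steps 1 to 4, assuming a DataFrame df with the image_path and target_value columns; the splitter, resize, and batch size are illustrative stand-ins, not taken from the notebook:

from fastai.vision.all import *

dblock = DataBlock(
    blocks=(ImageBlock, RegressionBlock),             # image input, single float target
    get_x=ColReader('image_path'),                    # column holding the image file path
    get_y=ColReader('target_value'),                  # column holding the float target
    splitter=RandomSplitter(valid_pct=0.2, seed=42),  # illustrative split
    item_tfms=Resize(224),                            # illustrative resize
)
dls = dblock.dataloaders(df, bs=128)

learn = Learner(dls, xresnet18(), loss_func=MSELossFlat())  # xresnet18() keeps its default 1000 outputs
learn.fit_one_cycle(1, 3e-3)                                # raises the RuntimeError below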

Expected behavior Training should run smoothly, as it does for a classification problem or for regression with a pre-trained model.

Error with full stack trace

Alternatively, see the link above.

RuntimeError                              Traceback (most recent call last)
<ipython-input-10-db5d2a1287cb> in <module>()
----> 1 learn.fit_one_cycle(1, 3e-3)

20 frames
/usr/local/lib/python3.6/dist-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    110     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    111               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 112     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
    113 
    114 # Cell

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    209             self.opt.set_hypers(lr=self.lr if lr is None else lr)
    210             self.n_epoch = n_epoch
--> 211             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    212 
    213     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    158 
    159     def _with_events(self, f, event_type, ex, final=noop):
--> 160         try: self(f'before_{event_type}');  f()
    161         except ex: self(f'after_cancel_{event_type}')
    162         self(f'after_{event_type}');  final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_fit(self)
    200         for epoch in range(self.n_epoch):
    201             self.epoch=epoch
--> 202             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
    203 
    204     def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    158 
    159     def _with_events(self, f, event_type, ex, final=noop):
--> 160         try: self(f'before_{event_type}');  f()
    161         except ex: self(f'after_cancel_{event_type}')
    162         self(f'after_{event_type}');  final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_epoch(self)
    194 
    195     def _do_epoch(self):
--> 196         self._do_epoch_train()
    197         self._do_epoch_validate()
    198 

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_epoch_train(self)
    186     def _do_epoch_train(self):
    187         self.dl = self.dls.train
--> 188         self._with_events(self.all_batches, 'train', CancelTrainException)
    189 
    190     def _do_epoch_validate(self, ds_idx=1, dl=None):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    158 
    159     def _with_events(self, f, event_type, ex, final=noop):
--> 160         try: self(f'before_{event_type}');  f()
    161         except ex: self(f'after_cancel_{event_type}')
    162         self(f'after_{event_type}');  final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in all_batches(self)
    164     def all_batches(self):
    165         self.n_iter = len(self.dl)
--> 166         for o in enumerate(self.dl): self.one_batch(*o)
    167 
    168     def _do_one_batch(self):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in one_batch(self, i, b)
    182         self.iter = i
    183         self._split(b)
--> 184         self._with_events(self._do_one_batch, 'batch', CancelBatchException)
    185 
    186     def _do_epoch_train(self):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    158 
    159     def _with_events(self, f, event_type, ex, final=noop):
--> 160         try: self(f'before_{event_type}');  f()
    161         except ex: self(f'after_cancel_{event_type}')
    162         self(f'after_{event_type}');  final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_one_batch(self)
    170         self('after_pred')
    171         if len(self.yb):
--> 172             self.loss_grad = self.loss_func(self.pred, *self.yb)
    173             self.loss = self.loss_grad.clone()
    174         self('after_loss')

/usr/local/lib/python3.6/dist-packages/fastai/losses.py in __call__(self, inp, targ, **kwargs)
     33         if targ.dtype in [torch.int8, torch.int16, torch.int32]: targ = targ.long()
     34         if self.flatten: inp = inp.view(-1,inp.shape[-1]) if self.is_2d else inp.view(-1)
---> 35         return self.func.__call__(inp, targ.view(-1) if self.flatten else targ, **kwargs)
     36 
     37 # Cell

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py in forward(self, input, target)
    444 
    445     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 446         return F.mse_loss(input, target, reduction=self.reduction)
    447 
    448 

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in mse_loss(input, target, size_average, reduce, reduction)
   2648             return handle_torch_function(
   2649                 mse_loss, tens_ops, input, target, size_average=size_average, reduce=reduce,
-> 2650                 reduction=reduction)
   2651     if not (target.size() == input.size()):
   2652         warnings.warn("Using a target size ({}) that is different to the input size ({}). "

/usr/local/lib/python3.6/dist-packages/torch/overrides.py in handle_torch_function(public_api, relevant_args, *args, **kwargs)
   1061         # Use `public_api` instead of `implementation` so __torch_function__
   1062         # implementations can do equality/identity comparisons.
-> 1063         result = overloaded_arg.__torch_function__(public_api, types, args, kwargs)
   1064 
   1065         if result is not NotImplemented:

/usr/local/lib/python3.6/dist-packages/fastai/torch_core.py in __torch_function__(self, func, types, args, kwargs)
    323         convert=False
    324         if _torch_handled(args, self._opt, func): convert,types = type(self),(torch.Tensor,)
--> 325         res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
    326         if convert: res = convert(res)
    327         if isinstance(res, TensorBase): res.set_meta(self, as_copy=True)

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __torch_function__(cls, func, types, args, kwargs)
    993 
    994         with _C.DisableTorchFunction():
--> 995             ret = func(*args, **kwargs)
    996             return _convert(ret, cls)
    997 

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in mse_loss(input, target, size_average, reduce, reduction)
   2657         reduction = _Reduction.legacy_get_string(size_average, reduce)
   2658 
-> 2659     expanded_input, expanded_target = torch.broadcast_tensors(input, target)
   2660     return torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
   2661 

/usr/local/lib/python3.6/dist-packages/torch/functional.py in broadcast_tensors(*tensors)
     69         if any(type(t) is not Tensor for t in tensors) and has_torch_function(tensors):
     70             return handle_torch_function(broadcast_tensors, tensors, *tensors)
---> 71     return _VF.broadcast_tensors(tensors)  # type: ignore
     72 
     73 

RuntimeError: The size of tensor a (128000) must match the size of tensor b (128) at non-singleton dimension 0


tcapelle commented 3 years ago

This is completely normal: your xresnet does not know how many outputs your data has. When you call xresnet18 directly, you get the ImageNet default of 1000 classes, so with a batch size of 128 the flattened prediction has 128 × 1000 = 128000 values while the target has only 128. Use create_cnn_model and explicitly set n_out=1; replace your model with this:

model = create_cnn_model(xresnet18, n_out=1)
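
For completeness, a sketch of how this plugs back into the training setup, assuming the dls and loss function from the original notebook:

learn = Learner(dls, model, loss_func=MSELossFlat())  # model now outputs one value per image
learn.fit_one_cycle(1, 3e-3)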

A good practice is to pass one batch through the model manually and check the output shape:

x, y = dls.one_batch()        # grab one batch from the DataLoaders
out = model(x)                # forward pass
test_eq(out.shape, y.shape)   # test_eq (from fastcore) raises if the shapes differ

Cool project! What are you regressing, the wind speed from the image of the storm?

mohsen-saki commented 3 years ago

Thanks mate, it worked like a charm (and though I did my homework before posting here, I could not find a clear and concise explanation). Yep, the project is predicting storm speed from satellite imagery. Nothing new really :) it has been out for a couple of years already.

I suppose I should close this thread. Appreciate your help.

Cheers