abhishekkrthakur / tez

Tez is a super-simple and lightweight Trainer for PyTorch. It also comes with many utils that you can use to tackle over 90% of deep learning projects in PyTorch.
Apache License 2.0

ValueError: operands could not be broadcast together with shapes (256,256,4) (3,) (256,256,4) #8

Closed nvnvashisth closed 3 years ago

nvnvashisth commented 3 years ago

I am trying to use this package, and it throws the error below. I am using the same pipeline as the cassava leaf detection problem, but on a different dataset where the image size is (256, 256).

Could you please help here?

Downloading: "https://github.com/lukemelas/EfficientNet-PyTorch/releases/download/1.0/efficientnet-b4-6ed6700e.pth" to /root/.cache/torch/hub/checkpoints/efficientnet-b4-6ed6700e.pth
100% 74.4M/74.4M [00:00<00:00, 107MB/s]

Loaded pretrained weights for efficientnet-b4
0%| | 0/51 [00:00<?, ?it/s]

ValueError                                Traceback (most recent call last)
 in ()
     11     epochs=10,
     12     callbacks=[es],
---> 13     fp16=True,
     14 )
     15 model.save("model.bin")

6 frames
/usr/local/lib/python3.6/dist-packages/tez/model/model.py in fit(self, train_dataset, valid_dataset, train_sampler, valid_sampler, device, epochs, train_bs, valid_bs, n_jobs, callbacks, fp16)
    295 self.train_state = enums.TrainingState.EPOCH_START
    296 self.train_state = enums.TrainingState.TRAIN_EPOCH_START
--> 297 train_loss = self.train_one_epoch(self.train_loader, device)
    298 self.train_state = enums.TrainingState.TRAIN_EPOCH_END
    299 if self.valid_loader:

/usr/local/lib/python3.6/dist-packages/tez/model/model.py in train_one_epoch(self, data_loader, device)
    176 losses = AverageMeter()
    177 tk0 = tqdm(data_loader, total=len(data_loader))
--> 178 for b_idx, data in enumerate(tk0):
    179 self.train_state = enums.TrainingState.TRAIN_STEP_START
    180 loss, metrics = self.train_one_step(data, device)

/usr/local/lib/python3.6/dist-packages/tqdm/std.py in __iter__(self)
   1102 fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1103
-> 1104 for obj in iterable:
   1105 yield obj
   1106 # Update and possibly print the progressbar.

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    433 if self._sampler_iter is None:
    434 self._reset()
--> 435 data = self._next_data()
    436 self._num_yielded += 1
    437 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
   1083 else:
   1084 del self._task_info[idx]
-> 1085 return self._process_data(data)
   1086
   1087 def _try_put_index(self):

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1109 self._try_put_index()
   1110 if isinstance(data, ExceptionWrapper):
-> 1111 data.reraise()
   1112 return data
   1113

/usr/local/lib/python3.6/dist-packages/torch/_utils.py in reraise(self)
    426 # have message field
    427 raise self.exc_type(message=msg)
--> 428 raise self.exc_type(msg)
    429
    430

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/tez/datasets/image_classification.py", line 48, in __getitem__
    augmented = self.augmentations(image=image)
  File "/usr/local/lib/python3.6/dist-packages/albumentations/core/composition.py", line 171, in __call__
    data = t(**data)
  File "/usr/local/lib/python3.6/dist-packages/albumentations/core/transforms_interface.py", line 38, in __call__
    res[key] = target_function(arg, **dict(params, **target_dependencies))
  File "/usr/local/lib/python3.6/dist-packages/albumentations/augmentations/transforms.py", line 808, in apply
    return F.normalize(image, self.mean, self.std, self.max_pixel_value)
  File "/usr/local/lib/python3.6/dist-packages/albumentations/augmentations/functional.py", line 93, in normalize
    img -= mean
ValueError: operands could not be broadcast together with shapes (256,256,4) (3,) (256,256,4)
abhishekkrthakur commented 3 years ago

It seems like your image is RGBA. Can you convert it to RGB, or do you have to use RGBA? If the latter, then try writing your own dataloader. Can you provide data and code to reproduce the error?
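
A minimal sketch of the conversion, assuming the images are loaded with Pillow and NumPy (the thread does not show the loading code). The image below is a synthetic stand-in, and `subtract_mean` is a hypothetical helper that only mimics the in-place `img -= mean` step from the traceback, not the albumentations implementation. It reproduces the broadcast failure with a 3-value mean and shows that dropping the alpha channel with `Image.convert("RGB")` avoids it:

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for one 256x256 RGBA image from the dataset.
rgba_image = Image.new("RGBA", (256, 256), (10, 20, 30, 255))

# Three-channel mean, as commonly used for ImageNet-style normalization.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32) * 255.0


def subtract_mean(image):
    """Mimic the in-place `img -= mean` step that failed in the traceback."""
    img = np.asarray(image).astype(np.float32)  # astype returns a writable copy
    img -= mean  # needs the last axis of img to have length 3
    return img


try:
    subtract_mean(rgba_image)
    rgba_failed = False
except ValueError as exc:
    # Four channels cannot be broadcast against a 3-value mean.
    rgba_failed = True
    print("RGBA fails:", exc)

# Dropping the alpha channel gives an array of shape (256, 256, 3).
rgb_image = rgba_image.convert("RGB")
rgb_result = subtract_mean(rgb_image)
print("RGB works:", rgb_result.shape)
```

If the alpha channel carries information you need, the alternative is a custom dataset that normalizes with four-value mean and std arrays instead.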

nvnvashisth commented 3 years ago

OK, I converted everything to RGB. I have labels in [0-9] and an image size of 256x256, but now I get this CUDA error. Also, I am running this in Colab.

The code is taken exactly from here: https://www.kaggle.com/abhishek/tez-faster-and-easier-training-for-leaf-detection ;)

Loaded pretrained weights for efficientnet-b4

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-29-de0853739e51> in <module>()
     11     epochs=10,
     12     callbacks=[es],
---> 13     fp16=True
     14 )
     15 model.save("model.bin")

6 frames
/usr/local/lib/python3.6/dist-packages/tez/model/model.py in fit(self, train_dataset, valid_dataset, train_sampler, valid_sampler, device, epochs, train_bs, valid_bs, n_jobs, callbacks, fp16)
    289             n_jobs=n_jobs,
    290             callbacks=callbacks,
--> 291             fp16=fp16,
    292         )
    293 

/usr/local/lib/python3.6/dist-packages/tez/model/model.py in _init_model(self, device, train_dataset, valid_dataset, train_sampler, valid_sampler, train_bs, valid_bs, n_jobs, callbacks, fp16)
     81 
     82         if next(self.parameters()).device != device:
---> 83             self.to(device)
     84 
     85         if self.train_loader is None:

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
    610             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    611 
--> 612         return self._apply(convert)
    613 
    614     def register_backward_hook(

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    357     def _apply(self, fn):
    358         for module in self.children():
--> 359             module._apply(fn)
    360 
    361         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    357     def _apply(self, fn):
    358         for module in self.children():
--> 359             module._apply(fn)
    360 
    361         def compute_should_use_set_data(tensor, tensor_applied):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
    379                 # `with torch.no_grad():`
    380                 with torch.no_grad():
--> 381                     param_applied = fn(param)
    382                 should_use_set_data = compute_should_use_set_data(param, param_applied)
    383                 if should_use_set_data:

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in convert(t)
    608             if convert_to_format is not None and t.dim() == 4:
    609                 return t.to(device, dtype if t.is_floating_point() else None, non_blocking, memory_format=convert_to_format)
--> 610             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    611 
    612         return self._apply(convert)

RuntimeError: CUDA error: device-side assert triggered
abhishekkrthakur commented 3 years ago

Code provided in examples works quite well. This seems like some problem with the model. I can't say without having data and full code to reproduce the error :)
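
The thread never identifies the cause of the device-side assert, so the following is only a sketch of one check worth running, not a diagnosis. A frequent source of this error in classification is a target label outside [0, num_classes), for example 10 labels fed to a model head built for fewer classes. `out_of_range_labels` is a hypothetical helper, and the label list is made up for illustration:

```python
import numpy as np


def out_of_range_labels(targets, num_classes):
    """Return the sorted distinct labels that fall outside [0, num_classes)."""
    targets = np.asarray(targets)
    bad = targets[(targets < 0) | (targets >= num_classes)]
    return sorted(set(bad.tolist()))


# Made-up labels 0..9, matching the label range described in the issue.
labels = list(range(10))

print(out_of_range_labels(labels, num_classes=10))  # → []
print(out_of_range_labels(labels, num_classes=5))   # → [5, 6, 7, 8, 9]
```

Running one batch on CPU, or setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before starting Python, usually gives a more readable error than the asynchronous CUDA message. After a device-side assert, later CUDA calls in the same process (even `model.to(device)`) can keep failing until the runtime is restarted.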

abhishekkrthakur commented 3 years ago

@nvnvashisth I just added a multi-class classification example (flower classification with 104 classes). It might be useful for you: https://github.com/abhishekkrthakur/tez/blob/main/examples/image_classification/flower_classification.py

Let me know if it still doesn't work.

nvnvashisth commented 3 years ago

Code provided in examples works quite well. This seems like some problem with the model. I can't say without having data and full code to reproduce the error :)

I have sent you the code privately via Twitter DM. That was the only way I could find to reach you privately.

nvnvashisth commented 3 years ago

@nvnvashisth I just added a multi-class classification example (flower classification with 104 classes). It might be useful for you: https://github.com/abhishekkrthakur/tez/blob/main/examples/image_classification/flower_classification.py

Let me know if it still doesn't work.

I'll give it a try. Thanks.

nvnvashisth commented 3 years ago

@abhishekkrthakur it is so weird. I didn't really change anything and it started working. No more CUDA error. Thanks for the quick support.

abhishekkrthakur commented 3 years ago

Wow. Maybe you updated torch?

nvnvashisth commented 3 years ago

Not really. I was running in Colab and using the default version.

agg-shambhavi commented 3 years ago

It seems like your image is RGBA. Can you convert it to RGB, or do you have to use RGBA? If the latter, then try writing your own dataloader.

@abhishekkrthakur Thank you! This helped me to solve the above error.