Error when running NG_OK

MiniBullLab commented 2 years ago

Command run:

python3 easy_tools/easy_ai.py -t NG_OK -g 0 -i /easy_data/dataset/blt_classify/ImageSets/train.txt -v /easy_data/dataset/blt_classify/ImageSets/val.txt

It fails with the following error:

/easy_data/easy_ai/easyai/loss/cls/ce2d_loss.py:115: UserWarning: Using a target size (torch.Size([16])) that is different to the input size (torch.Size([16, 1])) is deprecated. Please ensure they have the same size.
  reduction=self.reduction)
2021-10-22 10:30:28,344 ERROR   [classify_train.py, 36] output with shape [1, 512] doesn't match the broadcast shape [17, 512]

lpj0822 commented 2 years ago

Please test this one.

foww-0001 commented 2 years ago

2021-10-26 03:30:01,348 INFO    [common_train.py, 135] Train image count is : 71
2021-10-26 03:30:12,410 ERROR   [classify_train.py, 37] Traceback (most recent call last):
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 32, in train
    self.train_epoch(epoch, self.lr_scheduler, self.dataloader)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 47, in train_epoch
    loss_value = self.compute_backward(batch_data, index)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 62, in compute_backward
    self.optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/sgd.py", line 112, in step
    p.add_(d_p, alpha=-group['lr'])
RuntimeError: output with shape [1, 512] doesn't match the broadcast shape [17, 512]

2021-10-26 03:30:12,410 ERROR   [classify_train.py, 38] output with shape [1, 512] doesn't match the broadcast shape [17, 512]
2021-10-26 03:30:12,412 DEBUG   [train_task.py, 55] {'data_channel': 3, 'image_size': [224, 224], 'mean': [0.5070751592371323, 0.48654887331495095, 0.4409178433670343], 'normalize_type': -1, 'resize_type': 1, 'std': [0.2666410733740041, 0.2666410733740041, 0.2666410733740041]}
2021-10-26 03:30:12,412 DEBUG   [model_factory.py, 32] {'type': 'binarynet', 'data_channel': 3, 'class_number': 2}
2021-10-26 03:30:12,412 DEBUG   [model_factory.py, 61] {'type': 'binarynet', 'data_channel': 3, 'class_number': 2}
2021-10-26 03:30:12,412 DEBUG   [backbone_factory.py, 27] {'data_channel': 3, 'type': 'resnet18'}
2021-10-26 03:30:12,474 DEBUG   [loss_factory.py, 27] {'type': 'bceLoss', 'weight_type': 0, 'reduction': 'mean', 'ignore_index': 250}
2021-10-26 03:30:12,592 WARNING [torch_model_process.py, 72] Error(s) in loading state_dict for BinaryClassNet:
    size mismatch for fcLayer_2.linear.weight: copying a param with shape torch.Size([17, 512]) from checkpoint, the shape in current model is torch.Size([1, 512]).
    size mismatch for fcLayer_2.linear.bias: copying a param with shape torch.Size([17]) from checkpoint, the shape in current model is torch.Size([1]).
2021-10-26 03:30:13,321 INFO    [easy_ai.py, 71] easyai process end!

foww-0001 commented 2 years ago

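Found the cause: classnet training and NG_OK training both save their model as cls_latest.pt. If NG_OK is trained after classnet, it loads the classnet checkpoint, whose number of classes is different, and the run fails.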
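
One possible way to avoid the collision is to give each task its own checkpoint file name. The sketch below only illustrates that idea; the helper name latest_checkpoint_path, the "snapshot" directory, and the naming scheme are assumptions and not necessarily how easy_ai handles it.

```python
from pathlib import Path


def latest_checkpoint_path(save_dir: str, task_name: str) -> Path:
    """Build a per-task path for the 'latest' checkpoint.

    Hypothetical helper: the task name is embedded in the file name so
    that two tasks never overwrite or load each other's weights.
    """
    return Path(save_dir) / f"{task_name}_latest.pt"


# "snapshot" is a placeholder directory used only for illustration.
classnet_ckpt = latest_checkpoint_path("snapshot", "classnet")
ng_ok_ckpt = latest_checkpoint_path("snapshot", "NG_OK")

print(classnet_ckpt.name)
print(ng_ok_ckpt.name)
```

With distinct file names, an NG_OK run would never pick up a checkpoint written by a classnet run.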

foww-0001 commented 2 years ago

2021-10-26 06:02:06,849 ERROR   [train_task.py, 42] Traceback (most recent call last):
  File "/easy_data/easy_ai/easyai/train_task.py", line 39, in train
    task.train(self.train_path, self.val_path)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 35, in train
    self.test(val_path, epoch, save_model_path)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 77, in test
    precision, average_loss = self.classify_test.test(epoch)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_test.py", line 35, in test
    batch_data['label'].to(prediction.device))
  File "/easy_data/easy_ai/easyai/evaluation/cls/classify_accuracy.py", line 28, in torch_eval
    precision = self.accuracy(output, target, self.param_top)
  File "/easy_data/easy_ai/easyai/evaluation/cls/classify_accuracy.py", line 77, in accuracy
    pred = (output >= self.threshold).astype(int)
AttributeError: 'Tensor' object has no attribute 'astype'

2021-10-26 06:02:06,849 ERROR   [train_task.py, 43] 'Tensor' object has no attribute 'astype'

foww-0001 commented 2 years ago

Change pred = (output >= self.threshold).astype(int) to pred = (output >= self.threshold).int() or pred = (output >= self.threshold).to(torch.int32). astype is a NumPy array method; torch tensors do not have it.
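
A minimal check of the two replacements, assuming a threshold of 0.5 and made-up scores (neither value comes from the easy_ai code):

```python
import torch

threshold = 0.5  # stand-in for self.threshold; the real value is an assumption
output = torch.tensor([0.2, 0.7, 0.5])  # made-up scores for illustration

mask = output >= threshold      # boolean tensor
pred_a = mask.int()             # first suggested replacement
pred_b = mask.to(torch.int32)   # second suggested replacement

print(pred_a.tolist())
print(torch.equal(pred_a, pred_b))
```

Both calls produce an int32 tensor, so either form works as a replacement for the astype(int) call.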

foww-0001 commented 2 years ago

After this change it runs normally.

lpj0822 commented 2 years ago

Fixed.

foww-0001 commented 2 years ago

Pulled the latest code, and it now fails with:

/easy_data/easy_ai/easyai/loss/cls/ce2d_loss.py:115: UserWarning: Using a target size (torch.Size([16])) that is different to the input size (torch.Size([16, 1])) is deprecated. Please ensure they have the same size.
  reduction=self.reduction)
2021-10-26 08:33:10,550 ERROR   [train_task.py, 42] Traceback (most recent call last):
  File "/easy_data/easy_ai/easyai/train_task.py", line 39, in train
    task.train(self.train_path, self.val_path)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 32, in train
    self.train_epoch(epoch, self.lr_scheduler, self.dataloader)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 47, in train_epoch
    loss_value = self.compute_backward(batch_data, index)
  File "/easy_data/easy_ai/easyai/tasks/cls/classify_train.py", line 56, in compute_backward
    self.loss_backward(loss)
  File "/easy_data/easy_ai/easyai/tasks/utility/common_train.py", line 105, in loss_backward
    if self.train_task_config.sparse_config.get('enable_sparse', None):
AttributeError: 'NoneType' object has no attribute 'get'

2021-10-26 08:33:10,550 ERROR   [train_task.py, 43] 'NoneType' object has no attribute 'get'
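
The traceback indicates that sparse_config was None when .get was called on it. The thread does not show the actual fix; the sketch below is one defensive way to read such a flag, with the helper name sparse_enabled being hypothetical.

```python
from typing import Optional


def sparse_enabled(sparse_config: Optional[dict]) -> bool:
    """Return True only when a config dict exists and enables sparse training.

    Hypothetical helper: a missing (None) config is treated as 'disabled'
    instead of raising AttributeError.
    """
    if sparse_config is None:
        return False
    return bool(sparse_config.get("enable_sparse", False))


print(sparse_enabled(None))
print(sparse_enabled({"enable_sparse": True}))
```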

lpj0822 commented 2 years ago

Fixed.

foww-0001 commented 2 years ago

After pulling the latest branch, training runs normally.

MiniBullLab / easy_ai

Error when running NG_OK #183