Jittor / jittor

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
https://cg.cs.tsinghua.edu.cn/jittor/
Apache License 2.0

RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.grad)). #564

Open linengcs opened 1 month ago

linengcs commented 1 month ago

Describe the bug

When executing self.optimizer.step(main_loss), the following error is raised:

Traceback (most recent call last):
  File "train_edge.py", line 506, in <module>
    trainer.train_edge()
  File "train_edge.py", line 290, in train_edge
    self.optimizer.step(main_loss)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 305, in step
    self.pre_step(loss, retain_graph=False)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 220, in pre_step
    self.backward(loss, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 173, in backward
    grads = jt.grad(loss, params_has_grad, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/__init__.py", line 445, in grad
    return core.grad(loss, targets, retain_graph)
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.grad)).

The optimizer used is SGD:

self.optimizer = jt.optim.SGD(params_list,
                              lr=args.lr,
                              momentum=args.momentum,
                              weight_decay=args.weight_decay)

The loss passed in: jt.Var([4.33252289], dtype=float64)
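
Note the dtype above: the loss has ended up float64, while model weights are normally float32. As a quick audit just before the failing step, something like the following sketch could flag any non-float32 parameters (assumes Jittor's Optimizer.param_groups layout; names are illustrative, not code from this repo):

print(main_loss.dtype)                  # float64 in the failing run
for pg in self.optimizer.param_groups:  # each group holds a "params" list of Vars
    for p in pg["params"]:
        if p.dtype != "float32":
            print(p.name(), p.dtype)    # report any parameter that is not float32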

Full Log

(fdlnet_j) llf@XY-TITAN-RTX:/home/ubuntu/hdd2/llf/fdlnet_jittor/scripts$ python train_edge.py --model fdlnet --backbone resnet50 --dataset night --aux
[i 0705 15:10:15.537224 52 compiler.py:956] Jittor(1.3.8.5) src: /home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor
[i 0705 15:10:15.545380 52 compiler.py:957] g++ at /usr/bin/g++(5.5.0)
[i 0705 15:10:15.545582 52 compiler.py:958] cache_path: /home/llf/.cache/jittor/jt1.3.8/g++5.5.0/py3.8.19/Linux-4.15.0-1x37/IntelRXeonRGolx4e/default
[i 0705 15:10:15.579173 52 install_cuda.py:93] cuda_driver_version: [12, 1]
[i 0705 15:10:15.579814 52 install_cuda.py:81] restart /home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/bin/python ['train_edge.py', '--model', 'fdlnet', '--backbone', 'resnet50', '--dataset', 'night', '--aux']
[i 0705 15:10:15.903714 16 compiler.py:956] Jittor(1.3.8.5) src: /home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor
[i 0705 15:10:15.910872 16 compiler.py:957] g++ at /usr/bin/g++(5.5.0)
[i 0705 15:10:15.911057 16 compiler.py:958] cache_path: /home/llf/.cache/jittor/jt1.3.8/g++5.5.0/py3.8.19/Linux-4.15.0-1x37/IntelRXeonRGolx4e/default
[i 0705 15:10:15.944564 16 install_cuda.py:93] cuda_driver_version: [12, 1]
[i 0705 15:10:15.954342 16 __init__.py:411] Found /home/llf/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc(11.2.152) at /home/llf/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc.
[i 0705 15:10:16.037728 16 __init__.py:411] Found gdb(8.1.1) at /usr/bin/gdb.
[i 0705 15:10:16.046927 16 __init__.py:411] Found addr2line(2.30) at /usr/bin/addr2line.
[i 0705 15:10:16.301866 16 compiler.py:1011] cuda key:cu11.2.152_sm_75
[i 0705 15:10:16.767486 16 __init__.py:227] Total mem: 125.56GB, using 16 procs for compiling.
[i 0705 15:10:16.866903 16 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0705 15:10:17.003635 16 init.cc:62] Found cuda archs: [75,]
[i 0705 15:10:17.038976 16 __init__.py:411] Found mpicc(2.1.1) at /usr/bin/mpicc.
[i 0705 15:10:18.680663 16 cuda_flags.cc:49] CUDA enabled.
2024-07-05 15:10:18,788 test INFO: Using 1 GPUs
2024-07-05 15:10:18,788 test INFO: Namespace(att_weight=0.01, aux=True, aux_weight=0.4, backbone='resnet50', base_size=512, batch_size=2, best_recode={'epoch': -1, 'mean_iu': 0}, crop_size=384, dataset='night', date_str='2024_07_05_15_10_18', device='cuda', distributed=False, edge_weight=0.01, epochs=260, flip=False, joint_edgeseg_loss=False, jpu=False, l2_weight=0, last_recode={}, local_rank=0, log_dir='../runs/logs/', log_iter=20, lr=0.005, manual_seed=40171, model='fdlnet', momentum=0.9, no_cuda=False, num_gpus=1, resume=None, save_dir='../runs/ckpt', save_epoch=20, seg_weight=1.0, skip_val=False, start_epoch=0, use_ohem=False, val_epoch=1, warmup_factor=0.3333333333333333, warmup_iters=0, warmup_method='linear', weight_decay=0.0005, workers=12)
Found 2998 images in the folder ../../datasets/night/images/train
Found 1299 images in the folder ../../datasets/night/images/val
[w 0705 15:10:19.370889 16 nn.py:2280]  The `Parameter` interface isn't needed in Jittor, this interface
does nothings and it is just used for compatible.

A Jittor Var is a Parameter
when it is a member of Module, if you don't want a Jittor
Var menber is treated as a Parameter, just name it startswith
underscore `_`.

2024-07-05 15:10:19,373 test INFO: Start training, Total Epochs: 260 = Total Iterations 389740
type of threshold_index: <class 'jittor.jittor_core.Var'>, shape of threshold_index: [1,]
type of threshold_index: <class 'jittor.jittor_core.Var'>, shape of threshold_index: [1,]
type of threshold_index: <class 'jittor.jittor_core.Var'>, shape of threshold_index: [1,]

Compiling Operators(1/1) used: 2.96s eta:    0s

Compiling Operators(1/1) used: 2.95s eta:    0s
Traceback (most recent call last):
  File "train_edge.py", line 506, in <module>
    trainer.train_edge()
  File "train_edge.py", line 290, in train_edge
    self.optimizer.step(main_loss)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 305, in step
    self.pre_step(loss, retain_graph=False)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 220, in pre_step
    self.backward(loss, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/optim.py", line 173, in backward
    grads = jt.grad(loss, params_has_grad, retain_graph)
  File "/home/ubuntu/hdd2/llf/miniconda3/envs/fdlnet_j/lib/python3.8/site-packages/jittor/__init__.py", line 445, in grad
    return core.grad(loss, targets, retain_graph)
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.grad)).

Types of your inputs are:
 self   = module,
 args   = (Var, list, bool, ),

The function declarations are:
 vector<VarHolder*> _grad(VarHolder* loss, const vector<VarHolder*>& targets, bool retain_graph=true)

Failed reason:[f 0705 15:10:28.107652 16 cublas_batched_matmul_op.cc:34] Check failed: a->dtype().dsize() == b->dtype().dsize()  Something wrong... Could you please report this issue?
 type of two inputs should be the same
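
The failed reason points at a dtype mismatch inside a batched matmul reached during the backward pass, not at jt.grad's argument types themselves. A minimal sketch that may reproduce the same check failure on a CUDA build (shapes and names are illustrative, not taken from the reporter's model):

import jittor as jt
jt.flags.use_cuda = 1

a = jt.random((2, 3, 4)).float32()  # float32 operand
b = jt.random((2, 4, 5)).float64()  # float64 operand
c = jt.matmul(a, b)                 # batched matmul with mixed dtypes
c.sync()  # may fail: Check failed: a->dtype().dsize() == b->dtype().dsize()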

Minimal Reproduce

for iteration, (images, targets, edge, _) in enumerate(self.train_dataloader):
    batch_pixel_size = images.size(0) * images.size(2) * images.size(3)

    # print(images.shape, targets.shape)
    iteration = iteration + 1

    main_loss = None
    loss_dict = self.model(images, gts=(targets, edge))

    if args.seg_weight > 0:
        log_seg_loss = loss_dict['seg_loss'].mean().clone().detach()
        train_seg_loss.update(log_seg_loss.item(), batch_pixel_size)
        main_loss = loss_dict['seg_loss']

    if args.aux_weight > 0:
        log_aux_loss = loss_dict['aux_loss'].mean().clone().detach()
        train_aux_loss.update(log_aux_loss.item(), batch_pixel_size)
        main_loss += loss_dict['aux_loss']

    if args.att_weight > 0:
        log_att_loss = loss_dict['att_loss'].mean().clone().detach()
        train_att_loss.update(log_att_loss.item(), batch_pixel_size)
        main_loss += loss_dict['att_loss']

    main_loss = main_loss.mean()
    log_main_loss = main_loss.clone().detach()

    train_main_loss.update(log_main_loss.item(), batch_pixel_size)

    self.optimizer.step(main_loss)  # RuntimeError raised here
LDYang694 commented 1 month ago

Somewhere in the computation float64 and float32 are being mixed. A likely cause is converting a NumPy array into a Jittor Var, since NumPy initializes arrays as float64 by default.
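
For example, a minimal sketch of the mix-up being described, and the explicit NumPy-side cast that avoids it (illustrative names only):

import numpy as np
import jittor as jt

x = np.random.rand(2, 3)
print(x.dtype)                      # float64: NumPy's default for floats

v = jt.array(x.astype(np.float32))  # cast explicitly before building the Var
print(v.dtype)                      # float32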

linengcs commented 1 month ago

I made that adjustment and stepped through the whole run checking that every dtype is float32, but the same error is still raised.

linengcs commented 1 month ago

Solved: change self.optimizer.step(main_loss) to:

main_loss_value = main_loss.item()
self.optimizer.step(main_loss_value)

This is because optimizer.step() expects a scalar (a single numeric value) as the loss, not a Var.
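
Applied to the training loop from the minimal reproduce above, the reporter's change corresponds to replacing the last lines of the loop (a sketch; main_loss.item() forces evaluation of the lazy graph and returns a plain Python float):

main_loss = main_loss.mean()
main_loss_value = main_loss.item()    # evaluate, get a Python float
self.optimizer.step(main_loss_value)  # pass the scalar instead of the Var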

One more note: I spent a week chasing this bug with GPT-4's help. It touched all kinds of angles, yet every answer was repetitive and I felt stuck in a loop, so I decided to try another model for fresh ideas. I subscribed to Gemini Pro, pasted the whole issue in, and Gemini solved it on the first try.