Jittor / jittor

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
https://cg.cs.tsinghua.edu.cn/jittor/
Apache License 2.0

Error when using Jittor's AMP in a single-machine multi-GPU MPI setup #559

Closed yykmeng closed 2 months ago

yykmeng commented 2 months ago

Describe the bug

I used the officially provided MPI command, specifically CUDA_VISIBLE_DEVICES="0,1" mpirun -np 2 python myfile.py. The training loop is wrapped in jt.flag_scope(auto_mixed_precision_level=5), and this error occurred. Could it be caused by a precision mismatch between the different GPUs?
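
For context, a minimal sketch of the setup described above; the model, loader, and optimizer here are placeholders, not the actual project code:

# myfile.py -- sketch only; model/loader/optimizer are placeholders
import jittor as jt

jt.flags.use_cuda = 1  # GPU execution; Jittor picks up the MPI ranks automatically under mpirun

def run(model, loader, optimizer):
    # the whole training loop sits inside AMP level 5, as described above
    with jt.flag_scope(auto_mixed_precision_level=5):
        for img, target in loader:
            img = img.float_auto()      # let AMP decide float16/float32
            loss = model(img, target)
            optimizer.step(loss)        # Jittor optimizers accept the loss directly

# launched with:
#   CUDA_VISIBLE_DEVICES="0,1" mpirun -np 2 python myfile.py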

Full Log

[i 0615 12:22:27.345104 60 compiler.py:956] Jittor(1.3.8.5) src: /home2/ykm2023/applications/miniconda3/envs/dl/lib/python3.12/site-packages/jittor
[i 0615 12:22:27.346163 60 compiler.py:957] g++ at /usr/bin/g++(11.4.0)
[i 0615 12:22:27.346250 60 compiler.py:958] cache_path: /home2/ykm2023/.cache/jittor/jt1.3.8/g++11.4.0/py3.12.3/Linux-5.15.0-1x36/IntelRCoreTMi9xc2/default
[i 0615 12:22:27.347645 60 __init__.py:411] Found /usr/local/cuda/bin/nvcc(11.6.124) at /usr/local/cuda/bin/nvcc.
[i 0615 12:22:27.348825 60 __init__.py:411] Found addr2line(2.38) at /usr/bin/addr2line.
[i 0615 12:22:27.413825 60 compiler.py:1011] cuda key:cu11.6.124_sm_86
[i 0615 12:22:27.665375 60 __init__.py:227] Total mem: 125.48GB, using 16 procs for compiling.
[i 0615 12:22:27.799614 60 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0615 12:22:27.888131 60 init.cc:62] Found cuda archs: [86,]
[i 0615 12:22:27.897721 60 __init__.py:411] Found mpicc(4.1.2) at /usr/bin/mpicc.
[i 0615 12:22:27.945483 88 compiler.py:956] Jittor(1.3.8.5) src: /home2/ykm2023/applications/miniconda3/envs/dl/lib/python3.12/site-packages/jittor
[i 0615 12:22:27.947295 88 compiler.py:957] g++ at /usr/bin/g++(11.4.0)
[i 0615 12:22:27.947432 88 compiler.py:958] cache_path: /home2/ykm2023/.cache/jittor/jt1.3.8/g++11.4.0/py3.12.3/Linux-5.15.0-1x36/IntelRCoreTMi9xc2/default
[i 0615 12:22:27.949056 88 __init__.py:411] Found /usr/local/cuda/bin/nvcc(11.6.124) at /usr/local/cuda/bin/nvcc.
[i 0615 12:22:27.950293 88 __init__.py:411] Found addr2line(2.38) at /usr/bin/addr2line.
[i 0615 12:22:28.050534 88 compiler.py:1011] cuda key:cu11.6.124_sm_86
[i 0615 12:22:28.299450 88 __init__.py:227] Total mem: 125.48GB, using 16 procs for compiling.
[i 0615 12:22:28.429223 88 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0615 12:22:28.493880 88 init.cc:62] Found cuda archs: [86,]
[i 0615 12:22:28.506025 88 __init__.py:411] Found mpicc(4.1.2) at /usr/bin/mpicc.
[i 0615 12:22:30.680091 88 cuda_flags.cc:49] CUDA enabled.
==> Logs will be saved at log/train/2024-06-15-12-22
[i 0615 12:22:30.743729 60 cuda_flags.cc:49] CUDA enabled.
Train: [6/0]:   0%|                                     | 0/500 [01:37<?, ?it/s]
Traceback (most recent call last):
  File "/home2/ykm2023/projects/brain_anomaly_detection/train.py", line 19, in <module>
    trainer.train()
  File "/home2/ykm2023/projects/brain_anomaly_detection/lib/base_trainer.py", line 66, in train
    self.train_epoch()
  File "/home2/ykm2023/projects/brain_anomaly_detection/lib/trainer.py", line 87, in train_epoch
    jt.sync_all()
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.sync_all)).

Types of your inputs are:
 self   = module,
 args   = (),

The function declarations are:
 void sync_all(bool device_sync=false)

Failed reason:[f 0615 12:24:09.721441 60 parallel_compiler.cc:331] Error happend during compilation:
 [Error] source file location:/home2/ykm2023/.cache/jittor/jt1.3.8/g++11.4.0/py3.12.3/Linux-5.15.0-1x36/IntelRCoreTMi9xc2/default/cu11.6.124_sm_86/jit/__opkey0_broadcast_to__Tx_float32__DIM_3__BCAST_1__opkey1_broadcast_to__Tx_float16__DIM_3____hash_ade349e1bdaef940_op.cc
Compile fused operator(17/38)failed:[Op(13790:1:1:1:i1:o1:s0,broadcast_to->13791),Op(13788:1:1:1:i1:o1:s0,broadcast_to->13789),Op(13792:2:1:1:i2:o1:s0,binary.multiply->13793),Op(13794:1:1:1:i1:o1:s0,reduce.add->13795),]

Reason: [f 0615 12:23:14.860971 16:C14 cublas_matmul_op.cc:33] Check failed: a->dtype().dsize() == b->dtype().dsize()  Something wrong... Could you please report this issue?
 type of two inputs should be the same

Train: [6/0]:   0%|                                     | 0/500 [02:07<?, ?it/s]
Traceback (most recent call last):
  File "/home2/ykm2023/projects/brain_anomaly_detection/train.py", line 19, in <module>
    trainer.train()
  File "/home2/ykm2023/projects/brain_anomaly_detection/lib/base_trainer.py", line 66, in train
    self.train_epoch()
  File "/home2/ykm2023/projects/brain_anomaly_detection/lib/trainer.py", line 87, in train_epoch
    jt.sync_all()
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.sync_all)).

Types of your inputs are:
 self   = module,
 args   = (),

The function declarations are:
 void sync_all(bool device_sync=false)

Failed reason:[f 0615 12:24:39.786597 88 parallel_compiler.cc:331] Error happend during compilation:
 [Error] source file location:/home2/ykm2023/.cache/jittor/jt1.3.8/g++11.4.0/py3.12.3/Linux-5.15.0-1x36/IntelRCoreTMi9xc2/default/cu11.6.124_sm_86/jit/__opkey0_broadcast_to__Tx_float32__DIM_3__BCAST_1__opkey1_broadcast_to__Tx_float16__DIM_3____hash_ade349e1bdaef940_op.cc
Compile fused operator(17/38)failed:[Op(13790:1:1:1:i1:o1:s0,broadcast_to->13791),Op(13788:1:1:1:i1:o1:s0,broadcast_to->13789),Op(13792:2:1:1:i2:o1:s0,binary.multiply->13793),Op(13794:1:1:1:i1:o1:s0,reduce.add->13795),]

Reason: [f 0615 12:24:09.858298 16:C1 cublas_matmul_op.cc:33] Check failed: a->dtype().dsize() == b->dtype().dsize()  Something wrong... Could you please report this issue?
 type of two inputs should be the same

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61448,1],0]
  Exit code:    1
--------------------------------------------------------------------------
yykmeng commented 2 months ago

OK, maybe it isn't a conflict between multi-GPU and AMP after all; the same problem also appears after switching to a single-GPU process. (Screenshot omitted.) The code is as shown there, and the report says the error occurs at jt.sync_all().

MenghaoGuo commented 2 months ago

jt.sync_all is a synchronization function; the problem most likely occurred in an earlier step and only surfaced at this call, rather than in this step itself. The information provided so far is a bit limited. Could you share more of your code, and tell us what task you were running when the problem occurred?
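
A tiny sketch of what this means (shapes here are arbitrary): Jittor records ops lazily and only runs them when something forces execution, so an error from an earlier op can be reported at the sync call.

import jittor as jt
jt.flags.use_cuda = 1

x = jt.rand(2, 3)
y = (x * 2).sum()    # the op is recorded here, not necessarily executed yet
y.sync()             # pending ops for y are executed here
jt.sync_all(True)    # flushes everything and waits for the device, so an
                     # earlier bad op can surface at this call
# To locate the real source, rerun with JT_SYNC=1 (and trace_py_var=3) exported.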

yykmeng commented 2 months ago

jt.sync_all is a synchronization function; the problem most likely occurred in an earlier step and only surfaced at this call, rather than in this step itself. The information provided so far is a bit limited. Could you share more of your code, and tell us what task you were running when the problem occurred?

Sorry, I develop remotely and the server lost power and has not been restored yet, so I can't provide more code for the moment.

What I can tell you so far: I am working on a medical image processing research project. The data is four-dimensional (channel, slice, width, height), and the model uses the conv3d module.

Since compute is limited, I wanted to enable Jittor's AMP to see how much GPU memory it could save, and that is when this problem appeared.

One more thing: when I set auto_mixed_precision_level=3 there is no error and training runs normally. As soon as auto_mixed_precision_level>3, i.e. once float16 is introduced, the program fails. I also tried img.float_auto() etc. in the code provided above; the error message does report the dtype as float16, so the conversion itself is probably fine, but afterwards the check a->dtype().dsize() == b->dtype().dsize() fails.
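
A small sketch of the comparison just described; whether float_auto() yields float16 depends on the active level (float16 only appears above level 3, per the behaviour above):

import jittor as jt
jt.flags.use_cuda = 1

x = jt.rand(1, 4, 8, 8, 8)

with jt.flag_scope(auto_mixed_precision_level=3):
    print(x.float_auto().dtype)   # no float16 at this level; training runs normally

with jt.flag_scope(auto_mixed_precision_level=5):
    print(x.float_auto().dtype)   # float16 from here on; the downstream matmul/conv3d
                                  # then hits the dsize() check failure shown above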


The server is back up and I tried running again; the result is as follows:

Traceback (most recent call last):
  File "/home2/ykm2023/projects/brain_anomaly_detection/train.py", line 20, in <module>
    trainer.train()
  File "/home2/ykm2023/projects/brain_anomaly_detection/lib/base_trainer.py", line 66, in train
    self.train_epoch()
  File "/home2/ykm2023/projects/brain_anomaly_detection/lib/trainer.py", line 87, in train_epoch
    loss.sync()
RuntimeError: [f 0617 11:30:12.815932 20 executor.cc:686] 
Execute fused operator(4/326) failed. 
[JIT Source]: /home2/ykm2023/.cache/jittor/jt1.3.8/g++11.4.0/py3.12.3/Linux-5.15.0-1x36/IntelRCoreTMi9xc2/default/cu11.6.124_sm_86/jit/cudnn_conv3d__Tx_float16__Ty_float16__Tw_float16__JIT_1__JIT_cuda_1__index_t_int32_hash_a6c005a8160b7c80_op.cc 
[OP TYPE]: cudnn_conv3d 
[Input]: float16[1,4,160,192,128,], float16[32,4,3,3,3,]encoder.start_conv.weight, 
[Output]: float16[1,32,160,192,128,], 
[Async Backtrace]: not found, please set env JT_SYNC=1, trace_py_var=3 
[Reason]: [f 0617 11:30:12.815673 20 cudnn_conv3d__Tx_float16__Ty_float16__Tw_float16__JIT_1__JIT_cuda_1__index_t_int32_hash_a6c005a8160b7c80_op.cc:386] Check failed: best_algo_idx!=-1  Something wrong... Could you please report this issue?

**********
Async error was detected. To locate the async backtrace and get better error report, please rerun your code with two enviroment variables set:
>>> export JT_SYNC=1
>>> export trace_py_var=3

Some of the code is pasted below. Incidentally, my Python version is 3.12, and Jittor says JT_SYNC=1 and trace_py_var=3 are only supported up to Python 3.11.

# train.py (excerpt)
import argparse

import jittor as jt
import yaml
from easydict import EasyDict
from lib.trainer import Trainer  # project module, per the traceback paths above

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str)
    args = parser.parse_args()

    with open(args.config) as f:
        opt = EasyDict(yaml.safe_load(f))

    with jt.flag_scope(auto_mixed_precision_level=4):
        trainer = Trainer(opt)
        trainer.train()


# lib/trainer.py (excerpt): Trainer.train_epoch
    def train_epoch(self):
        self.model.train()
        with tqdm(total=len(self.train_loader), ncols=80) as bar:
            for i, data in enumerate(self.train_loader):
                bar.set_description(f"Train: [{self.epoch}/{self.global_step}]")
                img, target = data
                img = img.float_auto()        # cast according to the active AMP level
                target = target.float_auto()

                self.optimizer.zero_grad()
                output, recon_input, mu, varlog = self.model(img)

                loss, _ = self.losser(output, recon_input, mu, varlog, img, target)
                loss.sync()  # forces execution; this is where the async error surfaces

                self.loss = loss.item()
                if jt.rank == 0:
                    jt.fetch(loss, lambda loss: self.logger.summary(self.global_step, log_dict={"train_loss": loss.item()}))

                # jt.sync_all(True)
                if self.global_step % 5 == 0:
                    loss.sync()
                if jt.in_mpi:
                    loss = loss.mpi_all_reduce('mean')

                self.optimizer.backward(loss)
                self.optimizer.step()

                if jt.rank == 0:
                    self.logger.summary(
                        self.global_step, log_dict={"train_loss": self.loss}
                    )

                self.global_step += 1

                bar.set_postfix(loss=f"{self.loss:.3}")
                bar.update(1)
MenghaoGuo commented 2 months ago

At the moment some operators cannot be converted to FP16 directly. We will fix this bug soon, thanks.
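
In the meantime, a possible stop-gap (not an official workaround, just following the observation earlier in this thread that level 3 trains normally) is to stay at an AMP level that does not introduce float16:

import jittor as jt

# Trainer and opt as in the code posted above; level 3 avoids float16 entirely
with jt.flag_scope(auto_mixed_precision_level=3):
    trainer = Trainer(opt)
    trainer.train()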

yykmeng commented 2 months ago

At the moment some operators cannot be converted to FP16 directly. We will fix this bug soon, thanks.

So this problem is caused by functionality that is not yet complete.

Thank you for the explanation; I will close this issue.