PaddlePaddle / Paddle

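PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)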
http://www.paddlepaddle.org/
Apache License 2.0

Error when running model conversion in PaddleSeg-release-2.9/contrib/MedicalSeg, MSD-brain #64261

Open kulongwei opened 3 months ago

kulongwei commented 3 months ago

Please ask your question

Following the data on the official website, I set up the data and checked the environment step by step, but running it appears to produce an error. In PaddleSeg-release-2.9/contrib/MedicalSeg I ran:

python train.py --config configs/msd_brain_seg/unetr_msd_brain_seg_1e-4.yml --save_dir save_dir --save_interval 500 --log_iters 100 --num_workers 6 --do_eval --use_vdl --keep_checkpoint_max 5 --seed 0 >> $save_dir/train.log

Note: paddle 2.4.2 runs normally on DCU hardware!

Error message:

which: no nvcc in (/opt/dtk-23.04/bin:/opt/dtk-23.04/llvm/bin:/opt/dtk-23.04/hip/bin:/opt/dtk-23.04/hip/bin/hipify:/data01/tools/miniconda3/envs/synapse/bin:/data01/tools/miniconda3/condabin:/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64:/root/bin)
/data01/ww/synapse/PaddleSeg-release-2.9/contrib/MedicalSeg/medicalseg/cvlibs/config.py:455: UserWarning: Warning: The data dir now is /data01/ww/synapse/PaddleSeg-release-2.9/contrib/MedicalSeg/data/, you should change the data_root in the global.yml if this directory didn't have enough space
  .format(absolute_data_dir))
W0514 00:37:00.813803 9466 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 90.0, Driver API Version: 50400.0, Runtime API Version: 50400.0
which: no nvcc in (/opt/dtk-23.04/bin:/opt/dtk-23.04/llvm/bin:/opt/dtk-23.04/hip/bin:/opt/dtk-23.04/hip/bin/hipify:/data01/tools/miniconda3/envs/synapse/bin:/data01/tools/miniconda3/condabin:/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64:/root/bin)
which: no nvcc in (/opt/dtk-23.04/bin:/opt/dtk-23.04/llvm/bin:/opt/dtk-23.04/hip/bin:/opt/dtk-23.04/hip/bin/hipify:/data01/tools/miniconda3/envs/synapse/bin:/data01/tools/miniconda3/condabin:/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64:/root/bin)
Aborted (core dumped)

Bobholamovic commented 3 months ago

Hi, is this the content of $save_dir/train.log or the terminal output?

kulongwei commented 3 months ago

"Hi, is this the content of $save_dir/train.log or the terminal output?"

I wrote that wrong; it should be the content. Have you managed to reproduce it?

Bobholamovic commented 3 months ago

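We currently don't have a DCU environment in which to reproduce the error, so we can probably only offer suggestions based on the error log you provide. Judging from the command, standard output is redirected to the log file but standard error is not, so the actual error message can usually be found in the terminal output.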
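To illustrate the redirection point, here is a minimal Python sketch. It does not run train.py; the child process is a hypothetical stand-in that writes one line to stdout and one to stderr, and the strings it prints are made up for illustration. It shows that capturing only stdout (as `>> train.log` does) leaves the error message out of the log, while merging stderr into stdout (as `2>&1` does) captures both.

```python
import subprocess
import sys

# Hypothetical stand-in for train.py: one normal log line on stdout,
# one error line on stderr. The message text is illustrative only.
CHILD = (
    "import sys; "
    "print('step 100 loss=0.5'); "
    "print('fatal: aborted', file=sys.stderr)"
)


def run_child(merge_stderr: bool):
    """Run the child and return (captured_log, leftover_stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # STDOUT merges stderr into the same stream, like `2>&1` in a shell.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.PIPE,
        text=True,
    )
    # result.stderr is None when stderr was merged into stdout.
    return result.stdout, result.stderr or ""


# Comparable to `cmd >> train.log`: the log misses the error line.
log_only_stdout, terminal = run_child(merge_stderr=False)

# Comparable to `cmd >> train.log 2>&1`: the log holds both streams.
log_both, _ = run_child(merge_stderr=True)

print("stdout-only log:", repr(log_only_stdout))
print("left on terminal:", repr(terminal))
print("merged log:", repr(log_both))
```

In shell terms, appending `2>&1` after the `>> $save_dir/train.log` redirection in the original command would send both streams to the log file. The relative order of the two lines in the merged output may vary because of buffering.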