PaddlePaddle / Paddle

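PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle 『飞桨』 core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)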
http://www.paddlepaddle.org/
Apache License 2.0

Unknown error, the error message is unclear #55666

Closed soyons closed 1 year ago

soyons commented 1 year ago

Issue description (Please describe your issue)

Python 3.7.0
GPU: V100-32G
NVIDIA-SMI 460.32.03, Driver Version: 460.32.03, CUDA Version: 11.2
paddlepaddle-gpu 2.4.2

Mon Jul 24 19:59:20 2023[1,0]:Total params: 1172702
Mon Jul 24 19:59:20 2023[1,0]:Trainable params: 166536
Mon Jul 24 19:59:20 2023[1,0]:Non-trainable params: 1006166
Mon Jul 24 19:59:20 2023[1,0]:Training with custom optimizer
Mon Jul 24 19:59:26 2023[1,0]:/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.float32, the right dtype will convert to paddle.int64
Mon Jul 24 19:59:26 2023[1,0]: .format(lhs_dtype, rhs_dtype, lhs_dtype))
Mon Jul 24 19:59:26 2023[1,0]:/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.int64, the right dtype will convert to paddle.float32
Mon Jul 24 19:59:26 2023[1,0]: .format(lhs_dtype, rhs_dtype, lhs_dtype))
Mon Jul 24 19:59:26 2023[1,0]:/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int32, but right dtype is paddle.int64, the right dtype will convert to paddle.int32
Mon Jul 24 19:59:26 2023[1,0]: .format(lhs_dtype, rhs_dtype, lhs_dtype))
Mon Jul 24 19:59:28 2023[1,0]:epoch:0,step:10,rank:0,loss:0.0238,lr:0.000010
Mon Jul 24 19:59:28 2023[1,0]: decision_loss:0.023788, decision: 42, point: 50
Mon Jul 24 19:59:28 2023[1,0]: binary_gaussian:-0.8112,kf_lane_loss:2.6619, anchor_loss:0.0000,point_loss:58.3817
Mon Jul 24 19:59:28 2023[1,0]: cross_num: 16.0, cross_loss:0.6736
Mon Jul 24 19:59:28 2023[1,0]: close_l_num: 8.0, close_l_loss:0.5307
Mon Jul 24 19:59:28 2023[1,0]: kf_top1_lane_accuracy:0.6600,kf_top3_lane_accuracy:0.8800
Mon Jul 24 19:59:32 2023[1,0]:epoch:0,step:20,rank:0,loss:0.0104,lr:0.000010
Mon Jul 24 19:59:32 2023[1,0]: decision_loss:0.010432, decision: 36, point: 93
Mon Jul 24 19:59:32 2023[1,0]: binary_gaussian:0.4726,kf_lane_loss:6.7281, anchor_loss:0.0000,point_loss:41.7945
Mon Jul 24 19:59:32 2023[1,0]: cross_num: 16.0, cross_loss:0.8788
Mon Jul 24 19:59:32 2023[1,0]: close_l_num: 20.0, close_l_loss:1.7077
Mon Jul 24 19:59:32 2023[1,0]: kf_top1_lane_accuracy:0.7527,kf_top3_lane_accuracy:0.9032
Mon Jul 24 19:59:34 2023[1,0]:epoch:0,step:30,rank:0,loss:0.0274,lr:0.000010
Mon Jul 24 19:59:34 2023[1,0]: decision_loss:0.027361, decision: 40, point: 98
Mon Jul 24 19:59:34 2023[1,0]: binary_gaussian:0.2453,kf_lane_loss:5.5190, anchor_loss:0.0000,point_loss:29.8723
Mon Jul 24 19:59:34 2023[1,0]: cross_num: 12.0, cross_loss:0.6826
Mon Jul 24 19:59:34 2023[1,0]: close_l_num: 11.0, close_l_loss:0.6953
Mon Jul 24 19:59:34 2023[1,0]: kf_top1_lane_accuracy:0.7041,kf_top3_lane_accuracy:0.8980
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:34 value:-nan
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:0 value:0.498221
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:1 value:0.024911
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:2 value:0.498221
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:26 value:-nan
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:31 value:-nan
Mon Jul 24 19:59:36 2023[1,0]:In block 0, there has 3,0,36 nan,inf,num
Mon Jul 24 19:59:36 2023[1,0]:Error: /paddle/paddle/fluid/framework/details/nan_inf_utils_detail.cu:105 Assertion false failed. ===ERROR: in [op=gather] [tensor=] find nan or inf===
Mon Jul 24 19:59:36 2023[1,0]:terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
Mon Jul 24 19:59:36 2023[1,0]: what(): (External) CUDA error(719), unspecified launch failure.
Mon Jul 24 19:59:36 2023[1,0]: [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc:80)
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:--------------------------------------
Mon Jul 24 19:59:36 2023[1,0]:C++ Traceback (most recent call last):
Mon Jul 24 19:59:36 2023[1,0]:--------------------------------------
Mon Jul 24 19:59:36 2023[1,0]:0 egr::Backward(std::vector<paddle::experimental::Tensor, std::allocator > const&, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool)
Mon Jul 24 19:59:36 2023[1,0]:1 egr::RunBackward(std::vector<paddle::experimental::Tensor, std::allocator > const&, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool, bool, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool, std::vector<paddle::experimental::Tensor, std::allocator > const&)
Mon Jul 24 19:59:36 2023[1,0]:2 egr::GradNodeAccumulation::operator()(paddle::small_vector<std::vector<paddle::experimental::Tensor, std::allocator >, 15u>&, bool, bool)
Mon Jul 24 19:59:36 2023[1,0]:3 egr::GradNodeAccumulation::ApplyReduceHooks()
Mon Jul 24 19:59:36 2023[1,0]:4 paddle::distributed::EagerReducer::AddDistHook(unsigned long)
Mon Jul 24 19:59:36 2023[1,0]:5 paddle::distributed::EagerReducer::MarkVarReady(unsigned long, bool)
Mon Jul 24 19:59:36 2023[1,0]:6 paddle::distributed::EagerReducer::FinalizeBackward()
Mon Jul 24 19:59:36 2023[1,0]:7 paddle::experimental::Tensor::reset()
Mon Jul 24 19:59:36 2023[1,0]:8 std::_Sp_counted_ptr_inplace<phi::DenseTensor, std::allocator, (gnu_cxx::_Lock_policy)2>::_M_dispose()
Mon Jul 24 19:59:36 2023[1,0]:9 std::_Sp_counted_deleter<phi::Allocation, std::function<void (phi::Allocation)>, std::allocator, (gnu_cxx::_Lock_policy)2>::_M_dispose()
Mon Jul 24 19:59:36 2023[1,0]:10 paddle::memory::allocation::StatAllocator::FreeImpl(phi::Allocation)
Mon Jul 24 19:59:36 2023[1,0]:11 paddle::memory::allocation::RetryAllocator::FreeImpl(phi::Allocation)
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:----------------------
Mon Jul 24 19:59:36 2023[1,0]:Error Message Summary:
Mon Jul 24 19:59:36 2023[1,0]:----------------------
Mon Jul 24 19:59:36 2023[1,0]:FatalError: Process abort signal is detected by the operating system.
Mon Jul 24 19:59:36 2023[1,0]: [TimeInfo: Aborted at 1690199976 (unix time) try "date -d @1690199976" if you are using GNU date ]
Mon Jul 24 19:59:36 2023[1,0]: [SignalInfo: SIGABRT (@0x4a22) received by PID 18978 (TID 0x7f1d3be0d640) from PID 18978 ]
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:37 2023[1,0]:LAUNCH INFO 2023-07-24 19:59:37,827 Pod failed
Mon Jul 24 19:59:37 2023[1,0]:LAUNCH ERROR 2023-07-24 19:59:37,828 Container failed !!!
Mon Jul 24 19:59:37 2023[1,0]:Container rank 0 status failed cmd ['/usr/local/bin/python3', '-u', 'multipath/predecision_main.py', '--config', 'multipath/conf/predecision.conf', '--data_path', 'afs/jn_bicycle_rule/junction_gostraight'] code -6 log log/workerlog.0
Mon Jul 24 19:59:37 2023[1,0]:env {'PYTHONPATH': '.', 'CPLUS_INCLUDE_PATH': '/usr/local/python2.7.15/include/python2.7:/usr/local/python3.5.1/include/python3.5:', 'PSERVERS_NUM': '', 'CUDNN_VERSION': '7.6.5.32', 'SYS_REAL_DOWNLOAD': '1', 'KUBE_DEPENDENCY': '/home/kubernetes/dependency', 'CGPU1_SHAREMODE': '7', 'SYS_OUTPUT_PATH': '/user/ad-pnc/prediction/train_result//liqinghai01/job-0bb64be60f44d054', 'TRAINER_GPU_CARD_COUNT': '2', 'PMIX_ID': '3340042241.0', 'OMPI_COMM_WORLD_RANK': '0', 'PMIX_NAMESPACE': '3340042241', 'PSERVER_IP_PORT_LIST': '', 'PREDICT_DATA_ID': '', 'GPU_RATIO': '1.0', 'SYS_VOLUME_MOUNT': '/root/paddlejob/workspace', 'PSERVER_MODEL_DIR': '', 'TERM_PROGRAM': 'vscode', 'HOSTNAME': 'yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com', 'PADDLE_USE_GPU': '1', 'HOSTNAME_TO_IP': '1', 'version': '2.7.15', 'NVIDIA_REQUIRE_CUDA': 'cuda>=10.2 brand=tesla,driver>=396,driver<397
brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441', 'FS_UGI': 'ad-pnc-mix,ad-pnc-mix_passw0rd', 'PSERVER_IP_LIST': '', 'TERM': 'xterm-256color', 'SYS_JOB_VERSION': 'paddle-v2.4.0', 'KUBERNETES_PORT': 'tcp://11.1.0.1:443', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'CGPU1_COMPUTE_LIMIT': '100', 'FLAGS_call_stack_level': '1', 'OMPI_MCA_pmix': '^s1,s2,cray,isolated', 'OMPI_MCA_orte_ess_num_procs': '1', 'NCCL_SOCKET_IFNAME': 'xgbe0', 'COMBINED_OUTPUT_PATH': '/user/ad-pnc/prediction/train_result//liqinghai01/job-0bb64be60f44d054/', 'SYS_IS_ABACUS_CLUSTER': '0', 'HADOOP_HOME': '/root/paddlejob/hadoop-client/hadoop', 'PSERVERS': '', 'DICT_ID': '', 'TRIANER_IP_LIST': '10.127.19.149', 'SYS_DOWNLOAD_THREAD_NUM': '15', 'AIFLOW_URL': 'paddlecloud.baidu-int.com', 'CGPU0_SHAREMODE': '7', 'OMPI_MCA_ess_base_vpid': '0', 'IREPO_UPLOAD_MODEL_REPONAME': '', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'MPI_ON_K8S': '1', 'OMPI_MCA_ess_base_jobid': '3340042241', 'TERM_PROGRAM_VERSION': '1.75.1', 'LD_PRELOAD': '/usr/lib/x86_64-linux-gnu/coreutils/libstdbuf.so', 'IS_WHITELIST_JOB': '0', 'KUBERNETES_SERVICE_PORT': '443', 'TRAINER_INSTANCES': '10.127.19.149', 'PMIX_RANK': '0', 'TRAINER_PORTS_NUM': '2', 'OLDPWD': '/root/paddlejob', 'OMPI_MCA_orte_launch': '1', 'PADDLE_TRAINERS_NUM': '2', 'TRAINER_IP_PORT_LIST': '10.127.19.149:35368,10.127.19.149:35369', 'SYS_PRIVILEGE_SK': '2c49b0baef1b5467a2353b2821782f5b', 'FAULT_TOLERANT': 'False', 'K8S_ENTRY_FILE_NAME': 'trainer.py', 'CLUSTER_NAME': 'v100-32-0-cluster', 'K8S_ENTRY_CMD': 'python trainer.py', 'OMPI_MCA_orte_num_nodes': '1', 'NCCL_DEBUG_FILE': '/root/paddlejob/workspace/log/nccl.%p.log', 'VDL_LOG_PATH': 'afs://ad-pnc-mix:ad-pnc-mix_passw0rd@feilian.afs.baidu.com:9902/user/ad-pnc/prediction/train_result//liqinghai01/job-0bb64be60f44d054/visualdl_log_dir', 'KUBERNETES_SERVICE_HOST': '11.1.0.1', 'FLAGS_check_nan_inf': 'True', 'OMPI_COMM_WORLD_LOCAL_RANK': '0', 'OMPI_MCA_orte_hnp_uri': 
'3340042240.0;tcp://10.127.19.149,192.168.5.1:10114', 'PADDLE_TRAINER_ID': '0', 'SYS_VOLUME_PATH': '/home/work/containers/413a96eb-0844-451a-b2b4-fd2004bbe1be', 'OMPI_ARGV': 'submit/job.sh', 'OMPI_MCA_initial_wdir': '/root/paddlejob/workspace/env_run', 'TRAIN_DATA_ID': '', 'LC_ALL': 'en_US.UTF-8', 'SYS_USER_NAME': 'liqinghai01', 'SYS_USE_HADOOP_VFS': 'False', 'SYS_SERVICE_PORT': '8676', 'OUTPUT_PATH': '/user/ad-pnc/prediction/train_result/', 'WEBIDE_PLATFORM': 'PaddleCloud_Job', 'PADDLE_PORTS_NUM': '26', 'CODE_URI': '/user/ad-pnc/prediction/train_result//paddlecloud_code/junction_noturn_paddle_20230724193052516119.tar.gz', 'AFS_REMOTE_MOUNT_POINT': '/user/ad-pnc/prediction/', 'DISTRIBUTE_JOB_TYPE': 'PSERVER', 'ETCD_IMAGE': 'registry.baidu.com/bml/etcd:v3.2.1', 'AIFLOW_PLAT_NAME': 'pdc_backend', 'POD_0_PORTS': '35368,35369', 'RUNTIME_WORKDIR': '/root/paddlejob/workspace/env_run', 'JRE_HOME': '/jre', 'SYS_USER_ID': 'c1c8b81c-5ffeMon Jul 24 19:59:37 2023[1,0]:-5e1e-9874-d43de34cb602', 'OMPI_MCA_orte_timestamp_output': '1', 'IREPO_INIT_MODEL_SPACENAME': 'paddlecloud_space', 'NCCL_IB_TIMEOUT': '22', 'NCCL_IB_GID_INDEX': '3', 'TRAINER_MEMORY_LIMITS': '110Gi', 'NVIDIA_VISIBLE_DEVICES': 'GPU-599d41b0-3420-628e-57d2-e4a4034efb9c,GPU-6379c489-466f-2081-3c07-16bbe5ac5451', 'FAULT_TOLERANCE_ENV_PATH': '/root/paddlejob/fault_tolerance.env', 'LD_LIBRARY_PATH': 
'/opt/_internal/cpython-3.7.0/lib:/opt/conda/envs/py36/lib:/usr/local/lib:/usr/local/python2.7.15/lib:/opt/_internal/cpython-2.7.11-ucs4/lib:/opt/_internal/cpython-2.7.15-ucs4/lib:/opt/conda/envs/py27/lib:/opt/OpenBLAS:/:/opt/hadoop-client/hadoop/../java6/jre/lib/amd64:/opt/hadoop-client/hadoop/../java6/jre/lib/amd64/native_threads:/opt/hadoop-client/hadoop/../java6/jre/lib/amd64/server:/opt/hadoop-client/hadoop/lib/native/Linux-amd64-64:/usr/local/x86_64-pc-linux-gnu/lib:/home/opt/nvidia_lib:/usr/local/cuda/lib64:/usr/lib64:/usr/local/lib:/nccl/lib:/home/work/cudnn/cudnn_v7/cuda/lib64:/home/work/cudnn/cudnn_v6/cuda/lib64:/home/work/cudnn/cudnn_v5/cuda/lib64:/home/work/cuda-9.0/lib64:/home/work/cuda-8.0/lib64:/usr/lib64/mlnx_ofed/valgrind:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/python2.7/site-packages/paddle/libs:$LD_LIBRARY_PATH:/root/paddlejob/hadoop-client/hadoop/libdfs/:/root/paddlejob/hadoop-client/hadoop/../java6/jre/lib/amd64:/root/paddlejob/hadoop-client/hadoop/../java6/jre/lib/amd64/native_threads:/root/paddlejob/hadoop-client/hadoop/../java6/jre/lib/amd64/server:/root/paddlejob/hadoop-client/hadoop/lib/native/Linux-amd64-64', 'OMPI_UNIVERSE_SIZE': '1', 'PSERVER_PORTS': '', 'PREDICT_DATA_PATH': '', 'JOB_CATEGORY': 'general', 'NAMESPACE': 'group-1f8cc06f-1968-dbae-e20c-7fa9216c1971', 'TRAIN_LOG_PATH': '/root/paddlejob/workspace/log/run.log', 'OMPI_MCA_mpi_yield_when_idle': '0', 'GPUTRAINER_ENDPOINTS': '10.127.19.149:35368,10.127.19.149:35369', 'PADDLE_CURRENT_ENDPOINT': '10.127.19.149:41404', 'WITH_AVX': 'ON', 'NVIDIA_VISIBLE_GPUS_UUID': 'GPU-599d41b0-3420-628e-57d2-e4a4034efb9c,GPU-6379c489-466f-2081-3c07-16bbe5ac5451', 'SYS_TMP_FILEPATH': '/user/paddlecloud/paddle-platform/buffer', 'SYS_API_HOST': 'paddlecloud.baidu-int.com', 'RESERVED_PORT_NUM': '3', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'OMPI_COMMAND': 'sh', 'OMPI_FILE_LOCATION': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817/0/0', 'OMPI_APP_CTX_NUM_PROCS': '1', 
'SYS_URL_PREFIX': 'yq01-rdqa-bml27.yq01.baidu.com', 'DFS_USE_NATIVE_API': '0', 'AIFLOW_PLAT_TOKEN': '34bc918c410b1c9bae760c97ad1796fc', 'PADDLE_JOB_DIR': '/root/paddlejob', 'PADDLE_TRAINER_ENDPOINTS': '10.127.19.149:41404,10.127.19.149:41405', 'IREPO_PLAT_TOKEN': 'c8571e89-e7be-4aa7-9c9a-ac1889e60e92', 'SYS_INFLUX_DB_URL': 'http://paddlecloud.baidu-int.com:80', 'PMIX_PTL_MODULE': 'tcp,usock', '_STDBUF_O': 'L', 'TRAINER_INSTANCES_NUM': '1', 'TRAININGJOB_NAME': 'job-0bb64be60f44d054', 'MASTERMEMORY': '300Mi', 'CGPU_COUNT': '2', 'PMIX_SERVER_URI21': '3340042240.0;tcp4://127.0.0.1:10000', 'TRAINER_MEMORY_REQUESTS': '110Gi', 'SYS_API_PORT': '80', 'K8S_TMP_DIR_NAME': 'tmp', 'PSERVER_LOADSAVE_PARAMETERS_IN_PSERVER': '0', 'PADDLE_LOCAL_SAVE_DIR': './output', 'OMPI_MCA_orte_precondition_transports': 'df803d45ff64c5fb-ea9fa3e51b7ad1b5', 'TEST_DATA_ID': '', 'FLUME_SERVER_PORT': '35371', 'SYS_LOCAL_SAVE_DIR': './output', 'IDE_WORKDIR': '/home/work/mnt/project', 'NVIDIA_TOOLS': '/home/opt/cuda_tools', 'TRAINING_ROLE': 'TRAINER', 'IS_FILEBEAT': '1', 'HUB_HOME': '/root/paddlejob/workspace', 'FLUME_HOME': '/root/paddlejob/flume-1.8.0', 'PATH': '/home/opt/cuda_tools/:/opt/_internal/cpython-3.7.0/bin:/opt/conda/envs/py36/bin:/usr/local/bin:/usr/bin:/root/paddlejob/hadoop-client/hadoop/bin:/usr/local/bin:/usr/local/openmpi-3.1.0/bin:/home/cmake-3.16.0-Linux-x86_64/bin:/home/opt/cuda_tools:/root/paddlejob/jdk-1.8.0/bin:/root/paddlejob/flume-1.8.0/bin:/root/paddlejob/hadoop-client/hadoop/bin:/usr/local/gcc-8.2/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'VTFS_VERSION': '', 'NCCL_DEBUG_SUBSYS': 'INIT', 'STORAGE_TYPE': 'afs', 'OMPMon Jul 24 19:59:37 2023[1,0]:I_MCA_orte_jobfam_session_dir': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817', 'OMPI_MCA_orte_tmpdir_base': '/tmp', 'OMPI_COMM_WORLD_LOCAL_SIZE': '1', 'IPATH_NO_BACKTRACE': '1', 'SYS_JOB_NAME': 'ftn-jn-vru-20230724-1930', 'SSHD_PORT': 
'35370', 'PSERVER_MEMORY_LIMITS': '0', 'END_POINT': 'client', 'SYS_JOB_ID': 'job-0bb64be60f44d054', 'AFS_LOCAL_MOUNT_POINT': '/root/paddlejob/workspace/env_run/afs/', 'PWD': '/root/paddlejob', 'OMPI_MCA_orte_tag_output': '1', 'VERSION_LIST_SUPPORT_PY3': '', 'PADDLE_PORT': '35368', 'IS_STANDALONE': '1', 'PSERVER_MEMORY_REQUESTS': '0', 'JAVA_HOME': '/root/paddlejob/hadoop-client/hadoop/../java6', 'OMPI_COMM_WORLD_SIZE': '1', 'TEST_DATA_PATH': '', 'SYS_FS_UGI': 'paddlecloud,pdcpdc2020', 'PSERVER_INSTANCES': '', 'LANG': 'en_US.UTF-8', 'PADDLE_CLUSTER_TRAIN': 'True', 'TRAININGJOB_REPLICA_NAME': 'trainer', 'TRAIN_WORKSPACE': '/root/paddlejob/workspace', 'FS_NAME': 'afs://feilian.afs.baidu.com:9902', 'CLUSTER_TYPE': 'k8s-new', 'SYS_PYTHON_CMD': 'python', 'OMPI_FIRST_RANKS': '0', 'TEACHER_JOB_ID': '', 'S_COLORS': 'auto', 'TZ': 'Asia/Shanghai', 'PADDLE_JOB_NAME': 'job-0bb64be60f44d054', 'TRAININGJOB_PORTS': '35368,35369,35370,35371,35372', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'CUDA_PKG_VERSION': '10-2=10.2.89-1', 'SYS_FS_NAME': 'afs://baihua.afs.baidu.com:9902', 'HOME_WORK_DIR': '/root/paddlejob', 'PADDLE_IS_LOCAL': '1', 'IREPO_PLAT': 'paddle-cloud', 'POD_0_IP': '10.127.19.149', 'POD_INDEX': '0', 'CUDA_VERSION': '10.2.89', 'SYS_AFS_MOUNT': 'true', 'MASTERCPU': '1', 'PADDLE_WORKERS_IP_PORT_LIST': '10.127.19.149:35368,10.127.19.149:35369', 'PMIX_GDS_MODULE': 'ds12,hash', 'DEV_DATA_PATH': '', 'JAVA_TOOL_OPTIONS': '-Djava.compiler=NONE', 'WEBIDE_PLS_PORT': '35372', 'PADDLE_TRAINER_COUNT': '2', 'SYS_SUBDIR_LEVEL': '1', 'OMPI_MCA_orte_top_session_dir': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0', 'OMPI_MCA_orte_app_num': '0', 'WEBIDE_USERID': 'c1c8b81c-5ffe-5e1e-9874-d43de34cb602', 'VSCODE_PROXY_URI': 'http://10.127.19.149:8080/proxy/{{port}}/', 'SYS_TEST_DOWNLOAD_DESTINATION': './', 'PMIX_DSTORE_ESH_BASE_PATH': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817/pmix_dstor_18817', 'PADDLE_VERSION': '', 'VDL_USE_NATIVE_API': '1', 
'USE_PFS': 'false', 'UPLOAD_STATUS_SK': '2401d98a1bc65a35936d6bc0aef010f0', 'TRAININGJOB_REPLICA_TYPE': 'worker', 'PSERVER_PORTS_NUM': '0', 'HOME': '/root', 'SHLVL': '7', 'VTFS_REPO': '', 'VSCODE_GIT_ASKPASS_MAIN': '/root/code-server/lib/vscode/extensions/git/dist/askpass-main.js', 'LANGUAGE': 'en_US.UTF-8', 'GOROOT': '/usr/local/go', 'PSERVER_NUM_THREADS': '1', 'OPENMPI_HOME': '/usr/local/openmpi-3.1.0', 'PMIX_SECURITY_MODE': 'native,none', 'NCCL_IB_DISABLE': '1', 'SYS_DOWNLOAD_DESTINATION': './', 'IS_CODELAB_ENABLED': '1', 'DFS_AGENT_PORT': '21270', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'OMPI_MCA_orte_ess_node_rank': '0', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'PSERVER_MODEL_PASS': '', 'PADDLE_TRAINERS': '10.127.19.149', 'SYS_TMP_MULTIFILE_DIR': 'env_run', 'NVIDIA_LIB': '/usr/local/nvidia/lib64', 'NCCL_VERSION': '2.7.8', 'FORCE_REUSE_OUTPUT_PATH': 'True', 'OMPI_MCA_ess': '^singleton', 'HADOOP_LIB_DIR': '/root/paddlejob/hadoop-client/hadoop/lib', 'IREPO_UPLOAD_MODEL_SPACENAME': 'paddlecloud_space', 'PMIX_BFROP_BUFFER_TYPE': 'PMIX_BFROP_BUFFER_NON_DESC', 'OMPI_MCA_shmem_RUNTIME_QUERY_hint': 'mmap', 'STDOUT_LOG_PATH': '/root/paddlejob/workspace/log/train.log', 'OMPI_MCA_hwloc_base_binding_policy': 'none', 'VSCODE_GIT_IPC_HANDLE': '/tmp/vscode-git-ae2eb349bb.sock', 'LC_CTYPE': 'C.UTF-8', 'SYS_GROUP_SIZE': '1', 'USE_ECCL': '0', 'USE_PYTHON3': '1', 'CLASSPATH': '/root/paddlejob/hadoop-client/hadoop/conf:/root/paddlejob/hadoop-client/hadoop:/root/paddlejob/hadoop-client/hadoop/hadoop-2-core.jar:/root/paddlejob/hadoop-client/hadoop/lib/abaci-core-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/ant-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/ant-launcher-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/apache-mime4j-0.6.jar:/root/paddlejob/hadoop-client/hadoop/lib/ark-1.3.25-api.jar:/root/paddlejob/hMon Jul 24 19:59:37 
2023[1,0]:adoop-client/hadoop/lib/asm-3.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/asm-tree-3.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/auth-client-1.1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/auth-common-1.1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-compiler-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-ipc-1.6.1-patched.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-maven-plugin-1.5.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/baas-1.1.9.jar:/root/paddlejob/hadoop-client/hadoop/lib/baidu-rpc-1.0.10.32842.jar:/root/paddlejob/hadoop-client/hadoop/lib/baidu-sos-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/bistreaming-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/bvar-trunk-SNAPSHOT.jar:/root/paddlejob/hadoop-client/hadoop/lib/cglib-nodep-2.2.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/classworlds-1.1-alpha-2.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-beanutils-1.8.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-cli-1.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-codec-1.6.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-collections-3.2.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-configuration-1.9.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-discovery-0.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-el-1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-fileupload-1.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-httpclient-3.0.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-io-2.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-lang-2.6.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-lang3-3.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-logging-1.0.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-logging-1.1.1-api.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-logging-api-1.0.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-math-1.1.jar:/root/pad
dlejob/hadoop-client/hadoop/lib/commons-net-1.4.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-pool2-2.4.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/com.springsource.org.apache.commons.lang-2.5.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/contiperf-1.06.jar:/root/paddlejob/hadoop-client/hadoop/lib/core-3.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/derby-10.10.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/derbyclient-10.10.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/dom4j-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/easymock-3.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/examples-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/file-management-1.2.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/guava-14.0.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/hadoop-2-common-3.5.32.jar:/root/paddlejob/hadoop-client/hadoop/lib/hadoop-2-raid-2.0.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/hamcrest-core-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/hamcrest-library-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/hsqldb-1.8.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/httpclient-4.0.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/httpcore-4.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/httpmime-4.0.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/jackson-core-asl-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jackson-mapper-asl-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jakarta-oro-2.0.8.jar:/root/paddlejob/hadoop-client/hadoop/lib/jasper-compiler-5.5.23.jar:/root/paddlejob/hadoop-client/hadoop/lib/jasper-runtime-5.5.23.jar:/root/paddlejob/hadoop-client/hadoop/lib/javassist-3.16.1-GA.jar:/root/paddlejob/hadoop-client/hadoop/lib/javax.annotation-1.0.0.v20100513-0750.jar:/root/paddlejob/hadoop-client/hadoop/lib/jaxen-1.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jets3t-0.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-6.1.14-patched.jar:/root/paddlejob/ha
doop-client/hadoop/lib/jetty-util-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/jna-3.5.2.jar:/root/Mon Jul 24 19:59:37 2023[1,0]:paddlejob/hadoop-client/hadoop/lib/json-20090211.jar:/root/paddlejob/hadoop-client/hadoop/lib/json-simple-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jsp-api-2.1-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/jsp-api-2.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/junit-3.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/kfs-0.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/libthrift-0.9.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/log4j-1.2.17.jar:/root/paddlejob/hadoop-client/hadoop/lib/log4j-api-2.0-beta4.jar:/root/paddlejob/hadoop-client/hadoop/lib/log4j-core-2.0-beta4.jar:/root/paddlejob/hadoop-client/hadoop/lib/lz4-1.3.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-artifact-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-artifact-manager-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-model-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-plugin-api-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-plugin-registry-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-profile-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-project-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-repository-metadata-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-settings-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-shared-io-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/mockito-all-1.10.19.jar:/root/paddlejob/hadoop-client/hadoop/lib/mockito-core-1.9.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/mysql-connector-java-5.1.30.jar:/root/paddlejob/hadoop-client/hadoop/lib/naming-sdk-java-1.0.0.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/netty-3.2.4.Final.jar:/root/paddlejob/hadoop-client/hadoop/lib/netty-3.6.6.Final.jar:/root/paddlejob/hadoop-client/hadoop/lib/objenesis-1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/oro-2.0.8.jar:/root/paddle
job/hadoop-client/hadoop/lib/paranamer-2.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/pbrpc4j-1.0.10.1-SNAPSHOT.jar:/root/paddlejob/hadoop-client/hadoop/lib/peta-4.1.21.jar:/root/paddlejob/hadoop-client/hadoop/lib/plexus-container-default-1.0-alpha-9-stable-1.jar:/root/paddlejob/hadoop-client/hadoop/lib/plexus-interpolation-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/plexus-utils-1.5.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/protobuf-java-2.4.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/reflections-0.9.9-RC1.jar:/root/paddlejob/hadoop-client/hadoop/lib/servlet-api-2.5-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/servlet-api-2.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/slf4j-api-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/slf4j-log4j12-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/snappy-java-1.1.2.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/streaming-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/tk-client-2.0.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/tools-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/ustreaming-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/velocity-1.7.jar:/root/paddlejob/hadoop-client/hadoop/lib/wagon-provider-api-1.0-beta-2.jar:/root/paddlejob/hadoop-client/hadoop/lib/xmlenc-0.52.jar:/root/paddlejob/hadoop-client/hadoop/lib/zookeeper-1.0.10.inf.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/commons-el.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/jasper-compiler.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/jasper-runtime.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/jsp-api.jar', 'VSCODE_IPC_HOOK_CLI': '/tmp/vscode-ipc-0a2fbfeb-6f25-4c63-8f33-d6f0672f3ffe.sock', 'TRAINERS': '1', 'DISTRIBUTED_TRAINER_ENDPOINTS': '10.127.19.149:35368,10.127.19.149:35369', 'K8S_SERVER': 'http://api-k8s.kongming.baidu-int.com:8180', 'LOG': 'log', 'SYS_PRIVILEGE_AK': '5aed3a0335c4501a9e697fce1af1ca36', 'TRAINER_HOSTS_NUM': '5', 'IS_OUTPUT_AUTO_UPLOAD': '1', 
'USE_HOST_PORT_ALLOC': '0', 'TRAINERS_NUM': '1', 'TRAININGJOB_REPLICA_INDEX': '0', 'MPI_SLOTS_NUM': '1', 'TRAININGJOB_REPLICA_RESTARTCOUNT': '0', 'WITH_GPU': 'ON', 'GOPATH': '/root/gopath', 'CGPU0_COMPUTE_LIMIT': '100', 'SCRIPT_UPLOAD_PATH': '/root/paddlejob/workspace/upload_tools', 'PMIX_SERVER_TMPDIR': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817/0/0', 'IREPO_INIT_MODEL_REVISION': '', 'DEV_DATA_ID': '', 'BROWSER': '/root/code-server/lib/vscode/bin/helpers/browser.sh', 'PADDLE_TRAINING_ROLE': 'TRAINER', 'OMPI_COMM_WORLD_NODE_RANK': '0', 'THIRDPARTY_PATH': '', 'VSCODE_GIT_ASKPASS_NODE': '/root/code-server/lib/node', 'GIT_ASKPASS': '/root/code-server/lib/vscode/extensions/git/dist/askpass.sh', 'HUB_SERVER': 'http://gzbh-aip-paddlehub01.gzbh.baidu.com:8888/paddlehub;http://gzbh-aip-paddlehub02.gzbh.baidu.com:8888/paddlehub', 'K8S_TRAINERS_COUNT': '1', 'SYS_EXP_TOKEN': 'e1adece161b93e74f343771875605a93', 'OMPI_NUM_APP_CTX': '1', 'IREPO_INIT_MODEL_REPONAME': '', 'SYS_RDMA_TCP': '', 'DICT_PATH': '', 'NODE_EXEC_PATH': '/root/code-server/lib/node', 'KUBERNETES_PORT_443_TCP_ADDR': '11.1.0.1', 'MOUNT_AFS': 'true', 'TRAININGJOB_POD_TYPE': 'normal', 'TRAINER_PORTS': '35368,35369', 'SYS_NCCL_CHECK': '0', 'TRAININGJOB_SERVICE': '10.127.19.149', 'UPLOAD_STATUS_AK': 'eb8a2a50d44a5a9e8e77896d1670302c', 'STDERR_LOG_PATH': '/root/paddlejob/workspace/log/err.log', 'TRAININGJOB_NAMESPACE': 'group-1f8cc06f-1968-dbae-e20c-7fa9216c1971', 'NVIDIA_VISIBLE_GPUS_SLOT': '4,5', 'WEBIDE_CLOUD_SERVER_URL': 'http://codelab.baidu-int.com/cloud', 'JOB_HEARTBEAT_INTERVAL': '25',
'NCCL_IB_QPS_PER_CONNECTION': '8', 'KUBERNETES_PORT_443_TCP': 'tcp://11.1.0.1:443', 'PADDLE_EDL_AUTO_CHECKPOINT_FLAG': '1', 'TRAIN_DATA_PATH': '', 'SYS_HYPER_PARAMS': '', 'START_CMD': 'sh submit/start_cmd.sh', 'PADDLE_NUM_GRADIENT_SERVERS': '1', 'FLAGS_cudnn_exhaustive_search': 'True', 'OMPI_MCA_orte_local_daemon_uri': '3340042240.0;tcp://10.127.19.149,192.168.5.1:10114', 'HFI_NO_BACKTRACE': '1', 'INIT_MODEL_PATH': '', 'PMIX_SERVER_URI2': '3340042240.0;tcp4://127.0.0.1:10000', 'DISTILLATION_DATA_PATH': '', 'SYS_NICS': '', 'COLORTERM': 'truecolor', 'POD_IP': '10.127.19.149', 'IREPO_URL': 'http://irepo.baidu-int.com', 'SYS_BASE_FILEPATH': '/user/paddlecloud/paddle-platform', 'K8S_PSERVERSCOUNT': '0', '': '/usr/local/bin/python3', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/usr/local/lib/python3.7/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/usr/local/lib/python3.7/site-packages/cv2/qt/fonts', 'POD_NAME': 'cvznnn', 'PADDLE_MASTER': '10.127.19.149:41403', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
Mon Jul 24 19:59:37 2023[1,0]:LAUNCH INFO 2023-07-24 19:59:37,828 ------------------------- ERROR LOG DETAIL -------------------------
Mon Jul 24 19:59:37 2023[1,0]:
Mon Jul 24 19:59:38 2023[1,0]:LAUNCH INFO 2023-07-24 19:59:38,429 Exit code -6

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


orterun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[50965,1],0]
Exit code: 250
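The useful signal in a log like the one above is buried between warnings and the environment dump. As a rough aid, the sketch below pulls out the op name, the NaN/Inf counts, and the offending element indices. The regular expressions are modeled only on the lines seen in this log and are an assumption; other Paddle versions may print a different format. `summarize_nan_report` is a hypothetical helper, not part of Paddle, and reading the three numbers in "3,0,36 nan,inf,num" as NaN count, Inf count, and finite count is an interpretation (3 + 0 + 36 matches numel:39).

```python
import re

# Patterns copied from the shape of the log lines above (assumed format).
VALUE_RE = re.compile(r"numel:(\d+) idx:(\d+) value:(\S+)")
BLOCK_RE = re.compile(r"In block (\d+), there has (\d+),(\d+),(\d+) nan,inf,num")
OP_RE = re.compile(r"in \[op=(\w*)\] \[tensor=([^\]]*)\]")


def summarize_nan_report(log_text):
    """Collect op name, counts, and NaN/Inf element indices from a log string."""
    bad_indices = set()
    for _numel, idx, value in VALUE_RE.findall(log_text):
        lowered = value.lower()
        if "nan" in lowered or "inf" in lowered:
            bad_indices.add(int(idx))

    summary = {"op": None, "tensor": None, "bad_indices": sorted(bad_indices)}

    op_match = OP_RE.search(log_text)
    if op_match:
        summary["op"] = op_match.group(1)
        summary["tensor"] = op_match.group(2)

    block_match = BLOCK_RE.search(log_text)
    if block_match:
        # Interpreted as: NaN count, Inf count, finite-value count.
        summary["nan_count"] = int(block_match.group(2))
        summary["inf_count"] = int(block_match.group(3))
        summary["finite_count"] = int(block_match.group(4))
    return summary


# A few lines taken from the log above, without the timestamp prefixes.
sample = """\
numel:39 idx:34 value:-nan
numel:39 idx:0 value:0.498221
numel:39 idx:26 value:-nan
numel:39 idx:31 value:-nan
In block 0, there has 3,0,36 nan,inf,num
===ERROR: in [op=gather] [tensor=] find nan or inf===
"""
report = summarize_nan_report(sample)
print(report)
```

On this log the summary points at the `gather` op with NaN values at element indices 26, 31, and 34, which suggests the inputs to that op (or whatever produced them) are the place to start looking.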

warrentdrew commented 1 year ago

NaN values have appeared. You can run with export FLAGS_check_nan_inf=1 to locate exactly where the NaN occurs. See https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/flags/check_nan_inf_cn.html#check-nan-inf
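When the training script is started from Python rather than from a shell, the same flag can be passed through the child process environment. The sketch below is a minimal illustration under that assumption; `env_with_nan_check` is a hypothetical helper, and the `train.py` name in the usage comment is a placeholder. Note that the environment dump in the original report appears to already list 'FLAGS_check_nan_inf': 'True', which would explain why the "find nan or inf" message for `op=gather` was printed.

```python
import os


def env_with_nan_check(base_env=None):
    """Return a copy of the environment with FLAGS_check_nan_inf enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Equivalent to `export FLAGS_check_nan_inf=1` in the launching shell.
    env["FLAGS_check_nan_inf"] = "1"
    return env


# Usage sketch (train.py is a placeholder for the real entry point):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=env_with_nan_check(), check=True)
example_env = env_with_nan_check({"PYTHONPATH": "."})
print(example_env)
```

Building a copy instead of mutating `os.environ` keeps the flag scoped to the launched job, so other processes started from the same interpreter are unaffected.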