Closed soyons closed 1 year ago
NaN values have appeared in training. You can run with export FLAGS_check_nan_inf=1 to locate exactly where the NaN first occurs. See https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/flags/check_nan_inf_cn.html#check-nan-inf
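With the flag enabled, the worker log in this issue contains three kinds of diagnostic lines: per-element dumps (`numel:39 idx:34 value:-nan`), a per-block count (`In block 0, there has 3,0,36 nan,inf,num`), and the name of the offending op (`===ERROR: in [op=gather] [tensor=] find nan or inf===`). Below is a minimal sketch for pulling these out of a long log. The function name `summarize_nan_inf` and the regular expressions are my own, written against the lines pasted in this issue; other Paddle versions may print a different format.

```python
import re
from collections import Counter

# Patterns modelled on the log lines in this issue (paddlepaddle-gpu 2.4.2).
# They are assumptions about the format, not an official Paddle contract.
OP_RE = re.compile(
    r"===ERROR: in \[op=(?P<op>[^\]]*)\] \[tensor=(?P<tensor>[^\]]*)\] find nan or inf==="
)
COUNT_RE = re.compile(
    r"In block (?P<block>\d+), there has (?P<nan>\d+),(?P<inf>\d+),(?P<num>\d+) nan,inf,num"
)
VALUE_RE = re.compile(r"numel:(?P<numel>\d+) idx:(?P<idx>\d+) value:(?P<value>\S+)")


def summarize_nan_inf(lines):
    """Collect offending ops, per-block counts, and NaN/Inf element indices."""
    ops = Counter()
    blocks = []
    bad_indices = set()
    for line in lines:
        m = OP_RE.search(line)
        if m:
            ops[m.group("op")] += 1
        m = COUNT_RE.search(line)
        if m:
            blocks.append({k: int(v) for k, v in m.groupdict().items()})
        m = VALUE_RE.search(line)
        if m:
            value = m.group("value").lower()
            if "nan" in value or "inf" in value:
                bad_indices.add(int(m.group("idx")))
    return {"ops": dict(ops), "blocks": blocks, "bad_indices": sorted(bad_indices)}


# Lines copied from the log in this issue.
sample = [
    "Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:34 value:-nan",
    "Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:0 value:0.498221",
    "Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:26 value:-nan",
    "Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:31 value:-nan",
    "Mon Jul 24 19:59:36 2023[1,0]:In block 0, there has 3,0,36 nan,inf,num",
    "failed. ===ERROR: in [op=gather] [tensor=] find nan or inf===",
]
summary = summarize_nan_inf(sample)
print(summary)
```

For a real run, pass the opened log file instead of `sample` (the launcher output above mentions `log/workerlog.0`). In this log the check fires on the `gather` op with 3 NaN values out of 39 elements, so the tensor fed into that `gather` call is probably the first place to inspect.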
Issue description
python3.7.0 v100-32g NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2
paddlepaddle-gpu 2.4.2
Mon Jul 24 19:59:20 2023[1,0]:Total params: 1172702
Mon Jul 24 19:59:20 2023[1,0]:Trainable params: 166536
Mon Jul 24 19:59:20 2023[1,0]:Non-trainable params: 1006166
Mon Jul 24 19:59:20 2023[1,0]:Training with custom optimizer
Mon Jul 24 19:59:26 2023[1,0]:/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.float32, the right dtype will convert to paddle.int64
Mon Jul 24 19:59:26 2023[1,0]: .format(lhs_dtype, rhs_dtype, lhs_dtype))
Mon Jul 24 19:59:26 2023[1,0]:/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.int64, the right dtype will convert to paddle.float32
Mon Jul 24 19:59:26 2023[1,0]: .format(lhs_dtype, rhs_dtype, lhs_dtype))
Mon Jul 24 19:59:26 2023[1,0]:/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int32, but right dtype is paddle.int64, the right dtype will convert to paddle.int32
Mon Jul 24 19:59:26 2023[1,0]: .format(lhs_dtype, rhs_dtype, lhs_dtype))
Mon Jul 24 19:59:28 2023[1,0]:epoch:0,step:10,rank:0,loss:0.0238,lr:0.000010
Mon Jul 24 19:59:28 2023[1,0]: decision_loss:0.023788, decision: 42, point: 50
Mon Jul 24 19:59:28 2023[1,0]: binary_gaussian:-0.8112,kf_lane_loss:2.6619, anchor_loss:0.0000,point_loss:58.3817
Mon Jul 24 19:59:28 2023[1,0]: cross_num: 16.0, cross_loss:0.6736
Mon Jul 24 19:59:28 2023[1,0]: close_l_num: 8.0, close_l_loss:0.5307
Mon Jul 24 19:59:28 2023[1,0]: kf_top1_lane_accuracy:0.6600,kf_top3_lane_accuracy:0.8800
Mon Jul 24 19:59:32 2023[1,0]:epoch:0,step:20,rank:0,loss:0.0104,lr:0.000010
Mon Jul 24 19:59:32 2023[1,0]: decision_loss:0.010432, decision: 36, point: 93
Mon Jul 24 19:59:32 2023[1,0]: binary_gaussian:0.4726,kf_lane_loss:6.7281, anchor_loss:0.0000,point_loss:41.7945
Mon Jul 24 19:59:32 2023[1,0]: cross_num: 16.0, cross_loss:0.8788
Mon Jul 24 19:59:32 2023[1,0]: close_l_num: 20.0, close_l_loss:1.7077
Mon Jul 24 19:59:32 2023[1,0]: kf_top1_lane_accuracy:0.7527,kf_top3_lane_accuracy:0.9032
Mon Jul 24 19:59:34 2023[1,0]:epoch:0,step:30,rank:0,loss:0.0274,lr:0.000010
Mon Jul 24 19:59:34 2023[1,0]: decision_loss:0.027361, decision: 40, point: 98
Mon Jul 24 19:59:34 2023[1,0]: binary_gaussian:0.2453,kf_lane_loss:5.5190, anchor_loss:0.0000,point_loss:29.8723
Mon Jul 24 19:59:34 2023[1,0]: cross_num: 12.0, cross_loss:0.6826
Mon Jul 24 19:59:34 2023[1,0]: close_l_num: 11.0, close_l_loss:0.6953
Mon Jul 24 19:59:34 2023[1,0]: kf_top1_lane_accuracy:0.7041,kf_top3_lane_accuracy:0.8980
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:34 value:-nan
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:0 value:0.498221
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:1 value:0.024911
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:2 value:0.498221
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:26 value:-nan
Mon Jul 24 19:59:36 2023[1,0]:numel:39 idx:31 value:-nan
Mon Jul 24 19:59:36 2023[1,0]:In block 0, there has 3,0,36 nan,inf,num
Mon Jul 24 19:59:36 2023[1,0]:Error: /paddle/paddle/fluid/framework/details/nan_inf_utils_detail.cu:105 Assertion :terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
Mon Jul 24 19:59:36 2023[1,0]: what(): (External) CUDA error(719), unspecified launch failure.
Mon Jul 24 19:59:36 2023[1,0]: [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc:80)
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:--------------------------------------
Mon Jul 24 19:59:36 2023[1,0]:C++ Traceback (most recent call last):
Mon Jul 24 19:59:36 2023[1,0]:--------------------------------------
Mon Jul 24 19:59:36 2023[1,0]:0 egr::Backward(std::vector<paddle::experimental::Tensor, std::allocator > const&, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool)
Mon Jul 24 19:59:36 2023[1,0]:1 egr::RunBackward(std::vector<paddle::experimental::Tensor, std::allocator > const&, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool, bool, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool, std::vector<paddle::experimental::Tensor, std::allocator > const&)
Mon Jul 24 19:59:36 2023[1,0]:2 egr::GradNodeAccumulation::operator()(paddle::small_vector<std::vector<paddle::experimental::Tensor, std::allocator >, 15u>&, bool, bool)
Mon Jul 24 19:59:36 2023[1,0]:3 egr::GradNodeAccumulation::ApplyReduceHooks()
Mon Jul 24 19:59:36 2023[1,0]:4 paddle::distributed::EagerReducer::AddDistHook(unsigned long)
Mon Jul 24 19:59:36 2023[1,0]:5 paddle::distributed::EagerReducer::MarkVarReady(unsigned long, bool)
Mon Jul 24 19:59:36 2023[1,0]:6 paddle::distributed::EagerReducer::FinalizeBackward()
Mon Jul 24 19:59:36 2023[1,0]:7 paddle::experimental::Tensor::reset()
Mon Jul 24 19:59:36 2023[1,0]:8 std::_Sp_counted_ptr_inplace<phi::DenseTensor, std::allocator, (gnu_cxx::_Lock_policy)2>::_M_dispose()
Mon Jul 24 19:59:36 2023[1,0]:9 std::_Sp_counted_deleter<phi::Allocation, std::function<void (phi::Allocation)>, std::allocator, ( gnu_cxx::_Lock_policy)2>::_M_dispose()
Mon Jul 24 19:59:36 2023[1,0]:10 paddle::memory::allocation::StatAllocator::FreeImpl(phi::Allocation)
Mon Jul 24 19:59:36 2023[1,0]:11 paddle::memory::allocation::RetryAllocator::FreeImpl(phi::Allocation )
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:36 2023[1,0]:----------------------
Mon Jul 24 19:59:36 2023[1,0]:Error Message Summary:
Mon Jul 24 19:59:36 2023[1,0]:----------------------
Mon Jul 24 19:59:36 2023[1,0]:FatalError: : [TimeInfo: Aborted at 1690199976 (unix time) try "date -d @1690199976" if you are using GNU date ]
Mon Jul 24 19:59:36 2023[1,0]: [SignalInfo: SIGABRT (@0x4a22) received by PID 18978 (TID 0x7f1d3be0d640) from PID 18978 ]
Mon Jul 24 19:59:36 2023[1,0]:
Mon Jul 24 19:59:37 2023[1,0]:LAUNCH INFO 2023-07-24 19:59:37,827 Pod failed
Mon Jul 24 19:59:37 2023[1,0]:LAUNCH ERROR 2023-07-24 19:59:37,828 Container failed !!!
Mon Jul 24 19:59:37 2023[1,0]:Container rank 0 status failed cmd ['/usr/local/bin/python3', '-u', 'multipath/predecision_main.py', '--config', 'multipath/conf/predecision.conf', '--data_path', 'afs/jn_bicycle_rule/junction_gostraight'] code -6 log log/workerlog.0
Mon Jul 24 19:59:37 2023[1,0]:env {'PYTHONPATH': '.', 'CPLUS_INCLUDE_PATH': '/usr/local/python2.7.15/include/python2.7:/usr/local/python3.5.1/include/python3.5:', 'PSERVERS_NUM': '', 'CUDNN_VERSION': '7.6.5.32', 'SYS_REAL_DOWNLOAD': '1', 'KUBE_DEPENDENCY': '/home/kubernetes/dependency', 'CGPU1_SHAREMODE': '7', 'SYS_OUTPUT_PATH': '/user/ad-pnc/prediction/train_result//liqinghai01/job-0bb64be60f44d054', 'TRAINER_GPU_CARD_COUNT': '2', 'PMIX_ID': '3340042241.0', 'OMPI_COMM_WORLD_RANK': '0', 'PMIX_NAMESPACE': '3340042241', 'PSERVER_IP_PORT_LIST': '', 'PREDICT_DATA_ID': '', 'GPU_RATIO': '1.0', 'SYS_VOLUME_MOUNT': '/root/paddlejob/workspace', 'PSERVER_MODEL_DIR': '', 'TERM_PROGRAM': 'vscode', 'HOSTNAME': 'yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com', 'PADDLE_USE_GPU': '1', 'HOSTNAME_TO_IP': '1', 'version': '2.7.15', 'NVIDIA_REQUIRE_CUDA': 'cuda>=10.2 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441', 'FS_UGI': 'ad-pnc-mix,ad-pnc-mix_passw0rd', 'PSERVER_IP_LIST': '', 'TERM': 'xterm-256color', 'SYS_JOB_VERSION': 'paddle-v2.4.0', 'KUBERNETES_PORT': 'tcp://11.1.0.1:443', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'CGPU1_COMPUTE_LIMIT': '100', 'FLAGS_call_stack_level': '1', 'OMPI_MCA_pmix': '^s1,s2,cray,isolated', 'OMPI_MCA_orte_ess_num_procs': '1', 'NCCL_SOCKET_IFNAME': 'xgbe0', 'COMBINED_OUTPUT_PATH': '/user/ad-pnc/prediction/train_result//liqinghai01/job-0bb64be60f44d054/', 'SYS_IS_ABACUS_CLUSTER': '0', 'HADOOP_HOME': '/root/paddlejob/hadoop-client/hadoop', 'PSERVERS': '', 'DICT_ID': '', 'TRIANER_IP_LIST': '10.127.19.149', 'SYS_DOWNLOAD_THREAD_NUM': '15', 'AIFLOW_URL': 'paddlecloud.baidu-int.com', 'CGPU0_SHAREMODE': '7', 'OMPI_MCA_ess_base_vpid': '0', 'IREPO_UPLOAD_MODEL_REPONAME': '', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'MPI_ON_K8S': '1', 'OMPI_MCA_ess_base_jobid': '3340042241', 'TERM_PROGRAM_VERSION': '1.75.1', 'LD_PRELOAD': 
'/usr/lib/x86_64-linux-gnu/coreutils/libstdbuf.so', 'IS_WHITELIST_JOB': '0', 'KUBERNETES_SERVICE_PORT': '443', 'TRAINER_INSTANCES': '10.127.19.149', 'PMIX_RANK': '0', 'TRAINER_PORTS_NUM': '2', 'OLDPWD': '/root/paddlejob', 'OMPI_MCA_orte_launch': '1', 'PADDLE_TRAINERS_NUM': '2', 'TRAINER_IP_PORT_LIST': '10.127.19.149:35368,10.127.19.149:35369', 'SYS_PRIVILEGE_SK': '2c49b0baef1b5467a2353b2821782f5b', 'FAULT_TOLERANT': 'False', 'K8S_ENTRY_FILE_NAME': 'trainer.py', 'CLUSTER_NAME': 'v100-32-0-cluster', 'K8S_ENTRY_CMD': 'python trainer.py', 'OMPI_MCA_orte_num_nodes': '1', 'NCCL_DEBUG_FILE': '/root/paddlejob/workspace/log/nccl.%p.log', 'VDL_LOG_PATH': 'afs://ad-pnc-mix:ad-pnc-mix_passw0rd@feilian.afs.baidu.com:9902/user/ad-pnc/prediction/train_result//liqinghai01/job-0bb64be60f44d054/visualdl_log_dir', 'KUBERNETES_SERVICE_HOST': '11.1.0.1', 'FLAGS_check_nan_inf': 'True', 'OMPI_COMM_WORLD_LOCAL_RANK': '0', 'OMPI_MCA_orte_hnp_uri': '3340042240.0;tcp://10.127.19.149,192.168.5.1:10114', 'PADDLE_TRAINER_ID': '0', 'SYS_VOLUME_PATH': '/home/work/containers/413a96eb-0844-451a-b2b4-fd2004bbe1be', 'OMPI_ARGV': 'submit/job.sh', 'OMPI_MCA_initial_wdir': '/root/paddlejob/workspace/env_run', 'TRAIN_DATA_ID': '', 'LC_ALL': 'en_US.UTF-8', 'SYS_USER_NAME': 'liqinghai01', 'SYS_USE_HADOOP_VFS': 'False', 'SYS_SERVICE_PORT': '8676', 'OUTPUT_PATH': '/user/ad-pnc/prediction/train_result/', 'WEBIDE_PLATFORM': 'PaddleCloud_Job', 'PADDLE_PORTS_NUM': '26', 'CODE_URI': '/user/ad-pnc/prediction/train_result//paddlecloud_code/junction_noturn_paddle_20230724193052516119.tar.gz', 'AFS_REMOTE_MOUNT_POINT': '/user/ad-pnc/prediction/', 'DISTRIBUTE_JOB_TYPE': 'PSERVER', 'ETCD_IMAGE': 'registry.baidu.com/bml/etcd:v3.2.1', 'AIFLOW_PLAT_NAME': 'pdc_backend', 'POD_0_PORTS': '35368,35369', 'RUNTIME_WORKDIR': '/root/paddlejob/workspace/env_run', 'JRE_HOME': '/jre', 'SYS_USER_ID': 'c1c8b81c-5ffeMon Jul 24 19:59:37 2023[1,0]:-5e1e-9874-d43de34cb602', 'OMPI_MCA_orte_timestamp_output': '1', 
'IREPO_INIT_MODEL_SPACENAME': 'paddlecloud_space', 'NCCL_IB_TIMEOUT': '22', 'NCCL_IB_GID_INDEX': '3', 'TRAINER_MEMORY_LIMITS': '110Gi', 'NVIDIA_VISIBLE_DEVICES': 'GPU-599d41b0-3420-628e-57d2-e4a4034efb9c,GPU-6379c489-466f-2081-3c07-16bbe5ac5451', 'FAULT_TOLERANCE_ENV_PATH': '/root/paddlejob/fault_tolerance.env', 'LD_LIBRARY_PATH': '/opt/_internal/cpython-3.7.0/lib:/opt/conda/envs/py36/lib:/usr/local/lib:/usr/local/python2.7.15/lib:/opt/_internal/cpython-2.7.11-ucs4/lib:/opt/_internal/cpython-2.7.15-ucs4/lib:/opt/conda/envs/py27/lib:/opt/OpenBLAS:/:/opt/hadoop-client/hadoop/../java6/jre/lib/amd64:/opt/hadoop-client/hadoop/../java6/jre/lib/amd64/native_threads:/opt/hadoop-client/hadoop/../java6/jre/lib/amd64/server:/opt/hadoop-client/hadoop/lib/native/Linux-amd64-64:/usr/local/x86_64-pc-linux-gnu/lib:/home/opt/nvidia_lib:/usr/local/cuda/lib64:/usr/lib64:/usr/local/lib:/nccl/lib:/home/work/cudnn/cudnn_v7/cuda/lib64:/home/work/cudnn/cudnn_v6/cuda/lib64:/home/work/cudnn/cudnn_v5/cuda/lib64:/home/work/cuda-9.0/lib64:/home/work/cuda-8.0/lib64:/usr/lib64/mlnx_ofed/valgrind:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/python2.7/site-packages/paddle/libs:$LD_LIBRARY_PATH:/root/paddlejob/hadoop-client/hadoop/libdfs/:/root/paddlejob/hadoop-client/hadoop/../java6/jre/lib/amd64:/root/paddlejob/hadoop-client/hadoop/../java6/jre/lib/amd64/native_threads:/root/paddlejob/hadoop-client/hadoop/../java6/jre/lib/amd64/server:/root/paddlejob/hadoop-client/hadoop/lib/native/Linux-amd64-64', 'OMPI_UNIVERSE_SIZE': '1', 'PSERVER_PORTS': '', 'PREDICT_DATA_PATH': '', 'JOB_CATEGORY': 'general', 'NAMESPACE': 'group-1f8cc06f-1968-dbae-e20c-7fa9216c1971', 'TRAIN_LOG_PATH': '/root/paddlejob/workspace/log/run.log', 'OMPI_MCA_mpi_yield_when_idle': '0', 'GPUTRAINER_ENDPOINTS': '10.127.19.149:35368,10.127.19.149:35369', 'PADDLE_CURRENT_ENDPOINT': '10.127.19.149:41404', 'WITH_AVX': 'ON', 'NVIDIA_VISIBLE_GPUS_UUID': 'GPU-599d41b0-3420-628e-57d2-e4a4034efb9c,GPU-6379c489-466f-2081-3c07-16bbe5ac5451', 
'SYS_TMP_FILEPATH': '/user/paddlecloud/paddle-platform/buffer', 'SYS_API_HOST': 'paddlecloud.baidu-int.com', 'RESERVED_PORT_NUM': '3', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'OMPI_COMMAND': 'sh', 'OMPI_FILE_LOCATION': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817/0/0', 'OMPI_APP_CTX_NUM_PROCS': '1', 'SYS_URL_PREFIX': 'yq01-rdqa-bml27.yq01.baidu.com', 'DFS_USE_NATIVE_API': '0', 'AIFLOW_PLAT_TOKEN': '34bc918c410b1c9bae760c97ad1796fc', 'PADDLE_JOB_DIR': '/root/paddlejob', 'PADDLE_TRAINER_ENDPOINTS': '10.127.19.149:41404,10.127.19.149:41405', 'IREPO_PLAT_TOKEN': 'c8571e89-e7be-4aa7-9c9a-ac1889e60e92', 'SYS_INFLUX_DB_URL': 'http://paddlecloud.baidu-int.com:80', 'PMIX_PTL_MODULE': 'tcp,usock', '_STDBUF_O': 'L', 'TRAINER_INSTANCES_NUM': '1', 'TRAININGJOB_NAME': 'job-0bb64be60f44d054', 'MASTERMEMORY': '300Mi', 'CGPU_COUNT': '2', 'PMIX_SERVER_URI21': '3340042240.0;tcp4://127.0.0.1:10000', 'TRAINER_MEMORY_REQUESTS': '110Gi', 'SYS_API_PORT': '80', 'K8S_TMP_DIR_NAME': 'tmp', 'PSERVER_LOADSAVE_PARAMETERS_IN_PSERVER': '0', 'PADDLE_LOCAL_SAVE_DIR': './output', 'OMPI_MCA_orte_precondition_transports': 'df803d45ff64c5fb-ea9fa3e51b7ad1b5', 'TEST_DATA_ID': '', 'FLUME_SERVER_PORT': '35371', 'SYS_LOCAL_SAVE_DIR': './output', 'IDE_WORKDIR': '/home/work/mnt/project', 'NVIDIA_TOOLS': '/home/opt/cuda_tools', 'TRAINING_ROLE': 'TRAINER', 'IS_FILEBEAT': '1', 'HUB_HOME': '/root/paddlejob/workspace', 'FLUME_HOME': '/root/paddlejob/flume-1.8.0', 'PATH': '/home/opt/cuda_tools/:/opt/_internal/cpython-3.7.0/bin:/opt/conda/envs/py36/bin:/usr/local/bin:/usr/bin:/root/paddlejob/hadoop-client/hadoop/bin:/usr/local/bin:/usr/local/openmpi-3.1.0/bin:/home/cmake-3.16.0-Linux-x86_64/bin:/home/opt/cuda_tools:/root/paddlejob/jdk-1.8.0/bin:/root/paddlejob/flume-1.8.0/bin:/root/paddlejob/hadoop-client/hadoop/bin:/usr/local/gcc-8.2/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'VTFS_VERSION': '', 
'NCCL_DEBUG_SUBSYS': 'INIT', 'STORAGE_TYPE': 'afs', 'OMPMon Jul 24 19:59:37 2023[1,0]:I_MCA_orte_jobfam_session_dir': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817', 'OMPI_MCA_orte_tmpdir_base': '/tmp', 'OMPI_COMM_WORLD_LOCAL_SIZE': '1', 'IPATH_NO_BACKTRACE': '1', 'SYS_JOB_NAME': 'ftn-jn-vru-20230724-1930', 'SSHD_PORT': '35370', 'PSERVER_MEMORY_LIMITS': '0', 'END_POINT': 'client', 'SYS_JOB_ID': 'job-0bb64be60f44d054', 'AFS_LOCAL_MOUNT_POINT': '/root/paddlejob/workspace/env_run/afs/', 'PWD': '/root/paddlejob', 'OMPI_MCA_orte_tag_output': '1', 'VERSION_LIST_SUPPORT_PY3': '', 'PADDLE_PORT': '35368', 'IS_STANDALONE': '1', 'PSERVER_MEMORY_REQUESTS': '0', 'JAVA_HOME': '/root/paddlejob/hadoop-client/hadoop/../java6', 'OMPI_COMM_WORLD_SIZE': '1', 'TEST_DATA_PATH': '', 'SYS_FS_UGI': 'paddlecloud,pdcpdc2020', 'PSERVER_INSTANCES': '', 'LANG': 'en_US.UTF-8', 'PADDLE_CLUSTER_TRAIN': 'True', 'TRAININGJOB_REPLICA_NAME': 'trainer', 'TRAIN_WORKSPACE': '/root/paddlejob/workspace', 'FS_NAME': 'afs://feilian.afs.baidu.com:9902', 'CLUSTER_TYPE': 'k8s-new', 'SYS_PYTHON_CMD': 'python', 'OMPI_FIRST_RANKS': '0', 'TEACHER_JOB_ID': '', 'S_COLORS': 'auto', 'TZ': 'Asia/Shanghai', 'PADDLE_JOB_NAME': 'job-0bb64be60f44d054', 'TRAININGJOB_PORTS': '35368,35369,35370,35371,35372', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'CUDA_PKG_VERSION': '10-2=10.2.89-1', 'SYS_FS_NAME': 'afs://baihua.afs.baidu.com:9902', 'HOME_WORK_DIR': '/root/paddlejob', 'PADDLE_IS_LOCAL': '1', 'IREPO_PLAT': 'paddle-cloud', 'POD_0_IP': '10.127.19.149', 'POD_INDEX': '0', 'CUDA_VERSION': '10.2.89', 'SYS_AFS_MOUNT': 'true', 'MASTERCPU': '1', 'PADDLE_WORKERS_IP_PORT_LIST': '10.127.19.149:35368,10.127.19.149:35369', 'PMIX_GDS_MODULE': 'ds12,hash', 'DEV_DATA_PATH': '', 'JAVA_TOOL_OPTIONS': '-Djava.compiler=NONE', 'WEBIDE_PLS_PORT': '35372', 'PADDLE_TRAINER_COUNT': '2', 'SYS_SUBDIR_LEVEL': '1', 'OMPI_MCA_orte_top_session_dir': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0', 
'OMPI_MCA_orte_app_num': '0', 'WEBIDE_USERID': 'c1c8b81c-5ffe-5e1e-9874-d43de34cb602', 'VSCODE_PROXY_URI': 'http://10.127.19.149:8080/proxy/{{port}}/', 'SYS_TEST_DOWNLOAD_DESTINATION': './', 'PMIX_DSTORE_ESH_BASE_PATH': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817/pmix_dstor_18817', 'PADDLE_VERSION': '', 'VDL_USE_NATIVE_API': '1', 'USE_PFS': 'false', 'UPLOAD_STATUS_SK': '2401d98a1bc65a35936d6bc0aef010f0', 'TRAININGJOB_REPLICA_TYPE': 'worker', 'PSERVER_PORTS_NUM': '0', 'HOME': '/root', 'SHLVL': '7', 'VTFS_REPO': '', 'VSCODE_GIT_ASKPASS_MAIN': '/root/code-server/lib/vscode/extensions/git/dist/askpass-main.js', 'LANGUAGE': 'en_US.UTF-8', 'GOROOT': '/usr/local/go', 'PSERVER_NUM_THREADS': '1', 'OPENMPI_HOME': '/usr/local/openmpi-3.1.0', 'PMIX_SECURITY_MODE': 'native,none', 'NCCL_IB_DISABLE': '1', 'SYS_DOWNLOAD_DESTINATION': './', 'IS_CODELAB_ENABLED': '1', 'DFS_AGENT_PORT': '21270', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'OMPI_MCA_orte_ess_node_rank': '0', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'PSERVER_MODEL_PASS': '', 'PADDLE_TRAINERS': '10.127.19.149', 'SYS_TMP_MULTIFILE_DIR': 'env_run', 'NVIDIA_LIB': '/usr/local/nvidia/lib64', 'NCCL_VERSION': '2.7.8', 'FORCE_REUSE_OUTPUT_PATH': 'True', 'OMPI_MCA_ess': '^singleton', 'HADOOP_LIB_DIR': '/root/paddlejob/hadoop-client/hadoop/lib', 'IREPO_UPLOAD_MODEL_SPACENAME': 'paddlecloud_space', 'PMIX_BFROP_BUFFER_TYPE': 'PMIX_BFROP_BUFFER_NON_DESC', 'OMPI_MCA_shmem_RUNTIME_QUERY_hint': 'mmap', 'STDOUT_LOG_PATH': '/root/paddlejob/workspace/log/train.log', 'OMPI_MCA_hwloc_base_binding_policy': 'none', 'VSCODE_GIT_IPC_HANDLE': '/tmp/vscode-git-ae2eb349bb.sock', 'LC_CTYPE': 'C.UTF-8', 'SYS_GROUP_SIZE': '1', 'USE_ECCL': '0', 'USE_PYTHON3': '1', 'CLASSPATH': 
'/root/paddlejob/hadoop-client/hadoop/conf:/root/paddlejob/hadoop-client/hadoop:/root/paddlejob/hadoop-client/hadoop/hadoop-2-core.jar:/root/paddlejob/hadoop-client/hadoop/lib/abaci-core-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/ant-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/ant-launcher-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/apache-mime4j-0.6.jar:/root/paddlejob/hadoop-client/hadoop/lib/ark-1.3.25-api.jar:/root/paddlejob/hMon Jul 24 19:59:37 2023[1,0]:adoop-client/hadoop/lib/asm-3.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/asm-tree-3.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/auth-client-1.1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/auth-common-1.1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-compiler-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-ipc-1.6.1-patched.jar:/root/paddlejob/hadoop-client/hadoop/lib/avro-maven-plugin-1.5.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/baas-1.1.9.jar:/root/paddlejob/hadoop-client/hadoop/lib/baidu-rpc-1.0.10.32842.jar:/root/paddlejob/hadoop-client/hadoop/lib/baidu-sos-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/bistreaming-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/bvar-trunk-SNAPSHOT.jar:/root/paddlejob/hadoop-client/hadoop/lib/cglib-nodep-2.2.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/classworlds-1.1-alpha-2.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-beanutils-1.8.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-cli-1.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-codec-1.6.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-collections-3.2.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-configuration-1.9.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-discovery-0.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-el-1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-fileupload-1.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-httpclien
t-3.0.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-io-2.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-lang-2.6.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-lang3-3.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-logging-1.0.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-logging-1.1.1-api.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-logging-api-1.0.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-math-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-net-1.4.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/commons-pool2-2.4.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/com.springsource.org.apache.commons.lang-2.5.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/contiperf-1.06.jar:/root/paddlejob/hadoop-client/hadoop/lib/core-3.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/derby-10.10.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/derbyclient-10.10.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/dom4j-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/easymock-3.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/examples-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/file-management-1.2.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/guava-14.0.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/hadoop-2-common-3.5.32.jar:/root/paddlejob/hadoop-client/hadoop/lib/hadoop-2-raid-2.0.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/hamcrest-core-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/hamcrest-library-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/hsqldb-1.8.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/httpclient-4.0.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/httpcore-4.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/httpmime-4.0.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/jackson-core-asl-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jackson-mapper-asl-1.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jakarta-oro-2.0.8.jar:/root/paddlejob/hadoop-client/hadoop/lib/jasper-compiler-5.5.23
.jar:/root/paddlejob/hadoop-client/hadoop/lib/jasper-runtime-5.5.23.jar:/root/paddlejob/hadoop-client/hadoop/lib/javassist-3.16.1-GA.jar:/root/paddlejob/hadoop-client/hadoop/lib/javax.annotation-1.0.0.v20100513-0750.jar:/root/paddlejob/hadoop-client/hadoop/lib/jaxen-1.1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jets3t-0.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-6.1.14-patched.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-util-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/jna-3.5.2.jar:/root/Mon Jul 24 19:59:37 2023[1,0]:paddlejob/hadoop-client/hadoop/lib/json-20090211.jar:/root/paddlejob/hadoop-client/hadoop/lib/json-simple-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/jsp-api-2.1-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/jsp-api-2.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/junit-3.8.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/kfs-0.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/libthrift-0.9.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/log4j-1.2.17.jar:/root/paddlejob/hadoop-client/hadoop/lib/log4j-api-2.0-beta4.jar:/root/paddlejob/hadoop-client/hadoop/lib/log4j-core-2.0-beta4.jar:/root/paddlejob/hadoop-client/hadoop/lib/lz4-1.3.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-artifact-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-artifact-manager-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-model-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-plugin-api-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-plugin-registry-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-profile-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-project-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-repository-metadata-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-settings-2.0.10.jar:/root/paddlejob/hadoop-client/hadoop/lib/maven-shared-io-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/mocki
to-all-1.10.19.jar:/root/paddlejob/hadoop-client/hadoop/lib/mockito-core-1.9.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/mysql-connector-java-5.1.30.jar:/root/paddlejob/hadoop-client/hadoop/lib/naming-sdk-java-1.0.0.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/netty-3.2.4.Final.jar:/root/paddlejob/hadoop-client/hadoop/lib/netty-3.6.6.Final.jar:/root/paddlejob/hadoop-client/hadoop/lib/objenesis-1.0.jar:/root/paddlejob/hadoop-client/hadoop/lib/oro-2.0.8.jar:/root/paddlejob/hadoop-client/hadoop/lib/paranamer-2.3.jar:/root/paddlejob/hadoop-client/hadoop/lib/pbrpc4j-1.0.10.1-SNAPSHOT.jar:/root/paddlejob/hadoop-client/hadoop/lib/peta-4.1.21.jar:/root/paddlejob/hadoop-client/hadoop/lib/plexus-container-default-1.0-alpha-9-stable-1.jar:/root/paddlejob/hadoop-client/hadoop/lib/plexus-interpolation-1.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/plexus-utils-1.5.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/protobuf-java-2.4.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/reflections-0.9.9-RC1.jar:/root/paddlejob/hadoop-client/hadoop/lib/servlet-api-2.5-6.1.14.jar:/root/paddlejob/hadoop-client/hadoop/lib/servlet-api-2.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/slf4j-api-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/slf4j-log4j12-1.6.1.jar:/root/paddlejob/hadoop-client/hadoop/lib/snappy-java-1.1.2.4.jar:/root/paddlejob/hadoop-client/hadoop/lib/streaming-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/tk-client-2.0.5.jar:/root/paddlejob/hadoop-client/hadoop/lib/tools-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/ustreaming-3.29.2.jar:/root/paddlejob/hadoop-client/hadoop/lib/velocity-1.7.jar:/root/paddlejob/hadoop-client/hadoop/lib/wagon-provider-api-1.0-beta-2.jar:/root/paddlejob/hadoop-client/hadoop/lib/xmlenc-0.52.jar:/root/paddlejob/hadoop-client/hadoop/lib/zookeeper-1.0.10.inf.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/commons-el.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/jasper-compiler.jar:/root/paddlejob/hadoop-client/hado
op/lib/jetty-ext/jasper-runtime.jar:/root/paddlejob/hadoop-client/hadoop/lib/jetty-ext/jsp-api.jar', 'VSCODE_IPC_HOOK_CLI': '/tmp/vscode-ipc-0a2fbfeb-6f25-4c63-8f33-d6f0672f3ffe.sock', 'TRAINERS': '1', 'DISTRIBUTED_TRAINER_ENDPOINTS': '10.127.19.149:35368,10.127.19.149:35369', 'K8S_SERVER': 'http://api-k8s.kongming.baidu-int.com:8180', 'LOG': 'log', 'SYS_PRIVILEGE_AK': '5aed3a0335c4501a9e697fce1af1ca36', 'TRAINER_HOSTS_NUM': '5', 'IS_OUTPUT_AUTO_UPLOAD': '1', 'USE_HOST_PORT_ALLOC': '0', 'TRAINERS_NUM': '1', 'TRAININGJOB_REPLICA_INDEX': '0', 'MPI_SLOTS_NUM': '1', 'TRAININGJOB_REPLICA_RESTARTCOUNT': '0', 'WITH_GPU': 'ON', 'GOPATH': '/root/gopath', 'CGPU0_COMPUTE_LIMIT': '100', 'SCRIPT_UPLOAD_PATH': '/rMon Jul 24 19:59:37 2023[1,0]:0,point_loss:29.8723
Mon Jul 24 19:59:37 2023[1,0]: cross_num: 12.0, cross_loss:0.6826
Mon Jul 24 19:59:37 2023[1,0]: close_l_num: 11.0, close_l_loss:0.6953
Mon Jul 24 19:59:37 2023[1,0]: kf_top1_lane_accuracy:0.7041,kf_top3_lane_accuracy:0.8980
Mon Jul 24 19:59:37 2023[1,0]:numel:39 idx:34 value:-nan
Mon Jul 24 19:59:37 2023[1,0]:numel:39 idx:0 value:0.498221
Mon Jul 24 19:59:37 2023[1,0]:numel:39 idx:1 value:0.024911
Mon Jul 24 19:59:37 2023[1,0]:numel:39 idx:2 value:0.498221
Mon Jul 24 19:59:37 2023[1,0]:numel:39 idx:26 value:-nan
Mon Jul 24 19:59:37 2023[1,0]:numel:39 idx:31 value:-nan
Mon Jul 24 19:59:37 2023[1,0]:In block 0, there has 3,0,36 nan,inf,num
Mon Jul 24 19:59:37 2023[1,0]:Error: /paddle/paddle/fluid/framework/details/nan_inf_utils_detail.cu:105 Assertion :terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
Mon Jul 24 19:59:37 2023[1,0]: what(): (External) CUDA error(719), unspecified launch failure.
Mon Jul 24 19:59:37 2023[1,0]: [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc:80)
Mon Jul 24 19:59:37 2023[1,0]:
Mon Jul 24 19:59:37 2023[1,0]:
Mon Jul 24 19:59:37 2023[1,0]:
Mon Jul 24 19:59:37 2023[1,0]:--------------------------------------
Mon Jul 24 19:59:37 2023[1,0]:C++ Traceback (most recent call last):
Mon Jul 24 19:59:37 2023[1,0]:--------------------------------------
Mon Jul 24 19:59:37 2023[1,0]:0 egr::Backward(std::vector<paddle::experimental::Tensor, std::allocator > const&, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool)
Mon Jul 24 19:59:37 2023[1,0]:1 egr::RunBackward(std::vector<paddle::experimental::Tensor, std::allocator > const&, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool, bool, std::vector<paddle::experimental::Tensor, std::allocator > const&, bool, std::vector<paddle::experimental::Tensor, std::allocator > const&)
Mon Jul 24 19:59:37 2023[1,0]:2 egr::GradNodeAccumulation::operator()(paddle::small_vector<std::vector<paddle::experimental::Tensor, std::allocator >, 15u>&, bool, bool)
Mon Jul 24 19:59:37 2023[1,0]:3 egr::GradNodeAccumulation::ApplyReduceHooks()
Mon Jul 24 19:59:37 2023[1,0]:4 paddle::distributed::EagerReducer::AddDistHook(unsigned long)
Mon Jul 24 19:59:37 2023[1,0]:5 paddle::distributed::EagerReducer::MarkVarReady(unsigned long, bool)
Mon Jul 24 19:59:37 2023[1,0]:6 paddle::distributed::EagerReducer::FinalizeBackward()
Mon Jul 24 19:59:37 2023[1,0]:7 paddle::experimental::Tensor::reset()
Mon Jul 24 19:59:37 2023[1,0]:8 std::_Sp_counted_ptr_inplace<phi::DenseTensor, std::allocator, (gnu_cxx::_Lock_policy)2>::_M_dispose()
Mon Jul 24 19:59:37 2023[1,0]:9 std::_Sp_counted_deleter<phi::Allocation, std::function<void (phi::Allocation)>, std::allocator, ( gnu_cxx::_Lock_policy)2>::_M_dispose()
Mon Jul 24 19:59:37 2023[1,0]:10 paddle::memory::allocation::StatAllocator::FreeImpl(phi::Allocation)
Mon Jul 24 19:59:37 2023[1,0]:11 paddle::memory::allocation::RetryAllocator::FreeImpl(phi::Allocation )
Mon Jul 24 19:59:37 2023[1,0]:
Mon Jul 24 19:59:37 2023[1,0]:----------------------
Mon Jul 24 19:59:37 2023[1,0]:Error Message Summary:
Mon Jul 24 19:59:37 2023[1,0]:----------------------
Mon Jul 24 19:59:37 2023[1,0]:FatalError: : [TimeInfo: Aborted at 1690199976 (unix time) try "date -d @1690199976" if you are using GNU date ]
Mon Jul 24 19:59:37 2023[1,0]: [SignalInfo: SIGABRT (@0x4a22) received by PID 18978 (TID 0x7f1d3be0d640) from PID 18978 ]
Mon Jul 24 19:59:37 2023[1,0]:oot/paddlejob/workspace/upload_tools', 'PMIX_SERVER_TMPDIR': '/tmp/ompi.yq01-sys-hic-k8s-v100-box-a225-0389.yq01.baidu.com.0/pid.18817/0/0', 'IREPO_INIT_MODEL_REVISION': '', 'DEV_DATA_ID': '', 'BROWSER': '/root/code-server/lib/vscode/bin/helpers/browser.sh', 'PADDLE_TRAINING_ROLE': 'TRAINER', 'OMPI_COMM_WORLD_NODE_RANK': '0', 'THIRDPARTY_PATH': '', 'VSCODE_GIT_ASKPASS_NODE': '/root/code-server/lib/node', 'GIT_ASKPASS': '/root/code-server/lib/vscode/extensions/git/dist/askpass.sh', 'HUB_SERVER': 'http://gzbh-aip-paddlehub01.gzbh.baidu.com:8888/paddlehub;http://gzbh-aip-paddlehub02.gzbh.baidu.com:8888/paddlehub', 'K8S_TRAINERS_COUNT': '1', 'SYS_EXP_TOKEN': 'e1adece161b93e74f343771875605a93', 'OMPI_NUM_APP_CTX': '1', 'IREPO_INIT_MODEL_REPONAME': '', 'SYS_RDMA_TCP': '', 'DICT_PATH': '', 'NODE_EXEC_PATH': '/root/code-server/lib/node', 'KUBERNETES_PORT_443_TCP_ADDR': '11.1.0.1', 'MOUNT_AFS': 'true', 'TRAININGJOB_POD_TYPE': 'normal', 'TRAINER_PORTS': '35368,35369', 'SYS_NCCL_CHECK': '0', 'TRAININGJOB_SERVICE': '10.127.19.149', 'UPLOAD_STATUS_AK': 'eb8a2a50d44a5a9e8e77896d1670302c', 'STDERR_LOG_PATH': '/root/paddlejob/workspace/log/err.log', 'TRAININGJOB_NAMESPACE': 'group-1f8cc06f-1968-dbae-e20c-7fa9216c1971', 'NVIDIA_VISIBLE_GPUS_SLOT': '4,5', 'WEBIDE_CLOUD_SERVER_URL': 'http://codelab.baidu-int.com/cloud', 'JOB_HEARTBEAT_INTERVAL': '25', 'NCCL_IB_QPS_PER_CONNECTION': '8', 'KUBERNETES_PORT_443_TCP': 'tcp://11.1.0.1:443', 'PADDLE_EDL_AUTO_CHECKPOINT_FLAG': '1', 'TRAIN_DATA_PATH': '', 'SYS_HYPER_PARAMS': '', 'START_CMD': 'sh submit/start_cmd.sh', 'PADDLE_NUM_GRADIENT_SERVERS': '1', 'FLAGS_cudnn_exhaustive_search': 'True', 'OMPI_MCA_orte_local_daemon_uri': '3340042240.0;tcp://10.127.19.149,192.168.5.1:10114', 'HFI_NO_BACKTRACE': '1', 'INIT_MODEL_PATH': '', 'PMIX_SERVER_URI2': '3340042240.0;tcp4://127.0.0.1:10000', 'DISTILLATION_DATA_PATH': '', 'SYS_NICS': '', 'COLORTERM': 'truecolor', 'POD_IP': '10.127.19.149', 'IREPO_URL': 
'http://irepo.baidu-int.com', 'SYS_BASE_FILEPATH': '/user/paddlecloud/paddle-platform', 'K8S_PSERVERSCOUNT': '0', '': '/usr/local/bin/python3', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/usr/local/lib/python3.7/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/usr/local/lib/python3.7/site-packages/cv2/qt/fonts', 'POD_NAME': 'cvznnn', 'PADDLE_MASTER': '10.127.19.149:41403', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
Mon Jul 24 19:59:37 2023[1,0]:LAUNCH INFO 2023-07-24 19:59:37,828 ------------------------- ERROR LOG DETAIL -------------------------
Mon Jul 24 19:59:37 2023[1,0]:
Mon Jul 24 19:59:38 2023[1,0]:LAUNCH INFO 2023-07-24 19:59:38,429 Exit code -6
false failed. ===ERROR: in [op=gather] [tensor=] find nan or inf===
Process abort signal is detected by the operating system.
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
orterun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[50965,1],0] Exit code: 250