DeepRec-AI / DeepRec

DeepRec is a high-performance deep learning framework for recommendation, based on TensorFlow. It is hosted in incubation at the LF AI & Data Foundation.
Apache License 2.0

Multi-machine, multi-GPU SOK core dump #838

Open wangcaihua opened 1 year ago

wangcaihua commented 1 year ago

System information

Describe the current behavior

[1,2]:[n193-019-222:14623] [ 1] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0xb6)[0x7fb6a01cf826]
[1,2]:[n193-019-222:14623] [ 2] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(+0x921e13)[0x7fb6a01c5e13]
[1,2]:[n193-019-222:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb789fe2090]
[1,2]:[n193-019-222:14623] [ 4] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZNSt8__detail9_Map_baseIN4core6DeviceESt4pairIKS2_St10shared_ptrINS1_12IStorageImplEEESaIS8_ENS_10_Select1stESt8equal_toIS2_ESt4hashIS2_ENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_HashtabletraitsILb1ELb0ELb1EEELb1EEixERS4+0x173)[0x7fb5a5025e43]
[1,2]:[n193-019-222:14623] [ 5] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZN4core10BufferImpl7reserveERKNS_5ShapeENS_6DeviceENS_8DataTypeEm+0x313)[0x7fb5a5025143]
[1,2]:[n193-019-222:14623] [ 6] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libembedding.so(_ZN9embedding33UniformModelParallelEmbeddingMetaC1ESt10shared_ptrIN4core19CoreResourceManagerEERKNS_24EmbeddingCollectionParamEm+0x2559)[0x7fb5a3627879]
[1,2]:[n193-019-222:14623] [ 7] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow23EmbeddingCollectionBaseIxxfE11update_metaESt10shared_ptrIN4core19CoreResourceManagerEEiRSt6vectorIiSaIiEE+0x131)[0x7fb5a30162e1]
[1,2]:[n193-019-222:14623] [ 8] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow30LookupForwardEmbeddingVarGPUOpIxxfE7ComputeEPNS_15OpKernelContextE+0x891)[0x7fb5a303d9f1]
[1,2]:[n193-019-222:14623] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xdc)[0x7fb6a1fa3bbc]
[1,2]:[n193-019-222:14623] [10]
[n193-019-222:14623] [ 0]
[1,4]:[n193-019-222:14625] Process received signal

Describe the expected behavior

Code to reproduce the issue

  1. The model we use is modelzoo/deepfm, with no code modifications.
  2. We launch the job with MPI, using the following command:

     mpirun -np 16 --map-by ppr:4:socket -bind-to socket --hostfile ./hostfile --allow-run-as-root --tag-output --report-bindings \
       --mca pml ob1 --mca btl ^openib --mca btl_tcp_if_exclude lo,docker0,bond0 --wdir /home/tiger/deeprec \
       -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO -x NCCL_IB_TIMEOUT=25 -x NCCL_IB_RETRY_CNT=7 -x NCCL_SOCKET_IFNAME=eth0 \
       -x HOROVOD_MPI_THREADS_DISABLE=0 -x TF_GPU_CUPTI_FORCE_CONCURRENT_KERNEL=1 \
       -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES -x NV_LIBCUBLAS_DEV_PACKAGE_NAME -x HTTPS_PROXY -x TOTAL_ORACLES \
       -x NV_LIBCUBLAS_PACKAGE -x GLOG_log_dir -x NV_LIBNCCL_DEV_PACKAGE_VERSION -x YARN_APP_ID -x NM_LABEL \
       -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_POD -x OOM_LISTEN_MODE -x SEC_TOKEN_PATH \
       -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_PORT -x NVIDIA_PRODUCT_NAME -x PRIMUS_AM_RPC_PORT \
       -x NV_LIBCUSPARSE_DEV_VERSION -x NUM_OF_PRIMUS_worker -x YARN_CONTAINER_RUNTIME_DOCKER_IMAGE \
       -x NV_CUDNN_VERSION -x NV_LIBNPP_DEV_VERSION -x CUDA_VERSION -x PATH -x HTTP_PROXY \
       -x NV_LIBNPP_DEV_PACKAGE -x API_SERVER_PORT -x NV_CUDNN_PACKAGE_NAME -x PRIMUS_ROLE_CATEGORY \
       -x YARN_CLASS_ID -x LIBHDFS_OPTS -x ENV_DOCKER_CONTAINER_SECURITY_OPTION -x NV_LIBNCCL_DEV_PACKAGE_NAME \
       -x ENABLE_OOM_LISTENER -x NM_PORT -x API_SERVER_HOST -x NCCL_VERSION -x NM_HTTP_PORT \
       -x NV_LIBNCCL_PACKAGE_VERSION -x YARN_APP_PRIORITY -x YARN_APP_TYPE -x START_STATISTIC_STEP \
       -x NVIDIA_DRIVER_CAPABILITIES -x TZ -x SHUFFLE_DISK_MANAGER_PORT -x YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS \
       -x NM_AUX_SERVICE_mapreduce_shuffle -x SEC_KV_AUTH -x TF_SCRIPT -x CLASSPATH -x LOCAL_DIRS \
       -x HADOOP_YARN_HOME -x NV_LIBCUBLAS_DEV_VERSION -x HADOOP_CONF_DIR -x NO_PROXY -x LIBRARY_PATH \
       -x NV_LIBNPP_PACKAGE -x PRIMUS_EXECUTOR_UNIQUE_ID -x PRIMUS_AM_RPC_HOST -x NV_NVPROF_DEV_PACKAGE \
       -x NV_NVML_DEV_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_MEMORY_MB \
       -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_BASE -x NV_CUDA_LIB_VERSION -x RUNTIME_IDC_NAME \
       -x TF_CONFIG -x YARN_APP_TAGS -x NV_LIBCUBLAS_DEV_PACKAGE -x LC_CTYPE -x NVARCH \
       -x NV_CUDA_CUDART_DEV_VERSION -x NLSPATH -x ENV_DOCKER_CONTAINER_SHM_SIZE -x SHLVL -x TF_WORKSPACE \
       -x JEMALLOC_PATH -x XFILESEARCHPATH -x SPARK_3_SHUFFLE_SERVICE_PORT -x NV_LIBCUBLAS_PACKAGE_NAME \
       -x NM_HOST -x PRIMUS_SUBMIT_TIMESTAMP -x STOP_STATISTIC_STEP -x PYTHONPATH -x NV_LIBNCCL_PACKAGE_NAME \
       -x YARN_QUEUE_ID -x ENV_DOCKER_CONTAINER_DEVICE -x ROLES_LIST -x YARN_USER -x LOAD_SERVICE_PSM \
       -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_GPU -x PRIMUS_EXECUTOR_UNIQID -x NV_NVPROF_VERSION \
       -x JAVA_HOME -x NVIDIA_REQUIRE_CUDA -x YARN_CONTAINER_RUNTIME_TYPE -x SPARK_SHUFFLE_SERVICE_PORT \
       -x ENV_DOCKER_CONTAINER_CAP_ADD -x MALLOC_ARENA_MAX -x SSD_MANAGER_PORT -x YARN_QUEUE_NAME \
       -x NV_NVTX_VERSION -x YODEL_MODE -x NV_CUDA_CUDART_VERSION -x BYTED_HOST_IPV6 -x NV_CUDA_COMPAT_PACKAGE \
       -x LD_LIBRARY_PATH -x HADOOP_TOKEN_FILE_LOCATION -x LOG_DIRS -x APPLICATION_ID -x HOME \
       -x NV_LIBCUSPARSE_VERSION -x HADOOP_COMMON_HOME -x HADOOP_HDFS_HOME -x OLDPWD -x NV_LIBNCCL_PACKAGE \
       -x MEM_USAGE_STRATEGY -x PWD -x NV_LIBCUBLAS_VERSION -x ENV_DOCKER_CONTAINER_ULIMIT -x LOGNAME \
       -x NV_CUDNN_PACKAGE -x PRIMUS_STAGING_DIR -x NV_LIBNCCL_DEV_PACKAGE -x NVIDIA_VISIBLE_DEVICES \
       -x NV_LIBNPP_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES_MILLI -x HADOOP_HOME \
       -x CORE_DUMP_PROC_NAME -x NV_CUDNN_PACKAGE_DEV -x USER \
       python3 train.py --output_dir=hdfs://harunava/user/xxx/deeprec_v10 --data_location=hdfs://harunava/user/xxx/criteo_small \
       --protocol=grpc --smartstaged=false --batch_size=2048 --steps=30000 --ev=true --ev_elimination=l2 --ev_filter=counter \
       --op_fusion=true --input_layer_partitioner=0 --dense_layer_partitioner=16 --group_embedding=collective \
       --workqueue=true --parquet_dataset=false

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
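
For readers triaging the backtrace under "Describe the current behavior", the following is a minimal sketch of how the tagged mpirun output could be split into per-rank frames. It is not part of the original report. The function name `parse_frames`, the regular expression, and the dictionary keys are all hypothetical, and the pattern assumes every frame has the shape `[job,rank]:[host:pid] [ n] /path/lib.so(symbol+offset)[address]` seen in the log above; lines that do not fit (such as the truncated `[10]` frame) are simply skipped.

```python
import re

# One backtrace frame in the shape seen in the report, for example:
# "[1,2]:[n193-019-222:14623] [ 5] /path/libcore.so(_ZN4core...+0x313)[0x7fb5a5025143]"
# This pattern is an assumption derived from that log, not an Open MPI specification.
FRAME_RE = re.compile(
    r"\[(?P<job>\d+),(?P<rank>\d+)\][^:\[]*:"   # [job,rank] tag added by --tag-output
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s*"    # [hostname:pid]
    r"\[\s*(?P<index>\d+)\]\s*"                 # [ frame index]
    r"(?P<module>/[^\s(]+)"                     # path of the shared object
    r"\((?P<symbol>[^)]*)\)"                    # symbol+offset (symbol may be empty)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"          # return address
)


def parse_frames(log_text):
    """Return a list of dicts, one per frame that matches FRAME_RE."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        raw = match.group("symbol")
        # Split "name+0x1f" at the last '+'; keep the raw text if there is none.
        name, sep, offset = raw.rpartition("+")
        if not sep:
            name, offset = raw, ""
        frames.append({
            "rank": int(match.group("rank")),
            "host": match.group("host"),
            "pid": int(match.group("pid")),
            "index": int(match.group("index")),
            "module": match.group("module").rsplit("/", 1)[-1],
            "symbol": name,
            "offset": offset,
            "address": match.group("address"),
        })
    return frames


# Two frames copied from the report, still run together on one line.
SAMPLE = (
    "[1,2]:[n193-019-222:14623] [ 1] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/"
    "libjvm.so(JVM_handle_linux_signal+0xb6)[0x7fb6a01cf826] "
    "[1,2]:[n193-019-222:14623] [ 2] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/"
    "libjvm.so(+0x921e13)[0x7fb6a01c5e13]"
)

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["rank"], frame["index"], frame["module"], frame["symbol"] or "<no symbol>")
```

Grouping the resulting frames by `rank` and `pid` separates the interleaved output of different processes, and the `symbol` field can then be handed to a demangler such as `c++filt`.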

liutongxuan commented 1 year ago

@Mesilenceki @shijieliu

wangcaihua commented 1 year ago

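This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.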