deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

The unexpected error CUDA_ERROR_INVALID_DEVICE: invalid device ordinal from the `dp compress` command. #3764

Status: Open. robinzyb opened this issue 6 months ago

robinzyb commented 6 months ago

Summary

The unexpected error CUDA_ERROR_INVALID_DEVICE: invalid device ordinal is raised by the dp compress command.

DeePMD-kit Version

2.2.9

Backend and its version

Tensorflow v2.9.0

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

GPU P100

Details

The error appears after running the dp compress command; the stdout content is pasted below. Note that during the training and freezing steps, the GPU device is detected without errors.

/scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks/000 /scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_utils/utils/compat.py:362: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD INFO    training data with min nbor dist: 0.8220611747649927
DEEPMD INFO    training data with max nbor size: [14 77 54 15]
DEEPMD INFO     _____               _____   __  __  _____           _     _  _   
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |  
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_ 
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_ 
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
DEEPMD INFO    installed to:         /users/zyongbin/miniconda3/envs/deepmd
DEEPMD INFO    source :              v2.2.9
DEEPMD INFO    source brach:         HEAD
DEEPMD INFO    source commit:        be437483
DEEPMD INFO    source commit at:     2024-02-04 13:44:12 +0800
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build variant:        cuda
DEEPMD INFO    build with tf inc:    /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/include/;/users/zyongbin/miniconda3/envs/deepmd/include
DEEPMD INFO    build with tf lib:    
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           nid03448
DEEPMD INFO    computing device:     gpu:0
DEEPMD INFO    CUDA_VISIBLE_DEVICES: 0
DEEPMD INFO    Count of visible GPU: 1
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
DEEPMD INFO    training without frame parameter
DEEPMD INFO    data stating... (this step may take long time)
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 2000, decay_rate 0.944061, final lr will be 1.00e-08
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py:1198: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py:1198: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

DEEPMD INFO    batch     100 training time 6.68 s, testing time 0.01 s, total wall time 9.19 s
DEEPMD INFO    batch     200 training time 3.75 s, testing time 0.01 s, total wall time 3.79 s
DEEPMD INFO    batch     300 training time 3.80 s, testing time 0.01 s, total wall time 3.83 s
DEEPMD INFO    batch     400 training time 3.78 s, testing time 0.01 s, total wall time 3.82 s
DEEPMD INFO    batch     500 training time 3.79 s, testing time 0.01 s, total wall time 3.82 s
DEEPMD INFO    batch     600 training time 3.76 s, testing time 0.01 s, total wall time 3.80 s
DEEPMD INFO    batch     700 training time 3.78 s, testing time 0.01 s, total wall time 3.81 s
DEEPMD INFO    batch     800 training time 3.77 s, testing time 0.01 s, total wall time 3.81 s
DEEPMD INFO    batch     900 training time 3.77 s, testing time 0.01 s, total wall time 3.81 s
DEEPMD INFO    batch    1000 training time 3.77 s, testing time 0.01 s, total wall time 3.81 s
----
(intermediate batch log lines deleted manually for clarity)
----
DEEPMD INFO    saved checkpoint model.ckpt
DEEPMD INFO    average training time: 0.0378 s/batch (exclude first 100 batches)
DEEPMD INFO    finished training
DEEPMD INFO    wall time: 15699.768 s
/scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks
/scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks/000 /scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD WARNING The following nodes are not in the graph: {'fitting_attr/aparam_nall', 'spin_attr/ntypes_spin'}. Skip freezeing these nodes. You may be freezing a checkpoint generated by an old version.
DEEPMD INFO    The following nodes will be frozen: ['train_attr/min_nbor_dist', 'o_virial', 'o_atom_energy', 'model_attr/tmap', 'descrpt_attr/rcut', 't_mesh', 'o_energy', 'o_atom_virial', 'train_attr/training_script', 'o_force', 'fitting_attr/daparam', 'model_attr/model_version', 'model_type', 'descrpt_attr/ntypes', 'fitting_attr/dfparam', 'model_attr/model_type']
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py:370: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.convert_variables_to_constants`
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py:370: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.convert_variables_to_constants`
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/convert_to_constants.py:925: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/convert_to_constants.py:925: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
DEEPMD INFO    3408 ops in the final graph.
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD INFO    

DEEPMD INFO    stage 1: compress the model
DEEPMD WARNING Switch to serial execution due to lack of horovod module.
Traceback (most recent call last):
  File "/users/zyongbin/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_utils/main.py", line 656, in main
    deepmd_main(args)
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 82, in main
    compress(**dict_args)
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/compress.py", line 150, in compress
    train(
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 119, in train
    run_opt = RunOptions(
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/run_options.py", line 120, in __init__
    self._try_init_distrib()
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/run_options.py", line 220, in _try_init_distrib
    self._init_serial()
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/run_options.py", line 255, in _init_serial
    nodename, _, gpus = get_resource()
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/cluster/__init__.py", line 26, in get_resource
    return get_slurm_res()
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/cluster/slurm.py", line 58, in get_resource
    gpus = local.get_gpus()
  File "/users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/cluster/local.py", line 46, in get_gpus
    raise RuntimeError("Failed to detect availbe GPUs due to:\n%s" % decoded)
RuntimeError: Failed to detect availbe GPUs due to:
2024-05-09 21:50:40.245869: W tensorflow/stream_executor/cuda/cuda_driver.cc:374] A non-primary context 0x43f64c0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0x4047470. We haven't verified StreamExecutor works with that.
2024-05-09 21:50:40.245972: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
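The traceback shows the failure path: `RunOptions._init_serial` → `get_resource` → `deepmd.cluster.local.get_gpus`, which spawns a fresh Python interpreter to count CUDA devices and raises when that subprocess exits non-zero. A minimal sketch of that probe pattern follows; `probe_gpus` is a hypothetical simplification (the real probe imports TensorFlow and queries its device list), shown only to illustrate where the error string originates:

```python
import subprocess
import sys


def probe_gpus(probe_code: str = "print(0)"):
    """Count GPUs in a fresh interpreter (hypothetical stand-in for
    deepmd.cluster.local.get_gpus, whose probe imports TensorFlow)."""
    proc = subprocess.run(
        [sys.executable, "-c", probe_code],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # A failing probe (e.g. CUDA context creation under MPS) surfaces
        # here, matching the "Failed to detect ... GPUs" RuntimeError above.
        raise RuntimeError(
            "Failed to detect available GPUs due to:\n%s" % proc.stderr
        )
    count = int(proc.stdout.strip())
    return list(range(count)) if count > 0 else None
```

The key point is that `dp compress` re-runs this probe in a new subprocess after training has already touched the device, so a context conflict (e.g. with `CRAY_CUDA_MPS=1`) can make the probe fail even though training itself saw the GPU.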

Batch Job Summary Report (version 21.01.1) for Job "dpmd" (53342807) on daint

Job information (1/3)
-----------------------------------------------------------------------------------------------------
             Submit            Eligible               Start                 End    Elapsed Time limit
------------------- ------------------- ------------------- ------------------- ---------- ----------
2024-05-09T17:24:59 2024-05-09T17:25:00 2024-05-09T17:26:49 2024-05-09T21:50:42   04:23:53 1-00:00:00
-----------------------------------------------------------------------------------------------------

Job information (2/3)
-------------------------------------------------------------
    Username      Account    Partition   NNodes        Energy
------------ ------------ ------------ -------- -------------
    zyongbin        s1123       normal        1   2601.007 kJ

Job information (3/3) - GPU utilization data
----------------------------------------------------
   Node name       Usage      Max mem Execution time
------------ ----------- ------------ --------------
    nid03448        80 %     1497 MiB       04:21:53
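For context, the training log above reports `CUDA_VISIBLE_DEVICES: 0` and `Count of visible GPU: 1`, which matches how the CUDA runtime interprets that variable: unset means no restriction, an empty string hides every device. A small sketch with a hypothetical helper name (real device lists may also contain GPU UUIDs, which this simplification ignores):

```python
import os


def visible_gpu_ids(env=None):
    """Hypothetical helper: interpret CUDA_VISIBLE_DEVICES as the CUDA
    runtime does -- unset means all devices, empty string means none."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # no restriction: every device is visible
    return [int(tok) for tok in raw.split(",") if tok.strip() != ""]
```

With the job script's `export CUDA_VISIBLE_DEVICES=0`, this yields `[0]`: exactly one visible device with ordinal 0. That the ordinal exists suggests the `invalid device ordinal` failure comes from context initialization rather than from a missing device.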
robinzyb commented 6 months ago

Contents of the Slurm batch script:

#!/bin/bash
#!/bin/bash -l
#SBATCH --job-name="dpmd"
#SBATCH --account="s1123"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=yongbin.zhuang@epfl.ch
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=1
#SBATCH --partition=normal
#SBATCH --constraint=gpu                                              

set -e                                                                      
module load daint-gpu
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CRAY_CUDA_MPS=1
export TF_INTRA_OP_PARALLELISM_THREADS=1
export TF_INTER_OP_PARALLELISM_THREADS=1
export CUDA_VISIBLE_DEVICES=0
ulimit -s unlimited
source ~/.bashrc
conda activate deepmd                                   
set +e                                                                      

pushd /scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks/001 || exit 1
if [ -f dp-train.checkpoint ]; then echo 'hit dp-train.checkpoint, skip'; else
################################################################################
if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi
################################################################################
__EXITCODE__=$?; if [ $__EXITCODE__ -ne 0 ]; then exit $__EXITCODE__; fi
touch dp-train.checkpoint; fi  # create checkpoint on success
popd

pushd /scratch/snx3000/zyongbin/05.CLL_v5/bivo4-metad/iters-001/train-deepmd/tasks/001 || exit 1
################################################################################
dp freeze -o original_model.pb && dp compress -i original_model.pb -o frozen_model.pb
################################################################################
__EXITCODE__=$?; if [ $__EXITCODE__ -ne 0 ]; then exit $__EXITCODE__; fi
popd

echo $SLURM_JOB_ID > job-8kYKJRN1URkE4OikNSu9Y20zWx.sbatch.success

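The checkpoint-guard pattern the script uses (run a step once, record success in a marker file, skip it on resubmission) can be sketched as a small reusable function; `step` is a hypothetical name, not part of the generated script:

```shell
#!/bin/sh
# Sketch of the checkpoint-guard pattern from the script above: each step
# runs at most once per directory; a marker file records success.
workdir=$(mktemp -d)   # keep marker files out of the current directory
cd "$workdir" || exit 1

step() {
    name=$1; shift
    if [ -f "$name.checkpoint" ]; then
        echo "hit $name.checkpoint, skip"
        return 0
    fi
    "$@"
    code=$?
    if [ "$code" -ne 0 ]; then exit "$code"; fi  # propagate failures
    touch "$name.checkpoint"                     # marker on success
}

step demo-train echo "training step ran"
step demo-train echo "training step ran"   # second call is skipped
```

This makes a resubmitted job (after a time limit or crash) resume from the first incomplete step instead of retraining from scratch.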
njzjz commented 6 months ago

I see some similar issues: