Closed cdavro closed 2 years ago
This means conda failed to find either your glibc or your CUDA driver. To override the detected values, refer to https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html#overriding-detected-packages. Note, however, that you still need both at runtime.
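For reference, a minimal sketch of the override on a login node without a GPU. The version numbers are assumptions taken from this thread (cudatoolkit 10.1, glibc 2.28); substitute whatever the compute nodes actually provide.

```shell
# Assumed versions from this thread; adjust to match the GPU nodes.
export CONDA_OVERRIDE_CUDA="10.1"
export CONDA_OVERRIDE_GLIBC="2.28"

# With the overrides exported, __cuda and __glibc should show up under
# "virtual packages" in the output of `conda info`.
if command -v conda >/dev/null 2>&1; then
    conda info | grep -A 4 "virtual packages" || true
fi
```

The overrides only change what the solver sees during installation; the real driver and glibc still have to be present where the code runs.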
Ok thank you.
1. It was CUDA that was missing; the export CONDA_OVERRIDE_CUDA method worked (I am installing while connected to a login node, not a GPU node). There is no problem with glibc, since 2.28 is installed. I can now install the offline package successfully.
So I assume that in the previous version (2.0.3 with TF 2.5) there were no CUDA checks at all?
2. The online version with
conda create -n deepmd deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org
works but installs: deepmd-kit-2.0.3-py39_1_cuda10.1_gpu, tensorflow-2.7.0-cuda101py39h3452394_0 and libdeepmd-1.2.0-0_cuda10.1_gpu, libtensorflow_cc-2.1.0-gpu_cuda10.1_0
And if I restrict the version (2.0.3):
conda create -n deepmd deepmd-kit=2.0.3=*gpu libdeepmd=2.0.3=*gpu lammps-dp=2.0.2 cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org
Here is my conda info: glibc is present and detected, and for CUDA I get the same conflicts either on a GPU node (present and detected) or with $CONDA_OVERRIDE_CUDA (just 'detected').
Is there something I totally missed?
The following are my errors:
(/scratch/gpfs/yifanl/usr/licensed/anaconda3/2020.7/dpdev3) [yifanl@tigercpu /scratch/gpfs/yifanl/Softwares/deepmd-kit/deepmd-kit-Dec21v1]$ conda install -c deepmodeling libtensorflow_cc==cuda11
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: / Found conflicts! Looking for incompatible packages. This can take several minutes. Press CTRL-C to abort.
failed
ResolvePackageNotFound:
@cdavro I cannot reproduce your conflict. Here's my output:
conda create -n deepmd_tst deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org --dry-run
I ran exactly the same command:
conda create -n deepmd_tst deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org --dry-run
On one cluster:
Same for the other one.
In both, I have strict channel priority and only defaults in my .condarc
If I do
conda create -n deepmd_tst deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org --no-channel-priority --dry-run
I have the same results as you.
I assume you can reproduce my results with
conda create -n deepmd_tst deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org --no-channel-priority --dry-run
(Where you end up with no plumed, libdeepmd 1.2.0, and libtensorflow_cc 2.1.0.)
Also for the CPU version:
conda create -n deepmd_tst deepmd-kit=*=*cpu libdeepmd=*=*cpu lammps-dp -c https://conda.deepmodeling.org --dry-run --strict-channel-priority
vs
conda create -n deepmd_tst deepmd-kit=*=*cpu libdeepmd=*=*cpu lammps-dp -c https://conda.deepmodeling.org --dry-run --no-channel-priority
So strict again gives libdeepmd 1.2.0 + libtensorflow_cc 2.1.0 + no plumed, whereas no channel priority gives tensorflow 2.5 (shouldn't it be 2.7, like the GPU version?)
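To make the strict vs. no-priority comparison easier to repeat, here is a small sketch that runs the same CPU dry run under both flags and keeps only the lines that mention the packages discussed here. The helper name `dry_run_cpu` and the grep filter are my own additions, not part of conda.

```shell
# Hypothetical helper: dry-run the CPU spec with the given channel-priority flag.
# The specs are quoted so the shell does not try to expand the * characters.
dry_run_cpu() {
    conda create -n deepmd_tst 'deepmd-kit=*=*cpu' 'libdeepmd=*=*cpu' lammps-dp \
        -c https://conda.deepmodeling.org --dry-run "$1" 2>&1
}

if command -v conda >/dev/null 2>&1; then
    for flag in --strict-channel-priority --no-channel-priority; do
        echo "== $flag =="
        # Keep only lines naming the packages compared in this thread.
        dry_run_cpu "$flag" | grep -E 'deepmd|tensorflow|plumed' || true
    done
fi
```

Putting the two filtered outputs next to each other should show directly which builds each mode selects.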
Hi,
I was trying to install the GPU version of the deepmd suite on the head node of a computing cluster. After many attempts, I finally installed it successfully.
I had to set the --no-channel-priority flag during the installation. Otherwise, I got the conflict message even after setting:
export CONDA_OVERRIDE_GLIBC=""
export CONDA_OVERRIDE_CUDA=""
Here is my conda info:
active env location : /home/jiankunp/softwares/miniconda3/envs/deepmd
shell level : 3
user config file : /home/jiankunp/.condarc
populated config files : /home/jiankunp/.condarc
conda version : 4.12.0
conda-build version : not installed
python version : 3.9.10.final.0
virtual packages : __linux=3.10.0=0
__unix=0=0
__archspec=1=x86_64
base environment : /home/jiankunp/softwares/miniconda3 (writable)
conda av data dir : /home/jiankunp/softwares/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/jiankunp/softwares/miniconda3/pkgs
/home/jiankunp/.conda/pkgs
envs directories : /home/jiankunp/softwares/miniconda3/envs
/home/jiankunp/.conda/envs
platform : linux-64
user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.10 Linux/3.10.0-1160.59.1.el7.x86_64 centos/7.9.2009 glibc/2.17
UID:GID : 1210:1212
netrc file : /home/jiankunp/.netrc
offline mode : False
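Note that the virtual packages list above contains no __cuda or __glibc entry. As far as I understand the conda documentation, an empty CONDA_OVERRIDE_CUDA is read as "no CUDA detected" rather than "skip the check", which may be why the conflict persisted with empty values. Below is a small sketch for telling apart unset, empty, and populated overrides. The function name `override_status` is hypothetical, and it assumes `printenv` is available and that the variables are exported.

```shell
# Hypothetical helper: report whether an exported override variable is
# unset, set but empty, or set to a value.
override_status() {
    if value=$(printenv "$1"); then
        if [ -z "$value" ]; then
            echo "$1 is set but empty"
        else
            echo "$1=$value"
        fi
    else
        echo "$1 is not set"
    fi
}

override_status CONDA_OVERRIDE_CUDA
override_status CONDA_OVERRIDE_GLIBC
```

If either line reports "set but empty", exporting an actual version number instead may be worth trying before falling back to --no-channel-priority.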
Offline conda installations (CUDA 10.1 or 11.3) have many conflicts since the change from TF 2.5 (pkgs/main) to TF 2.7 (deepmodeling). The same goes for the online ones.
Linux RHEL 8.2 (x64 Intel)
Conda 4.11.0
Deepmd-kit version: 2.0.3
Installation way:
1. Offline packages gpu cuda 10.1 or 11.3:
2. conda create -n deepmd_2.0.3 deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org
or
conda create -n deepmd_2.0.3 deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=11.3 horovod -c https://conda.deepmodeling.org
Same error when using channel_priority: strict. It works when set to flexible but installs TF 2.5 (base from deepmodeling). (The channels in .condarc are only defaults.)
Any CPU install (offline or online) works flawlessly and installs TF 2.7 from deepmodeling.
The old v2.0.3 offline installer (https://github.com/deepmd-kit-recipes/installer/releases), not v2.0.3-1, works (but installs TF 2.5).
conda create -n test cudatoolkit==10.1.243=h6bb024c_0 works
conda create -n test tensorflow==2.7.0=cuda101py39h3452394_0 -c https://conda.deepmodeling.org => no
All seems to boil down to this:
It seems very strange.
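One way to narrow this down might be to print the metadata of that exact tensorflow build and check its dependency list for constraints such as __cuda or __glibc. Here is a sketch using `conda search --info`. The helper name `show_build_info` is hypothetical, and I have not verified what the output contains for this build.

```shell
# Hypothetical helper: print the full metadata of one build from the
# deepmodeling channel.
show_build_info() {
    # $1: a full match spec, quoted by the caller.
    conda search "$1" -c https://conda.deepmodeling.org --info
}

if command -v conda >/dev/null 2>&1; then
    show_build_info 'tensorflow==2.7.0=cuda101py39h3452394_0' || true
fi
```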