Question at the step of generating simulated AFM images

HuangJiaLian commented 1 year ago

Hi, @NikoOinonen

I attempted to execute this code and followed through with the README instructions. Setting up the graph-afm conda environment and performing the build step went smoothly without any issues. However, I encountered a problem when generating the simulated AFM images.

Here is what I did on the HPC cluster:

(base) [huangj4@login3 Graph-AFM]$ conda activate graph-afm
(graph-afm) [huangj4@login3 Graph-AFM]$ ls
build.sh  environment_exact.yml  model_schem.png     ProbeParticleModel  scripts
data      environment.yml        pretrained_weights  README.md           src
(graph-afm) [huangj4@login3 Graph-AFM]$ cd scripts/
(graph-afm) [huangj4@login3 scripts]$ ls
generate_data.py     predict_random.py  train_distributed.py
predict_examples.py  test.py            train.py
(graph-afm) [huangj4@login3 scripts]$ python generate_data.py
 PACKAGE_PATH =  /home/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 CPP_PATH     =  /home/huangj4/Github/Graph-AFM/ProbeParticleModel/cpp
No CUDA runtime is found, using CUDA_HOME='/home/huangj4/.conda/envs/graph-afm'
OCLEnvironment platform[0]  PACKAGE_PATH:  /home/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
Traceback (most recent call last):
  File "generate_data.py", line 44, in <module>
    env = oclu.OCLEnvironment( i_platform = 0 )
  File "/home/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/oclUtils.py", line 20, in __init__
    platforms         = cl.get_platforms()
pyopencl._cl.LogicError: clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR

The output says "No CUDA runtime is found", so I loaded the cuda module, but it seems to have no effect:

(graph-afm) [huangj4@login3 scripts]$ module load cuda
(graph-afm) [huangj4@login3 scripts]$ python generate_data.py
 PACKAGE_PATH =  /home/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 CPP_PATH     =  /home/huangj4/Github/Graph-AFM/ProbeParticleModel/cpp
No CUDA runtime is found, using CUDA_HOME='/share/apps/spack/envs/fgci-centos7-haswell/software/cuda/12.0.0/32o5w4t'
OCLEnvironment platform[0]  PACKAGE_PATH:  /home/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
Traceback (most recent call last):
  File "generate_data.py", line 44, in <module>
    env = oclu.OCLEnvironment( i_platform = 0 )
  File "/home/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/oclUtils.py", line 20, in __init__
    platforms         = cl.get_platforms()
pyopencl._cl.LogicError: clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR

Could you please give me some advice on how to deal with this problem? Thank you.

Jie

NikoOinonen commented 1 year ago

The error seems to indicate that you don't have an OpenCL device driver installed (the simulation code is written in OpenCL). If you have an OpenCL-capable device (a GPU) on the system, then the ocl-icd-system package in conda should detect the driver automatically.

But sometimes that does not work, for example on some supercomputers, in which case I have just manually copied the driver file into the conda environment (run this while the conda environment is activated):

cp /etc/OpenCL/vendors/nvidia.icd $CONDA_PREFIX'/etc/OpenCL/vendors/'

That's for Nvidia. Depending on what you have on your system, you might have to check what other *.icd files are present in /etc/OpenCL/vendors/.
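
To see which *.icd files are visible in each location, a small check like the following may help. It is only a sketch: the function names `list_icd_files`, `default_vendor_dirs`, and `print_opencl_report` are made up for illustration, and it assumes the two vendor directories mentioned above are the relevant ones on your system.

```python
import os
from pathlib import Path


def list_icd_files(vendor_dirs):
    """Return {directory: sorted .icd file names} for each directory that exists."""
    found = {}
    for directory in vendor_dirs:
        path = Path(directory)
        if path.is_dir():
            found[str(path)] = sorted(f.name for f in path.glob("*.icd"))
    return found


def default_vendor_dirs():
    """System vendor directory plus the one inside the active conda environment."""
    dirs = ["/etc/OpenCL/vendors"]
    prefix = os.environ.get("CONDA_PREFIX")  # set while a conda env is activated
    if prefix:
        dirs.append(os.path.join(prefix, "etc", "OpenCL", "vendors"))
    return dirs


def print_opencl_report():
    """Print the .icd files found and, if possible, the platforms pyopencl sees."""
    for directory, names in list_icd_files(default_vendor_dirs()).items():
        print(directory, names)
    try:
        import pyopencl as cl  # imported lazily so the file listing works without it

        for platform in cl.get_platforms():
            print(platform.name, [dev.name for dev in platform.get_devices()])
    except Exception as exc:  # e.g. PLATFORM_NOT_FOUND_KHR when no driver is registered
        print("OpenCL platform query failed:", exc)
```

Calling `print_opencl_report()` with the graph-afm environment activated prints the .icd files in both directories, followed by either the detected platforms or the error raised by `cl.get_platforms()`.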

HuangJiaLian commented 1 year ago

Thank you for the explanation. I will try to learn more about how to use GPUs on the Triton HPC cluster and come back to this problem later.

HuangJiaLian commented 1 year ago

I logged in to a Tesla V100 GPU node, and some other errors occurred.

This is what I did before generating the data:

(base) [huangj4@login3 huangj4]$ srun -p interactive --gres=gpu:1 --constraint=volta --time=2:00:00 --mem=6000M --pty bash
(base) [huangj4@gpu32 Graph-AFM]$ conda activate graph-afm
(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 Graph-AFM]$ cd scripts/
(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 scripts]$ ls
generate_data.py     predict_random.py   slurm-17989007.out  test.py               train.py
predict_examples.py  slurm-17988316.out  submit.sh           train_distributed.py

Errors:

(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 scripts]$ python generate_data.py
 PACKAGE_PATH =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 CPP_PATH     =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cpp
OCLEnvironment platform[0]  PACKAGE_PATH:  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 i_platform  0
3 errors generated.
Traceback (most recent call last):
  File "generate_data.py", line 46, in <module>
    oclr.init(env)
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/RelaxOpenCL.py", line 52, in init
    cl_program  = env.loadProgram(env.CL_PATH+"/relax.cl")
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/oclUtils.py", line 30, in loadProgram
    program = cl.Program(self.ctx, fstr ).build()
  File "/scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/__init__.py", line 534, in build
    self._prg, was_cached = self._build_and_catch_errors(
  File "/scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/__init__.py", line 582, in _build_and_catch_errors
    raise err
pyopencl._cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE

Build on <pyopencl.Device 'Tesla V100-PCIE-32GB' on 'NVIDIA CUDA' at 0x557be9ed4a10>:

<kernel>:377:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:377:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~
<kernel>:404:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:404:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~
<kernel>:481:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:481:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~

(options: -I /scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/cl -I/scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cl -I/scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cl)
(source saved as /tmp/tmpejtowm75.cl)

Then I tried copying the driver file into my conda environment:

(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 scripts]$ cp /etc/OpenCL/vendors/nvidia.icd $CONDA_PREFIX'/etc/OpenCL/vendors/'
(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 scripts]$ ls $CONDA_PREFIX'/etc/OpenCL/vendors/'
nvidia.icd  ocl-icd-system

The problem persists:

(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 scripts]$ python generate_data.py
 PACKAGE_PATH =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 CPP_PATH     =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cpp
OCLEnvironment platform[0]  PACKAGE_PATH:  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 i_platform  0
3 errors generated.
Traceback (most recent call last):
  File "generate_data.py", line 46, in <module>
    oclr.init(env)
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/RelaxOpenCL.py", line 52, in init
    cl_program  = env.loadProgram(env.CL_PATH+"/relax.cl")
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/oclUtils.py", line 30, in loadProgram
    program = cl.Program(self.ctx, fstr ).build()
  File "/scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/__init__.py", line 534, in build
    self._prg, was_cached = self._build_and_catch_errors(
  File "/scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/__init__.py", line 582, in _build_and_catch_errors
    raise err
pyopencl._cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE

Build on <pyopencl.Device 'Tesla V100-PCIE-32GB' on 'NVIDIA CUDA' at 0x563a0dcd0fb0>:

<kernel>:377:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:377:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~
<kernel>:404:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:404:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~
<kernel>:481:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:481:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~

(options: -I /scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/cl -I/scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cl -I/scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cl)
(source saved as /tmp/tmpxxiyd91v.cl)

I don't think I need to change the code itself. The problem is probably that some libraries are the wrong version, so I changed pyopencl to a 2021 release using

pip install pyopencl==2021.2.6

and loaded the CUDA version listed in environment.yml

module load cuda/11.3.1

(/scratch/work/huangj4/.conda_envs/graph-afm) [huangj4@gpu32 scripts]$ python generate_data.py
 PACKAGE_PATH =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 CPP_PATH     =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cpp
OCLEnvironment platform[0]  PACKAGE_PATH:  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 i_platform  0
3 errors generated.
Traceback (most recent call last):
  File "generate_data.py", line 46, in <module>
    oclr.init(env)
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/RelaxOpenCL.py", line 52, in init
    cl_program  = env.loadProgram(env.CL_PATH+"/relax.cl")
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/oclUtils.py", line 30, in loadProgram
    program = cl.Program(self.ctx, fstr ).build()
  File "/scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/__init__.py", line 536, in build
    self._prg, was_cached = self._build_and_catch_errors(
  File "/scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/__init__.py", line 584, in _build_and_catch_errors
    raise err
pyopencl._cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE

Build on <pyopencl.Device 'Tesla V100-PCIE-32GB' on 'NVIDIA CUDA' at 0x564d002e0e50>:

<kernel>:377:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:377:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~
<kernel>:404:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:404:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~
<kernel>:481:42: error: cannot assign to variable 'dpos0_' with const-qualified type 'const float4' (vector of 4 'float' values)
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
                               ~~~~~~~~~~^
<kernel>:481:18: note: variable 'dpos0_' declared const here
    const float4 dpos0_=dpos0; dpos0_.xyz= rotMatT( dpos0_.xyz , tipA.xyz, tipB.xyz, tipC.xyz );
    ~~~~~~~~~~~~~^~~~~~~~~~~~

(options: -I /scratch/work/huangj4/.conda_envs/graph-afm/lib/python3.8/site-packages/pyopencl/cl -I/scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cl -I/scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cl)
(source saved as /tmp/tmpmrxsylw4.cl)

However, the errors remain. Did I do some step wrong? Thank you @NikoOinonen

NikoOinonen commented 1 year ago

This didn't use to be a problem with Nvidia devices, but I ran into it before on some other platforms and already fixed it in https://github.com/Probe-Particle/ppafm/commit/99c152328808989f7a1f6206159b0d28cb03c17a. However, that commit is later than the one the README points to. Fortunately, there do not seem to be any changes between those commits that would affect this repo, so it should work if you replace the ProbeParticleModel version with

git clone https://github.com/ProkopHapala/ProbeParticleModel.git
cd ProbeParticleModel
git checkout 99c152328808989f7a1f6206159b0d28cb03c17a
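
The compiler output above shows the cause: `dpos0_` is declared `const float4` and then assigned to on the same line. To confirm whether a given copy of relax.cl still contains that pattern, a quick scan along these lines could be used. This is only a sketch; `find_const_reassignments` is a hypothetical helper, and the regular expression only covers the single-line form seen in the error log.

```python
import re

# Matches a declaration such as `const float4 v = ...;` followed on the same
# line by an assignment to `v` or one of its components (e.g. `v.xyz = ...`).
_CONST_REASSIGN = re.compile(
    r"const\s+float4\s+(\w+)\s*=[^;]*;\s*\1(?:\.\w+)?\s*=(?!=)"
)


def find_const_reassignments(source):
    """Return (line_number, variable_name) pairs, 1-based, for offending lines."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = _CONST_REASSIGN.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

For example, assuming the kernel lives at ProbeParticleModel/cl/relax.cl (as the `-I` options in the log suggest), `find_const_reassignments(open("ProbeParticleModel/cl/relax.cl").read())` should return an empty list once a version without the pattern is checked out.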

HuangJiaLian commented 1 year ago

@NikoOinonen

After running the above commands and changing the numpy entry in environment.yml from numpy to numpy=1.21.4, I recreated the graph-afm conda environment and the problem was solved. 🎉

Without pinning the numpy version, this error occurs:

 PACKAGE_PATH =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 CPP_PATH     =  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/cpp
OCLEnvironment platform[0]  PACKAGE_PATH:  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle
 i_platform  0
loadSpecies from :  /scratch/work/huangj4/Github/Graph-AFM/ProbeParticleModel/pyProbeParticle/defaults/atomtypes.ini
Traceback (most recent call last):
  File "generate_data.py", line 72, in <module>
    afmulator = AFMulator(**afmulator_args)
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/AFMulatorOCL_Simple.py", line 79, in __init__
    self.typeParams = hl.loadSpecies('atomtypes.ini')
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/HighLevelOCL.py", line 46, in loadSpecies
    return PPU.loadSpeciesLines( str_Species.split('\n') )
  File "/scratch/work/huangj4/Github/Graph-AFM/scripts/../ProbeParticleModel/pyProbeParticle/common.py", line 235, in loadSpeciesLines
    return np.array( params, dtype=[('rmin',np.float64),('epsilon',np.float64),('alpha',np.float64),('atom',np.int),('symbol', '|S10')])
  File "/scratch/work/huangj4/.conda_envs/graph-afm-test/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
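
The `np.int` alias was deprecated in NumPy 1.20, as the message says, and removed in NumPy 1.24, which is why pinning numpy=1.21.4 avoids the error. A small check like the one below can tell whether an installed version is affected. It is a sketch only; `np_int_alias_removed` is a name chosen here for illustration.

```python
import re


def np_int_alias_removed(version):
    """Return True if this NumPy version no longer provides the `np.int` alias.

    The alias was removed in NumPy 1.24, so any version at or above 1.24
    raises AttributeError on `np.int`.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 24)
```

With numpy installed, `np_int_alias_removed(numpy.__version__)` reports whether the environment would hit this error. The alternative to pinning, as the error message itself suggests, is to replace `np.int` with the builtin `int` in the dtype list in common.py.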
