deepqmc / deepqmc

Deep learning quantum Monte Carlo for electrons in real space
MIT License

Some issues when installing and running deepQMC 1.1.0 #175

Closed xiazhuozhao closed 1 year ago

xiazhuozhao commented 1 year ago

I'm glad to see you have released deepQMC v1.1.0, which resolves the dependency issues I had installing v1.0.1. However, I'm still running into some errors that I'm hoping you can help with.

System environment

- CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
- GPU: NVIDIA A800
- CUDA: 11.6
- Linux version 3.10.0-1160.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)) #1 SMP Mon Oct 19 16:18:59 UTC 2020
- Run directly on a compute node, without using Slurm.

What did I do?

  1. I created and activated a new conda environment with Python 3.9: `conda create --name deepqmc python==3.9` followed by `conda activate deepqmc`
  2. I installed deepqmc using `pip install -U deepqmc`
  3. I enabled GPU support for JAX using `pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html`
  4. I tested deepqmc using the following command-line instructions:
    deepqmc hydra.run.dir=workdir
    deepqmc task=evaluate task.restdir=workdir

What's wrong?

See this:

    
    (deepqmc) [xiazhuozhao@c2 work]$ deepqmc hydra.run.dir=workdir
    WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
    I0000 00:00:1695996296.146525   50719 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
    2023-09-29 22:04:59.230629: W external/xla/xla/service/gpu/nvptx_compiler.cc:708] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
    [22:05:02.396] INFO:deepqmc.app: Entering application
    [22:05:02.397] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
    [22:05:02.397] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-230929/work/workdir
    [22:05:53.240] INFO:deepqmc.train: Number of model parameters: 197589
    [22:05:54.071] INFO:deepqmc.train: Pretraining wrt. baseline wave function
    [22:05:55.188] INFO:deepqmc.wf.baseline.pyscfext: Running HF...                      
    [22:05:56.333] INFO:deepqmc.wf.baseline.pyscfext: HF energy: -7.951971538662545      
    pretrain: 100%|████████████████████| 5000/5000 [08:37<00:00,  9.66it/s, MSE=7.66e-06]
    [22:14:32.157] INFO:deepqmc.train: Pretraining completed with MSE = 7.66e-06
    [22:14:36.189] INFO:deepqmc.train: Equilibrating sampler...
    equilibrate sampler:   9%|█▏            | 86/1000 [00:11<02:04,  7.36it/s, tau=0.171]
    [22:14:47.918] INFO:deepqmc.train: Start training
    [22:15:58.579] INFO:deepqmc.train: Progress: 1/1000, energy = -7.8(1.0)              
    [22:16:01.040] INFO:deepqmc.train: Progress: 2/1000, energy = -7.816(4)              
    [22:16:16.958] INFO:deepqmc.train: Progress: 68/1000, energy = -8.0596(21)           
    [22:17:02.515] INFO:deepqmc.train: Progress: 264/1000, energy = -8.0696(10)          
    [22:18:40.220] INFO:deepqmc.train: Progress: 674/1000, energy = -8.0694(5)           
    training: 100%|████████████████████| 1000/1000 [05:10<00:00,  3.22it/s, E=-8.0690(4)]
    [22:19:58.705] INFO:deepqmc.train: The training has been completed!
    I0000 00:00:1695997199.795904   50719 tfrt_cpu_pjrt_client.cc:352] TfrtCpuClient destroyed.

    (deepqmc) [xiazhuozhao@c2 work]$ deepqmc task=evaluate task.restdir=workdir
    WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
    I0000 00:00:1695997281.783160   16715 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
    2023-09-29 22:21:25.236598: W external/xla/xla/service/gpu/nvptx_compiler.cc:708] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
    [22:21:27.980] INFO:deepqmc.app: Entering application
    [22:21:27.982] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
    [22:21:27.982] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-230929/work/outputs/2023-09-29/22-21-27
    [22:21:28.069] INFO:deepqmc.app: Found original config file in /home/xiazhuozhao/deepqmc-230929/work/workdir
    [22:21:30.441] INFO:deepqmc.train: Start evaluation
    [22:21:58.536] INFO:deepqmc.train: Progress: 1/1000, energy = -8.1(1.0)
    evaluation:   0%|          | 0/1000 [00:27<?, ?it/s, E=-8.1(1.0)]
    Error executing job with overrides: ['task=evaluate', 'task.restdir=workdir']
    Error in call to target 'deepqmc.app.train_from_factories': AttributeError("'dict' object has no attribute 'ndim'")

    Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
    I0000 00:00:1695997319.880270   16715 tfrt_cpu_pjrt_client.cc:352] TfrtCpuClient destroyed.

mmezic commented 1 year ago

Hi, thanks for the detailed report! I can reproduce this error when trying to evaluate my ansätze. I think #179 should fix it. @szbernat could you check it?

xiazhuozhao commented 1 year ago

Thank you for your efforts. I pulled the latest code from GitHub with `git clone https://github.com/deepqmc/deepqmc.git` and reinstalled it with `pip install -e .[dev]`. Now the problem looks like this:

(deepqmc-1006) [xiazhuozhao@c2 deepqmc-1006]$ deepqmc hydra.run.dir=workdir
2023-10-06 14:02:58.090725: W external/xla/xla/service/gpu/nvptx_compiler.cc:702] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
[14:03:00.787] INFO:deepqmc.app: Entering application
[14:03:00.788] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
[14:03:00.789] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-1006/workdir
[14:03:42.681] INFO:deepqmc.train: Number of model parameters: 197589
[14:03:43.096] INFO:deepqmc.train: Pretraining wrt. baseline wave function
[14:03:43.965] INFO:deepqmc.wf.baseline.pyscfext: Running HF...
[14:03:44.966] INFO:deepqmc.wf.baseline.pyscfext: HF energy: -7.951971538662545
pretrain:  48%|██████████████████████████████████████████████████████████████████████████████████▉                                                                                         | 2411/5000 [01:37<01:15, 34.14it/s, MSE=3.36e-05]2023-10-06 14:05:21.066345: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: an unsupported value or parameter was passed to the function
2023-10-06 14:05:21.066477: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: CaptureGpuGraph failed (an unsupported value or parameter was passed to the function; current tracing scope: custom-call.30): INTERNAL: Failed to end stream capture: CUDA_ERROR_STREAM_CAPTURE_INVALIDATED: operation failed due to a previous error during capture
2023-10-06 14:05:21.066696: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2711] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.graph.launch' failed: CaptureGpuGraph failed (an unsupported value or parameter was passed to the function; current tracing scope: custom-call.30): INTERNAL: Failed to end stream capture: CUDA_ERROR_STREAM_CAPTURE_INVALIDATED: operation failed due to a previous error during capture; current profiling annotation: XlaModule:#hlo_module=pmap_pretrain_step,program_id=542#.
2023-10-06 14:05:31.066924: F external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2857] Replicated computation launch failed, but not all replicas terminated. Aborting process to work around deadlock. Failure message (there may have been multiple failures, see the error log for all failures):

Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.graph.launch' failed: CaptureGpuGraph failed (an unsupported value or parameter was passed to the function; current tracing scope: custom-call.30): INTERNAL: Failed to end stream capture: CUDA_ERROR_STREAM_CAPTURE_INVALIDATED: operation failed due to a previous error during capture; current profiling annotation: XlaModule:#hlo_module=pmap_pretrain_step,program_id=542#.
Aborted
(deepqmc-1006) [xiazhuozhao@c2 deepqmc-1006]$ deepqmc hydra.run.dir=workdir
2023-10-06 14:06:09.319226: W external/xla/xla/service/gpu/nvptx_compiler.cc:702] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
[14:06:11.957] INFO:deepqmc.app: Entering application
[14:06:11.958] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
[14:06:11.958] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-1006/workdir
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/xiazhuozhao/deepqmc-1006/install/deepqmc/src/deepqmc/app.py", line 174, in cli
    raise e.__cause__ from None
  File "/home/xiazhuozhao/anaconda3/envs/deepqmc-1006/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
    return _target_(*args, **kwargs)
  File "/home/xiazhuozhao/deepqmc-1006/install/deepqmc/src/deepqmc/app.py", line 72, in train_from_factories
    return train(hamil, ansatz, sampler=sampler, **kwargs)
  File "/home/xiazhuozhao/deepqmc-1006/install/deepqmc/src/deepqmc/train.py", line 183, in train
    h5file = h5py.File(os.path.join(workdir, 'result.h5'), 'a', libver='v110')
  File "/home/xiazhuozhao/anaconda3/envs/deepqmc-1006/lib/python3.9/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/xiazhuozhao/anaconda3/envs/deepqmc-1006/lib/python3.9/site-packages/h5py/_hl/files.py", line 243, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (file is already open for write/SWMR write (may use <h5clear file> to clear file consistency flags))

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
zenoone commented 1 year ago

Hi, this looks like a log file has not been closed/removed properly. Can you please make sure to clear your logging directory (workdir) and run again? This might have happened when the code terminated in an erroneous state before.
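The cleanup suggested above can be sketched in a few lines of standard-library Python; this is only an illustration of removing a stale `workdir` (the name matches the `hydra.run.dir` override used earlier in this thread), not part of deepqmc itself:

```python
import shutil
from pathlib import Path

# Remove any stale working directory left over from a crashed run, so the
# next run starts fresh (result.h5 would otherwise still carry the SWMR
# consistency flags set by the interrupted writer).
workdir = Path('workdir')
if workdir.exists():
    shutil.rmtree(workdir)
print(workdir.exists())  # → False
```

After this, rerunning `deepqmc hydra.run.dir=workdir` recreates the directory from scratch.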

xiazhuozhao commented 1 year ago

Sorry, but I tried removing workdir and reinstalling deepqmc, and got the same result. Can you provide a system environment (CUDA version, GPU, etc.) in which deepqmc is known to install and run successfully? I will try to replicate it.

szbernat commented 1 year ago

Hi,

Thanks for trying our code, and sorry for these difficulties.

Regarding the environment:

From the output you've posted it seems like you've tried to run deepqmc twice.

  1. The first attempt seems to have failed due to some internal jax errors. We've also experienced these errors with the newest jax release, 0.4.17, which came out a few days ago. They should be fixed by downgrading to the previous version, 0.4.16.
  2. The second attempt failed because it tried to open a log file from the previous run (since both attempts were started with the same working directory). Opening this file failed, because the previous run exited with an error, and could not properly close said log file.

To summarize:

  1. Create a new virtual environment.
  2. In this environment install version 0.4.16 of jax. Based on the output you posted this can be done with the command:
    pip install --upgrade "jax[cuda11_pip]==0.4.16" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
  3. Install the latest version of deepqmc from this repository (the latest commit, not the latest release).
  4. Try running deepqmc specifying an empty working directory.

Please report back whether jax==0.4.16 works; if so, we'll restrict our requirements to jax<0.4.17 until its bugs have been fixed.
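The version constraint discussed above (`jax<0.4.17`) can be illustrated with a small standard-library sketch; the helper names here are ours, written for illustration, and are not part of deepqmc or pip:

```python
# Parse 'major.minor.patch' version strings into comparable tuples and
# check them against the jax<0.4.17 pin discussed above.
def parse_version(v):
    return tuple(int(part) for part in v.split('.'))

def satisfies_pin(installed, upper_bound='0.4.17'):
    # True when the installed version is strictly below the buggy release.
    return parse_version(installed) < parse_version(upper_bound)

print(satisfies_pin('0.4.16'))  # → True: the known-good release
print(satisfies_pin('0.4.17'))  # → False: the release with the XLA errors
```

Tuple comparison is lexicographic, which is exactly how pip orders simple `major.minor.patch` versions, so `(0, 4, 16) < (0, 4, 17)` holds as expected.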

Digoba commented 1 year ago

Hi! I'm having some trouble using deepqmc after the upgrade. This is the error when working with FermiNet, but something similar happens when working with PauliNet. The thing is that this problem didn't happen with the previous version. Do you know what could be happening?

(screenshot of the error attached)
szbernat commented 1 year ago

Hi Diego,

Thanks for trying our code!

Could you specify which version of deepqmc you are trying to use? Is it the latest released version, 1.1.0? If so, I'd suggest installing the latest commit from this repository, as it contains fixes for issues similar to yours.

We are working on making a new release fixing these issues as soon as possible.

Digoba commented 1 year ago

Yes, I'm working with the latest released version (1.1.0).

I first installed JAX with CUDA 12 support, as instructed:

pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Then, I installed from the git repository, as the instructions said:

git clone https://github.com/deepqmc/deepqmc
cd deepqmc
pip install -e .[dev]

It starts, but I get the error I mentioned. How can I get the latest commits? I did a git pull but nothing updated.

szbernat commented 1 year ago

Hmm, so far we couldn't reproduce this issue on our end.

There was a bug that produced a similar error message some time ago, but that should be fixed with the latest commits.

how can I get the latest commits, I did a git pull but nothing updated.

Then you should be using the latest commits already.

Could you please post the contents of the energy.py script which you use to run the calculation? I'm interested in how you initialize the ansatz. Are you following this section of the tutorial? More concretely, are you passing the `_convert_='all'` argument to `hydra.utils.instantiate`? This argument was erroneously missing from an earlier version of the tutorial, so maybe your script still follows that earlier, buggy version.

Digoba commented 1 year ago

Of course, this is the energy.py script:

#CREATE A MOLECULE
from deepqmc import Molecule
mol = Molecule.from_name('CH4')

#CREATE THE MOLECULAR HAMILTONIAN
from deepqmc import MolecularHamiltonian
H = MolecularHamiltonian(mol=mol)

#CREATE A WAVE FUNCTION ANSATZ
import os

import haiku as hk
from hydra import compose, initialize_config_dir
from hydra.utils import instantiate

import deepqmc
from deepqmc.app import instantiate_ansatz
from deepqmc.wf import NeuralNetworkWaveFunction

deepqmc_dir = os.path.dirname(deepqmc.__file__)
config_dir = os.path.join(deepqmc_dir, 'conf/ansatz')

with initialize_config_dir(version_base=None, config_dir=config_dir):
    cfg = compose(config_name='paulinet')

_ansatz = instantiate(cfg, _recursive_=True, _convert_='all')
ansatz = instantiate_ansatz(H, _ansatz)

def ansatz(phys_conf, return_mos=False):
    return _ansatz(H)(phys_conf, return_mos=return_mos)

#INSTANTIATE A SAMPLER
from deepqmc.sampling import chain, MetropolisSampler, DecorrSampler

sampler = chain(DecorrSampler(length=20),MetropolisSampler(H))

#OPTIMIZE THE ANSATZ
from deepqmc import train
train(H, ansatz, 'kfac', sampler, steps=10000, sample_size=2000, seed=42, workdir='Outputs')

I wrote the script as the tutorial says.

Digoba commented 1 year ago

I ran it again and now I get the following error:

(screenshot of the error attached)
szbernat commented 1 year ago

Thanks for the additional info!

I couldn't exactly reproduce this issue with the latest version of the code. I suspect there might still be a version mismatch.

Coincidentally, the new v1.1.1 release has just dropped. I'd suggest updating to that version so that, at the very least, we can more easily help with debugging.

Here is my step-by-step suggestion (I tested that this works on my end):

  1. Create a fresh, new virtual environment.
  2. Activate the environment, and make sure you have the latest version of pip with the command: pip install --upgrade pip
  3. Install jax as usual with: pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
  4. Install the latest 1.1.1 release of deepqmc with: pip install deepqmc.
  5. Run the following, slightly modified version of your energy.py script. This version should match the current tutorial.
    
    #CREATE A MOLECULE
    from deepqmc import Molecule
    mol = Molecule.from_name('CH4')

    #CREATE THE MOLECULAR HAMILTONIAN
    from deepqmc import MolecularHamiltonian
    H = MolecularHamiltonian(mol=mol)

    #CREATE A WAVE FUNCTION ANSATZ
    import os

    import haiku as hk
    from hydra import compose, initialize_config_dir
    from hydra.utils import instantiate

    import deepqmc
    from deepqmc.app import instantiate_ansatz
    from deepqmc.wf import NeuralNetworkWaveFunction

    deepqmc_dir = os.path.dirname(deepqmc.__file__)
    config_dir = os.path.join(deepqmc_dir, 'conf/ansatz')

    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(config_name='paulinet')

    _ansatz = instantiate(cfg, _recursive_=True, _convert_='all')
    ansatz = instantiate_ansatz(H, _ansatz)

    #INSTANTIATE A SAMPLER
    from deepqmc.sampling import chain, MetropolisSampler, DecorrSampler

    sampler = chain(DecorrSampler(length=20), MetropolisSampler(H))

    #OPTIMIZE THE ANSATZ
    from deepqmc import train
    train(H, ansatz, 'kfac', sampler, steps=10000, electron_batch_size=2000, seed=42, workdir='Outputs')


If this works, and you want to install `deepqmc` in editable mode, such that you can develop it:

1. In your virtual environment, uninstall the pypi package: `pip uninstall deepqmc`
2. Clone this git repository: `git clone https://github.com/deepqmc/deepqmc.git`
3. Change to the repository's directory and install in editable mode: `cd deepqmc && pip install --upgrade -e .[dev]`
Digoba commented 1 year ago

Hello, I have tried what you told me, but now I get the following error:

(screenshot of the error attached)

szbernat commented 1 year ago

Hi,

This is because the computation for a batch of electron samples is parallelized over the available GPUs. For example if you specify electron_batch_size=1000 and you have two GPUs available, the computations for the first 500 samples will be carried out on the first GPU and the computations for the other 500 samples on the second GPU. However, if the electron_batch_size is not divisible by the number of available GPUs this kind of partitioning doesn't work. So electron_batch_size % device_count must equal 0.
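The divisibility rule described above can be sketched as follows. `per_device_batch_size` is a hypothetical helper written for illustration, not a deepqmc function; in a real script the device count would come from `jax.device_count()`:

```python
def per_device_batch_size(electron_batch_size, device_count):
    # The electron batch is split evenly across the devices, so the batch
    # size must be a multiple of the device count.
    if electron_batch_size % device_count != 0:
        raise ValueError(
            f'electron_batch_size={electron_batch_size} is not divisible '
            f'by device_count={device_count}'
        )
    return electron_batch_size // device_count

print(per_device_batch_size(1000, 2))  # → 500 samples per GPU
# per_device_batch_size(1000, 3) would raise ValueError
```

This mirrors the example in the comment above: 1000 samples on two GPUs give 500 samples per device, while 1000 samples on three GPUs cannot be partitioned evenly.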

Are you intentionally trying to run on multiple GPUs? If so make sure to specify an electron_batch_size that is divisible by the number of GPUs. If not, you can use the CUDA_VISIBLE_DEVICES environment variable to restrict deepqmc to use only a single GPU, e.g. export CUDA_VISIBLE_DEVICES=0 will make sure that deepqmc only utilizes the first GPU of the machine.

Hope this helps!

szbernat commented 1 year ago

Btw, I'm going to close this issue now, since the remaining problems are no longer related to the original issue of installing deepqmc.

If you run into any more problems, please don't hesitate to open a new issue.

xiazhuozhao commented 12 months ago

Hello, @szbernat. It's been a while since we last worked together on this issue, and I greatly appreciate the effort you've put into it. We identified that the outdated NVIDIA driver, which could not support CUDA 11.8, was the culprit, and I managed to upgrade the driver to version 535.

However, after the successful update, the server we initially tested on fell victim to a network attack, and the NIC disconnected all servers from the internet. We were left with only a jump server with 1 MB of bandwidth as a data relay, which made installing the various dependency libraries exceptionally challenging. We then resorted to computing resources from a public supercomputing center. Unfortunately, when we contacted the service provider to upgrade the NVIDIA driver there, their erroneous actions left that server unable to boot; we couldn't even access the BIOS. We hastily retrieved our data from the RAID array and are awaiting the provider's resolution.

It's truly unfortunate that, aside from these two machines, we don't have any others available. However, a member of our team is actively exploring ways to restore the first server's internet access and may be close to resolving this. We'll certainly post an update here once our testing makes progress.

szbernat commented 12 months ago

Hi, I'm sorry to hear about all these difficulties, it sounds like your last few weeks have been very challenging. Managing NVIDIA drivers has never been easy, and in our experience JAX can also add to the complications.

Do let us know when you're back on track, we'd love to hear how DeepQMC is working out for you.