Closed: xiazhuozhao closed this issue 1 year ago.
Hi, thanks for the detailed report! I can reproduce this error when trying to evaluate my ansätze. I think #179 should fix it. @szbernat could you check it?
Thank you for your efforts. I pulled the latest code from GitHub (`git clone https://github.com/deepqmc/deepqmc.git`) and reinstalled it (`pip install -e .[dev]`).
Now the problem looks like this:
(deepqmc-1006) [xiazhuozhao@c2 deepqmc-1006]$ deepqmc hydra.run.dir=workdir
2023-10-06 14:02:58.090725: W external/xla/xla/service/gpu/nvptx_compiler.cc:702] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
[14:03:00.787] INFO:deepqmc.app: Entering application
[14:03:00.788] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
[14:03:00.789] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-1006/workdir
[14:03:42.681] INFO:deepqmc.train: Number of model parameters: 197589
[14:03:43.096] INFO:deepqmc.train: Pretraining wrt. baseline wave function
[14:03:43.965] INFO:deepqmc.wf.baseline.pyscfext: Running HF...
[14:03:44.966] INFO:deepqmc.wf.baseline.pyscfext: HF energy: -7.951971538662545
pretrain: 48%|██████████████████████████████████████████████████████████████████████████████████▉ | 2411/5000 [01:37<01:15, 34.14it/s, MSE=3.36e-05]2023-10-06 14:05:21.066345: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: an unsupported value or parameter was passed to the function
2023-10-06 14:05:21.066477: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: CaptureGpuGraph failed (an unsupported value or parameter was passed to the function; current tracing scope: custom-call.30): INTERNAL: Failed to end stream capture: CUDA_ERROR_STREAM_CAPTURE_INVALIDATED: operation failed due to a previous error during capture
2023-10-06 14:05:21.066696: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2711] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.graph.launch' failed: CaptureGpuGraph failed (an unsupported value or parameter was passed to the function; current tracing scope: custom-call.30): INTERNAL: Failed to end stream capture: CUDA_ERROR_STREAM_CAPTURE_INVALIDATED: operation failed due to a previous error during capture; current profiling annotation: XlaModule:#hlo_module=pmap_pretrain_step,program_id=542#.
2023-10-06 14:05:31.066924: F external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2857] Replicated computation launch failed, but not all replicas terminated. Aborting process to work around deadlock. Failure message (there may have been multiple failures, see the error log for all failures):
Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.graph.launch' failed: CaptureGpuGraph failed (an unsupported value or parameter was passed to the function; current tracing scope: custom-call.30): INTERNAL: Failed to end stream capture: CUDA_ERROR_STREAM_CAPTURE_INVALIDATED: operation failed due to a previous error during capture; current profiling annotation: XlaModule:#hlo_module=pmap_pretrain_step,program_id=542#.
Aborted
(deepqmc-1006) [xiazhuozhao@c2 deepqmc-1006]$ deepqmc hydra.run.dir=workdir
2023-10-06 14:06:09.319226: W external/xla/xla/service/gpu/nvptx_compiler.cc:702] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
[14:06:11.957] INFO:deepqmc.app: Entering application
[14:06:11.958] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
[14:06:11.958] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-1006/workdir
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/xiazhuozhao/deepqmc-1006/install/deepqmc/src/deepqmc/app.py", line 174, in cli
raise e.__cause__ from None
File "/home/xiazhuozhao/anaconda3/envs/deepqmc-1006/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
return _target_(*args, **kwargs)
File "/home/xiazhuozhao/deepqmc-1006/install/deepqmc/src/deepqmc/app.py", line 72, in train_from_factories
return train(hamil, ansatz, sampler=sampler, **kwargs)
File "/home/xiazhuozhao/deepqmc-1006/install/deepqmc/src/deepqmc/train.py", line 183, in train
h5file = h5py.File(os.path.join(workdir, 'result.h5'), 'a', libver='v110')
File "/home/xiazhuozhao/anaconda3/envs/deepqmc-1006/lib/python3.9/site-packages/h5py/_hl/files.py", line 567, in __init__
fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
File "/home/xiazhuozhao/anaconda3/envs/deepqmc-1006/lib/python3.9/site-packages/h5py/_hl/files.py", line 243, in make_fid
fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (file is already open for write/SWMR write (may use <h5clear file> to clear file consistency flags))
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Hi, this looks like a log file was not closed/removed properly. Can you please make sure to clear your logging directory (`workdir`) and run again? This might have happened when the code terminated in an erroneous state before.
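As a minimal sketch of that cleanup (assuming the default `workdir/result.h5` path shown in the traceback; adjust to your actual working directory):

```python
# Sketch: remove a stale result file left behind by a crashed run,
# so the next run can recreate it cleanly.
# The path is assumed from the traceback above (workdir/result.h5).
import os

stale = os.path.join('workdir', 'result.h5')
if os.path.exists(stale):
    os.remove(stale)  # or simply delete the whole workdir before rerunning
```

Deleting the entire `workdir` is equally effective if you don't need the previous results.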
Sorry, but I tried removing `workdir` and reinstalling deepqmc, and got the same result. Can you provide a system environment (CUDA version, GPU, system environment, etc.) in which deepqmc can be successfully installed and run? I will try to replicate it.
Hi,
Thanks for trying our code, and sorry for these difficulties.
Regarding the environment:

> 2023-10-06 14:02:58.090725: W external/xla/xla/service/gpu/nvptx_compiler.cc:702] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
From the output you've posted it seems like you've tried to run `deepqmc` twice. The first run crashed with `jax` errors. We've also experienced these errors with the newest `0.4.17` jax version released a few days ago. These errors should be fixed by downgrading to the previous `0.4.16` version of jax.

To summarize:

1. Install version `0.4.16` of jax. Based on the output you posted, this can be done with the command: `pip install --upgrade "jax[cuda11_pip]==0.4.16" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html`
2. Install `deepqmc` from this repository (the latest commit, not the latest release).
3. Run `deepqmc` specifying an empty working directory.

Please report back; if using `jax==0.4.16` works, we'll restrict our requirements to `jax<0.4.17` until its bugs have been fixed.
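As a quick sanity check after downgrading, one could verify that the installed version satisfies the `jax<0.4.17` constraint. This is an illustrative sketch, not part of deepqmc; the helper name is made up:

```python
# Sketch: check that a dotted version string is strictly below a pin.
def satisfies_pin(installed: str, maximum: str = '0.4.17') -> bool:
    """Return True if `installed` is strictly older than `maximum`."""
    as_tuple = lambda s: tuple(int(part) for part in s.split('.'))
    return as_tuple(installed) < as_tuple(maximum)

print(satisfies_pin('0.4.16'))  # → True
print(satisfies_pin('0.4.17'))  # → False
```

In practice, `importlib.metadata.version('jax')` gives the installed version string to feed into such a check.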
Hi! I'm having some trouble using deepqmc after the upgrade. This is the error when working with FermiNet, but something similar happens when working with PauliNet. The thing is that this problem didn't happen with the previous version. Do you know what could be happening?
Hi Diego,
Thanks for trying our code!
Could you specify which version of `deepqmc` you are trying to use? Is it the latest released version, `1.1.0`? If so, I'd suggest installing the latest commit from this repository, as those commits contain fixes for issues similar to yours.
We are working on making a new release fixing these issues as soon as possible.
Yes, I'm working with the latest released version (1.1.0).
I first installed CUDA 12 support as instructed:
`pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html`
Then I installed the git version, as the instructions said:

```
git clone https://github.com/deepqmc/deepqmc
cd deepqmc
pip install -e .[dev]
```

It starts, but I get the error I mentioned. How can I get the latest commits? I did a `git pull` but nothing updated.
Hmm, so far we couldn't reproduce this issue on our end.
There was a bug that produced a similar error message some time ago, but that should be fixed with the latest commits.
> how can I get the latest commits, I did a git pull but nothing updated.
Then you should be using the latest commits already.
Could you please post the contents of the `energy.py` script which you use to run the calculation? I'm interested in how you initialize the ansatz. Are you following this section of the tutorial? More concretely, are you passing the `_convert_='all'` argument to `hydra.utils.instantiate`? This argument was erroneously missing from an earlier version of the tutorial, so maybe your script is still following the earlier buggy version.
Of course, this is the `energy.py` script:

```python
# CREATE A MOLECULE
from deepqmc import Molecule
mol = Molecule.from_name('CH4')

# CREATE THE MOLECULAR HAMILTONIAN
from deepqmc import MolecularHamiltonian
H = MolecularHamiltonian(mol=mol)

# CREATE A WAVE FUNCTION ANSATZ
import os
import haiku as hk
from hydra import compose, initialize_config_dir
from hydra.utils import instantiate
import deepqmc
from deepqmc.app import instantiate_ansatz
from deepqmc.wf import NeuralNetworkWaveFunction

deepqmc_dir = os.path.dirname(deepqmc.__file__)
config_dir = os.path.join(deepqmc_dir, 'conf/ansatz')
with initialize_config_dir(version_base=None, config_dir=config_dir):
    cfg = compose(config_name='paulinet')
_ansatz = instantiate(cfg, _recursive_=True, _convert_='all')
ansatz = instantiate_ansatz(H, _ansatz)

def ansatz(phys_conf, return_mos=False):
    return _ansatz(H)(phys_conf, return_mos=return_mos)

# INSTANTIATE A SAMPLER
from deepqmc.sampling import chain, MetropolisSampler, DecorrSampler
sampler = chain(DecorrSampler(length=20), MetropolisSampler(H))

# OPTIMIZE THE ANSATZ
from deepqmc import train
train(H, ansatz, 'kfac', sampler, steps=10000, sample_size=2000, seed=42, workdir='Outputs')
```
I wrote the script as the tutorial says.
I ran it again and now I get the following error:
Thanks for the additional info!
I couldn't exactly reproduce this issue with the latest version of the code. I suspect there still might be some version mismatch.
Coincidentally, the newest release, `v1.1.1`, has just dropped; I'd suggest updating to that version, so at the very least we can more easily help with debugging.
Here is my step-by-step suggestion (tested that this works on my end):

1. Upgrade `pip` with the command: `pip install --upgrade pip`
2. Install `jax` as usual with: `pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html`
3. Install the `1.1.1` release of `deepqmc` with: `pip install deepqmc`.
4. Run the following updated `energy.py` script. This version should match the current tutorial.
```python
# CREATE A MOLECULE
from deepqmc import Molecule
mol = Molecule.from_name('CH4')

# CREATE THE MOLECULAR HAMILTONIAN
from deepqmc import MolecularHamiltonian
H = MolecularHamiltonian(mol=mol)

# CREATE A WAVE FUNCTION ANSATZ
import os
import haiku as hk
from hydra import compose, initialize_config_dir
from hydra.utils import instantiate
import deepqmc
from deepqmc.app import instantiate_ansatz
from deepqmc.wf import NeuralNetworkWaveFunction

deepqmc_dir = os.path.dirname(deepqmc.__file__)
config_dir = os.path.join(deepqmc_dir, 'conf/ansatz')
with initialize_config_dir(version_base=None, config_dir=config_dir):
    cfg = compose(config_name='paulinet')
_ansatz = instantiate(cfg, _recursive_=True, _convert_='all')
ansatz = instantiate_ansatz(H, _ansatz)

# INSTANTIATE A SAMPLER
from deepqmc.sampling import chain, MetropolisSampler, DecorrSampler
sampler = chain(DecorrSampler(length=20), MetropolisSampler(H))

# OPTIMIZE THE ANSATZ
from deepqmc import train
train(H, ansatz, 'kfac', sampler, steps=10000, electron_batch_size=2000, seed=42, workdir='Outputs')
```
If this works, and you want to install `deepqmc` in editable mode, such that you can develop it:
1. In your virtual environment, uninstall the pypi package: `pip uninstall deepqmc`
2. Clone this git repository: `git clone https://github.com/deepqmc/deepqmc.git`
3. Change to the repository's directory and install in editable mode: `cd deepqmc && pip install --upgrade -e .[dev]`
Hello, I have tried what you told me, but now I get the following error:
Hi,
This is because the computation for a batch of electron samples is parallelized over the available GPUs. For example, if you specify `electron_batch_size=1000` and you have two GPUs available, the computations for the first 500 samples will be carried out on the first GPU and the computations for the other 500 samples on the second GPU. However, if the `electron_batch_size` is not divisible by the number of available GPUs, this kind of partitioning doesn't work. So `electron_batch_size % device_count` must equal 0.
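The constraint can be sketched as follows (illustrative only, not deepqmc's actual code; the function name is made up):

```python
# Sketch: an electron batch can only be split evenly across devices
# when the batch size is divisible by the device count.
def per_device_batch_size(electron_batch_size: int, device_count: int) -> int:
    if electron_batch_size % device_count != 0:
        raise ValueError(
            f'electron_batch_size={electron_batch_size} is not divisible '
            f'by device_count={device_count}'
        )
    return electron_batch_size // device_count

print(per_device_batch_size(1000, 2))  # → 500 (500 samples per GPU)
```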
Are you intentionally trying to run on multiple GPUs? If so, make sure to specify an `electron_batch_size` that is divisible by the number of GPUs. If not, you can use the `CUDA_VISIBLE_DEVICES` environment variable to restrict `deepqmc` to a single GPU; e.g. `export CUDA_VISIBLE_DEVICES=0` will make sure that `deepqmc` only utilizes the first GPU of the machine.
Hope this helps!
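When running from a Python script rather than exporting the variable in a shell, the same restriction can be applied in the script itself (a sketch; the variable must be set before jax or any CUDA library initializes):

```python
# Sketch: restrict the process to the first GPU.
import os

# Must be set before jax (or any CUDA-using library) is imported/initialized,
# so place this at the very top of the script.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
```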
Btw, I'm going to close this issue now, since these problems are no longer related to the original problems with installing `deepqmc`.
If you run into any more problems, please don't hesitate to open a new issue.
Hello, @szbernat. It's been a while since your team and I last collaborated on solving this issue. I greatly appreciate the effort you've put into it. We identified that the outdated NVIDIA driver, which could not support CUDA 11.8, was the culprit, and I managed to upgrade the driver to version 535. However, after the successful update, the server we initially tested on fell victim to a network attack, and the NIC decided to disconnect all servers from the internet. We were left with the only option of using a jumpserver with 1 MB of bandwidth as a data relay, which made installing the various dependency libraries exceptionally challenging. We then resorted to using computing resources from a public supercomputing center. Unfortunately, when we contacted the service provider to upgrade the NVIDIA driver, their erroneous actions left our server unable to boot; we couldn't even access the BIOS. We hastily retrieved the data from the RAID hard drives and await the service provider's resolution. It's truly unfortunate that, aside from these two machines, we don't have any others available. However, a member of our team is actively exploring ways to reconnect the server to the internet, and they might be on the verge of resolving this issue. We'll certainly provide an update here once there's progress in our testing.
Hi, I'm sorry to hear about all these difficulties, it sounds like your last few weeks have been very challenging. Managing NVIDIA drivers has never been easy, and in our experience JAX can also add to the complications.
Do let us know when you're back on track, we'd love to hear how DeepQMC is working out for you.
I'm glad to see you have released deepQMC v1.1.0, which resolves the dependency issues I had installing v1.0.1. However, I'm still running into some errors that I'm hoping you can help with.
System environment
- CPU: Intel(R) Xeon(R) Platinum 8375C @ 2.90GHz
- GPU: NVIDIA A800
- CUDA: 11.6
- Linux version: 3.10.0-1160.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)) #1 SMP Mon Oct 19 16:18:59 UTC 2020
- Run directly on a compute node, without using slurm.
What did I do?
```
conda create --name deepqmc python==3.9
conda activate deepqmc
pip install -U deepqmc
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```
What's wrong?
see this:
(deepqmc) [xiazhuozhao@c2 work]$ deepqmc task=evaluate task.restdir=workdir
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1695997281.783160 16715 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
2023-09-29 22:21:25.236598: W external/xla/xla/service/gpu/nvptx_compiler.cc:708] The NVIDIA driver's CUDA version is 11.6 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
[22:21:27.980] INFO:deepqmc.app: Entering application
[22:21:27.982] INFO:deepqmc.app: Running on 2 NVIDIA GRAPHICS DEVICEs with 1 process
[22:21:27.982] INFO:deepqmc.app: Will work in /home/xiazhuozhao/deepqmc-230929/work/outputs/2023-09-29/22-21-27
[22:21:28.069] INFO:deepqmc.app: Found original config file in /home/xiazhuozhao/deepqmc-230929/work/workdir
[22:21:30.441] INFO:deepqmc.train: Start evaluation
[22:21:58.536] INFO:deepqmc.train: Progress: 1/1000, energy = -8.1(1.0)
evaluation: 0%| | 0/1000 [00:27<?, ?it/s, E=-8.1(1.0)]
Error executing job with overrides: ['task=evaluate', 'task.restdir=workdir']
Error in call to target 'deepqmc.app.train_from_factories': AttributeError("'dict' object has no attribute 'ndim'")
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I0000 00:00:1695997319.880270 16715 tfrt_cpu_pjrt_client.cc:352] TfrtCpuClient destroyed.