QCG memory issues with small system

rkingsbury commented 1 year ago

Hello, I'm quite excited about the QCG algorithm in CREST. I'm trying to use it to generate clusters of bare alkali cations (Li+, Na+, etc.) solvated with approximately 20 water molecules. Despite the small system size, I seem to be encountering memory problems. The exact failure mode varies, but error messages usually contain one of the following:

segmentation fault
invalid memory reference
munmap_chunk(): invalid pointer
free(): invalid size
corrupted size vs. prev_size

The command I am using is shown below. Li.xyz contains a single lithium atom and H2O.xyz contains a single water molecule.

crest Li.xyz --qcg H2O.xyz --nsolv 20 --alpb water --chrg 1 --gsolv --nocross > qcg.out

I have tried this on a desktop workstation with 32 GB of RAM and on the Perlmutter supercomputer with 256 GB of RAM and saw similar behavior on both. I have experimented with different values of the OMP_NUM_THREADS variable and with running ulimit -s unlimited.

Anecdotally, it seems like adding --wscal 1.0 or adding the --nocross flag based on #109 both help the calculation get further before running out of RAM, but nothing allows it to complete.

I'd appreciate any suggestions on how to troubleshoot the above errors.

cplett commented 1 year ago

Hi, did you also set the stack size with the OMP_STACKSIZE environment variable? This is usually required for more expensive calculations. I typically use 4G, which has caused no problems so far (including in a test run with your system). To do so, simply add "export OMP_STACKSIZE=4G" to your .bashrc.
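
For reference, a minimal shell sketch of the environment setup discussed so far. The 4G stack size is the value suggested above; the thread count of 4 is only an illustrative assumption and should be adapted to the machine.

```shell
# Per-thread OpenMP stack size (value suggested in this thread).
export OMP_STACKSIZE=4G
# Number of OpenMP threads; 4 is an illustrative assumption, adjust to your machine.
export OMP_NUM_THREADS=4
# Lift the shell's own stack limit; this can fail if the hard limit is lower.
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit (current: $(ulimit -s))"
# Confirm what child processes such as crest will inherit.
echo "OMP_STACKSIZE=$OMP_STACKSIZE OMP_NUM_THREADS=$OMP_NUM_THREADS stack=$(ulimit -s)"
```

Putting the export lines in .bashrc makes them persistent, whereas ulimit only affects the current shell session and the processes started from it.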

rkingsbury commented 1 year ago

Thank you for the reply. Unfortunately, OMP_STACKSIZE does not seem to have made much difference. I only tested on the desktop workstation. With settings of 4G, 8G, and 16G, the calculation failed at 6, 10, and 6 solvent molecules, respectively, with the following errors:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
free(): invalid size
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

cplett commented 1 year ago

Which versions of xtb and crest are you using? Did you compile them yourself from the source code, or are you using precompiled binaries? If you compiled the programs from source, it would be good to know which compiler you used and whether you are using the xtbiff program or the new docking algorithm in xtb.

rkingsbury commented 1 year ago

I'm using pre-compiled versions of xtb==6.5.1 and crest==2.12, both installed from conda-forge. They are running in WSL2 Linux on a Windows 11 host. I used the same versions from conda-forge when testing on Perlmutter.

For xtbiff, I downloaded and extracted the pre-compiled binary per the documentation.

EDIT: I tested on Perlmutter with a stacksize up to 32G, and got the same result. Specifically, after n=6 water molecules, the Li calculation failed with corrupted size vs. prev_size

rkingsbury commented 1 year ago

Interestingly, the Li + H2O calculation seems to always fail after adding 6 H2O, regardless of OMP stacksize, --wscal value, or --nocross option. Nothing in qcg.out indicates a problem. See example below:

  Size  E /Eh       De/kcal   Detot/kcal  Density   Efix         R   av/act. Surface   Opt
    1    -4.976112  -44.89     -44.89       0.458     -2.488      0.0   0.0     610.1   normal
    2   -10.102586  -35.10     -79.98       0.656     -3.818      3.5   3.5     733.7   normal
    3   -15.216194  -27.02    -107.01       0.790     -4.812      3.6   3.6     865.4   normal
    4   -20.319606  -20.62    -127.63       0.880     -5.636      3.7   3.7    1005.7   normal
   Wall Potential too small, increasing size by 5 %
   New scaling factor 0.73
   Wall Potential too small, increasing size by 5 %
   New scaling factor 0.77
   Wall Potential too small, increasing size by 5 %
   New scaling factor 0.81
    5   -25.407237  -10.72    -138.35       0.925     -6.352      3.7   5.1    1175.1   normal
    6   -30.497136  -12.15    -150.50       0.990     -6.997      4.1   4.9    1301.7   normal
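
A small helper for checking how far a run got before it crashed. This is only a sketch: it assumes the cluster growth table in qcg.out has the ten-column layout shown in the excerpt above, with the cluster size as an integer in the first column. The function name last_qcg_size is made up for illustration.

```shell
# Print the largest cluster size that has a completed row in the growth table.
# Assumes rows look like the excerpt: integer size first, ten whitespace-separated fields.
last_qcg_size() {
    awk '$1 ~ /^[0-9]+$/ && NF == 10 { n = $1 } END { print n + 0 }' "$1"
}
```

Calling last_qcg_size qcg.out prints the last completed size, or 0 if no table row is found.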

cplett commented 1 year ago

With the conda build, I could reproduce the issue. This seems to be a bug in the crest version that is built with conda. The bug is no longer present in the current source code. Therefore, I would suggest that, if possible, you compile the current crest and xtb codes yourself. You would then also be able to use the new docking algorithm in xtb instead of xtbiff, which usually leads to better performance.

rkingsbury commented 1 year ago

OK, I will try to compile everything. I'm glad you were able to reproduce and thanks so much for the guidance!

rkingsbury commented 1 year ago

OK, I have managed to compile the latest versions of xtb and crest on Perlmutter (using the GNU compilers). The xtb binary passed all built-in tests and seems to work normally. The crest binary seems to work normally for multi-atom solutes (e.g., crest H2O.xyz behaves as expected), but when I try to run the qcg command above, the initial geometry optimization fails:

...
 Solute geometry
  molecular radius (Bohr**1):    4.89
  molecular area   (Bohr**2):  300.96
  molecular volume (Bohr**3):  490.95
 Solvent geometry
  molecular radius (Bohr**1):    3.90
  molecular area   (Bohr**2):  197.02
  molecular volume (Bohr**3):  247.66

  radius of solute    :     7.89
  radius of solvent   :     6.28

  =========================================
  |            Preoptimization            |
  =========================================

 -------------------------
 xTB Geometry Optimization
 -------------------------

  Initial geometry optimization failed!
  Please check your input.

I get the same error if I try to run a simple crest command on the Li.xyz structure, e.g. crest Li.xyz. It's not clear to me why this would be happening with the newly compiled version but wasn't happening before. Could this be related to moving away from xtbiff?

rkingsbury commented 1 year ago

Interestingly, the exact same command seems to work properly with a multi-atom solute, e.g.

crest H2O.xyz --qcg H2O.xyz --nsolv 20 --alpb water --chrg 1 --gsolv > qcg.out

so I think perhaps I've found another bug?

rkingsbury commented 1 year ago

Actually, I tried using the newly-compiled crest with the --xtbiff flag and I got the same result, so I think this is either a problem with my compilation or a bug in the recent changes to CREST.

However, I was able to work around the problem by using the --nopreopt argument to skip the initial optimization.
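
For anyone hitting the same error, the workaround might look like the sketch below. It assumes the command from the top of the thread with --nopreopt appended; the exact set of flags used for the successful run is not stated here, so treat the flag combination as an assumption.

```shell
# Command from the top of the thread with --nopreopt appended (assumed flag set).
qcg_cmd="crest Li.xyz --qcg H2O.xyz --nsolv 20 --alpb water --chrg 1 --gsolv --nocross --nopreopt"
echo "$qcg_cmd"
# Run only when crest is installed and both input files are present.
if command -v crest >/dev/null 2>&1 && [ -f Li.xyz ] && [ -f H2O.xyz ]; then
    $qcg_cmd > qcg.out   # unquoted on purpose so the string splits into arguments
fi
```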

cplett commented 1 year ago

Normally, single atoms should also work with the preoptimization. Could you please provide me with your input files Li.xyz and H2O.xyz? Maybe I can reproduce and fix the problem with your coordinates.

rkingsbury commented 1 year ago

See below. Thanks for all your help with this!

Li.xyz

1

Li 0 0 0

H2O.xyz

3

O          0.92716       -0.03336        0.04754
H          1.91648       -0.05798        0.08262
H          0.63779       -0.57738        0.82273
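
To recreate both inputs in one step, something like the following should work. It assumes the files follow the usual XYZ layout (atom count, one comment line, then coordinates) and that the comment line is empty, as the listings above suggest.

```shell
# Single lithium atom at the origin; the second line is the (empty) XYZ comment line.
cat > Li.xyz <<'EOF'
1

Li 0 0 0
EOF

# Single water molecule with the coordinates posted above.
cat > H2O.xyz <<'EOF'
3

O          0.92716       -0.03336        0.04754
H          1.91648       -0.05798        0.08262
H          0.63779       -0.57738        0.82273
EOF
```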

github-actions[bot] commented 8 months ago

This issue had no activity for 6 months. It will be closed in 1 week unless there is some new activity.

crest-lab / crest

QCG memory issues with small system #149