ICAMS / calphy

A Python library and command line interface for automated free energy calculations

Cannot launch lammps in calphy calculation #105

Closed yuxzhou closed 3 weeks ago

yuxzhou commented 7 months ago

Dear developers and users,

I'm trying to run a ts calculation, in which I expect MD to run after initialization. However, the job seems to hang while trying to wake the LAMMPS driver: it doesn't die, but no output is generated.

This is the calphy.log file:

2024-02-16 22:51:49,035 calphy.helpers INFO     ---------------input file----------------
2024-02-16 22:51:49,035 calphy.helpers INFO     commented out as causes crash when we're expanding the T range after a fail run
2024-02-16 22:51:49,035 calphy.helpers INFO     ------------end of input file------------
2024-02-16 22:51:49,035 calphy.helpers INFO     Temperature start: 4050.000000 K, temperature stop: 1550.000000 K, pressure: 5000.000000 bar
2024-02-16 22:51:49,035 calphy.helpers INFO     Pressure adjusted in aniso
2024-02-16 22:51:49,035 calphy.helpers INFO     Reference phase is liquid
2024-02-16 22:51:49,035 calphy.helpers INFO     Melting cycle is turned off
2024-02-16 22:51:49,035 calphy.helpers INFO     Equilibration stage is done using nose-hoover barostat/thermostat
2024-02-16 22:51:49,035 calphy.helpers INFO     Nose-Hoover thermostat damping is 0.100000
2024-02-16 22:51:49,035 calphy.helpers INFO     Nose-Hoover barostat damping is 0.100000
2024-02-16 22:51:49,035 calphy.helpers INFO     These values can be tuned by adding in the input file:
2024-02-16 22:51:49,035 calphy.helpers INFO     nose_hoover:
2024-02-16 22:51:49,036 calphy.helpers INFO        thermostat_damping: <float>
2024-02-16 22:51:49,036 calphy.helpers INFO        barostat_damping: <float>
2024-02-16 22:51:49,036 calphy.helpers INFO     Integration stage is done using Nose-Hoover thermostat and barostat when needed
2024-02-16 22:51:49,036 calphy.helpers INFO     Thermostat damping is 0.100000
2024-02-16 22:51:49,036 calphy.helpers INFO     Barostat damping is 0.100000
2024-02-16 22:51:49,036 calphy.helpers INFO     4536 atoms in 1 cells on 64 cores
2024-02-16 22:51:49,036 calphy.helpers INFO     pair_style: pace
2024-02-16 22:51:49,036 calphy.helpers INFO     pair_coeff: * * /work/home/zhanggweitest/yxzhou/ML_potentials/te-upfit-iter0.yace Te

and this is my input:

- element: 'Te'
  mass: 127.603
  mode: ts
  temperature: [1050.0,550.0]
  pressure: [[5000,5000,5000]]
  lattice: Te.data
  npt: False
  reference_phase: liquid
  pair_style: pace
  pair_coeff: '* * /work/home/zhanggweitest/yxzhou/ML_potentials/te-upfit-iter0.yace  Te'
  timestep: 0.001
  melting_cycle: False
  n_equilibration_steps: 50000
  n_switching_steps: 50000
  n_iterations: 1
  equilibration_control: nose-hoover

  nose_hoover:
    thermostat_damping: 0.1

  tolerance:
    solid_fraction: 0

  queue:
    scheduler: slurm
    jobname: calphy-test
    nodes: 1
    cores: 64
    queuename: xahcnormal

    commands:
      - source ~/.bashrc
      - conda activate calphy

Any idea of what is going wrong?

srmnitc commented 7 months ago

@yuxzhou Thanks for reporting the issue, could you please tell me the version of calphy, lammps, and pylammpsmpi that you are using?

yuxzhou commented 7 months ago

Thanks for the quick reply @srmnitc. I followed the installation instructions from the website: (1) git clone https://github.com/ICAMS/calphy.git; (2) cd calphy and conda env create -f environment.yml; (3) conda activate calphy; and (4) python setup.py install.

I also double-checked the versions: LAMMPS (21 Nov 2023 release), pylammpsmpi (0.2.3), and calphy (1.2.16).
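For anyone reporting a similar problem, the versions asked for above can be collected in one go. The following is a minimal sketch, not part of calphy; the distribution names in `PACKAGES` are assumptions about how the packages are registered in the conda environment, and a package that cannot be found is reported as missing rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names are assumptions; adjust them if your environment
# registers the packages under different names.
PACKAGES = ("calphy", "pylammpsmpi", "mpi4py", "lammps")


def installed_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the activated environment (after conda activate calphy) so that it reports the versions the job actually uses.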

srmnitc commented 7 months ago

I have now spotted an issue with the version of pylammpsmpi; could you please update with conda install -c conda-forge pylammpsmpi=0.2.13 and try again? I will now update the env file. A new conda release is already being worked on.

yuxzhou commented 7 months ago

Thanks for the help! Updating pylammpsmpi to 0.2.13 does wake up LAMMPS. However, there is still no output from LAMMPS, and I received another error in the *err file (while the job was still running).

No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           nid001169
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
[nid001169:239896] 1023 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[nid001169:239896] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I tried everything on another cluster and got an MPI error again:

[c03r4n35:31497] OPAL ERROR: Not initialized in file ext3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[c03r4n35:31497] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[c03r4n35:31500] OPAL ERROR: Not initialized in file ext3x_client.c at line 112
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[nid001169:239896] 1023 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[nid001169:239896] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Any clue how to solve it?

Thank you!

srmnitc commented 7 months ago

Thanks again! What do you get in the calphy.log file?

yuxzhou commented 7 months ago

Sorry for the late response. This is what calphy.log looks like in one of my calphy calculations.

2024-02-26 20:45:24,878 calphy.helpers INFO     ---------------input file----------------
2024-02-26 20:45:24,879 calphy.helpers INFO     commented out as causes crash when we're expanding the T range after a fail run
2024-02-26 20:45:24,879 calphy.helpers INFO     ------------end of input file------------
2024-02-26 20:45:24,879 calphy.helpers INFO     Temperature start: 900.000000 K, temperature stop: 500.000000 K, pressure: 0.000000 bar
2024-02-26 20:45:24,879 calphy.helpers INFO     Pressure adjusted in iso
2024-02-26 20:45:24,879 calphy.helpers INFO     Reference phase is liquid
2024-02-26 20:45:24,879 calphy.helpers INFO     Melting cycle is turned off
2024-02-26 20:45:24,879 calphy.helpers INFO     Equilibration stage is done using nose-hoover barostat/thermostat
2024-02-26 20:45:24,879 calphy.helpers INFO     Nose-Hoover thermostat damping is 0.100000
2024-02-26 20:45:24,879 calphy.helpers INFO     Nose-Hoover barostat damping is 0.100000
2024-02-26 20:45:24,879 calphy.helpers INFO     These values can be tuned by adding in the input file:
2024-02-26 20:45:24,879 calphy.helpers INFO     nose_hoover:
2024-02-26 20:45:24,879 calphy.helpers INFO        thermostat_damping: <float>
2024-02-26 20:45:24,879 calphy.helpers INFO        barostat_damping: <float>
2024-02-26 20:45:24,879 calphy.helpers INFO     Integration stage is done using Nose-Hoover thermostat and barostat when needed
2024-02-26 20:45:24,879 calphy.helpers INFO     Thermostat damping is 0.100000
2024-02-26 20:45:24,880 calphy.helpers INFO     Barostat damping is 0.100000
2024-02-26 20:45:24,880 calphy.helpers INFO     4536 atoms in 1 cells on 128 cores
2024-02-26 20:45:24,880 calphy.helpers INFO     pair_style: pace
2024-02-26 20:45:24,880 calphy.helpers INFO     pair_coeff: * * /work/e846/e846/yx_zhou/ML_potentials/Te-ACE/te-upfit-iter0.yace Te

It seems that LAMMPS was launched but not correctly (i.e., it died for some reason)?

srmnitc commented 7 months ago

That seems to be the case. Could you please check the version of mpi4py in the environment? Thanks again!

yuxzhou commented 7 months ago

Sure! The version of mpi4py is 3.1.4 in my conda environment
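Since the srun message above complains about how Open MPI was built, it may also help to see which MPI library mpi4py itself is linked against. This is a rough sketch under the assumption that mpi4py is importable; it turns off automatic MPI initialization so that only the library version string is queried, and it returns None when mpi4py cannot be imported.

```python
def mpi_library_info():
    """Return the MPI library version string seen by mpi4py, or None if unavailable."""
    try:
        import mpi4py

        # Skip automatic MPI_Init on import; the library version can be
        # queried without initializing MPI.
        mpi4py.rc.initialize = False
        from mpi4py import MPI
    except ImportError:
        return None
    return MPI.Get_library_version()


if __name__ == "__main__":
    info = mpi_library_info()
    print(info if info is not None else "mpi4py could not be imported")
```

If the reported library differs from the MPI that the cluster's scheduler expects, that mismatch is one possible explanation for the launch failure, though the sketch alone cannot confirm it.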

srmnitc commented 4 months ago

I can't seem to reproduce this on the LAMMPS side. Could you please run a LAMMPS calculation directly through the library interface and see if that works?
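One way such a direct check could look is sketched below, assuming the `lammps` Python module is installed in the same environment. The helper names `melt_commands` and `run_direct` are hypothetical, and the input is a small Lennard-Jones system rather than the pace potential from this issue, so it only tests whether LAMMPS starts and runs through the library interface at all.

```python
def melt_commands(steps=100):
    """Return LAMMPS commands for a small Lennard-Jones test system."""
    return [
        "units lj",
        "atom_style atomic",
        "lattice fcc 0.8442",
        "region box block 0 4 0 4 0 4",
        "create_box 1 box",
        "create_atoms 1 box",
        "mass 1 1.0",
        "velocity all create 1.44 87287 loop geom",
        "pair_style lj/cut 2.5",
        "pair_coeff 1 1 1.0 1.0 2.5",
        "fix 1 all nve",
        f"run {steps}",
    ]


def run_direct(commands):
    """Run the given commands through the LAMMPS library interface."""
    # Imported here so the command list can be inspected without LAMMPS installed.
    from lammps import lammps

    lmp = lammps()
    try:
        for cmd in commands:
            lmp.command(cmd)
    finally:
        lmp.close()


if __name__ == "__main__":
    run_direct(melt_commands())
```

If this runs in serial, repeating it with the pair_style and pair_coeff lines from the calphy input would narrow down whether the problem lies in the potential or in the MPI launch.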

srmnitc commented 3 weeks ago

Closing due to inactivity, please feel free to reopen if needed.