SCALE-MS / scale-ms

SCALE-MS design and development

Parallel LAMMPS execution #160

eirrgang opened this issue 3 years ago

eirrgang commented 3 years ago

This issue is for tracking the progress on managing accelerated LAMMPS tasks.

Once #158 is merged, we should begin to iteratively move LAMMPS tasks under RP and then SCALE-MS management.

Questions

How does LAMMPS determine the available hardware resources?

Is there any difference when launched from Python?

Note that RP Raptor tasks are currently confined to a single node. (Is there an issue to track this?) When an arbitrarily large node (or a fraction thereof) is available, how can and should it be allocated to LAMMPS? Should we run more than one MPI rank? Would we set OMP_NUM_THREADS, some other environment variable, a command-line option, or a Python-level option?
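
For concreteness, one knob that is available at the Python level is the cmdargs parameter of the lammps Python class, which forwards the same command-line switches that the lmp executable accepts. A minimal sketch, assuming a LAMMPS build with the OPENMP package, an importable lammps Python module, and an input script named in.file (placeholder names, not project conventions):

import os
from lammps import lammps

# Illustrative choice: 2 OpenMP threads per MPI rank.
os.environ["OMP_NUM_THREADS"] = "2"

# "-sf omp" applies the /omp style suffix; "-pk omp 2" sets the package thread count.
lmp = lammps(cmdargs=["-sf", "omp", "-pk", "omp", "2", "-log", "none"])
lmp.file("in.file")
lmp.close()

Whether the thread count should come from the environment, from these switches, or from a higher-level (RP/SCALE-MS) option is exactly the open question here.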

Can we interact with the resource allocation at all from within Python? It seems like LAMMPS assumes MPI_Init is managed externally when run in Python... is this right? Is this why we use mpi4py? Or can we go further and use mpi4py to define the communicator that LAMMPS will use?

How does LAMMPS determine whether to use MPI at run time? I.e., when should we expect LAMMPS to call MPI_Init() and make MPI calls, and when not? When should we expect it to grab MPI_COMM_WORLD rather than some other communicator? If we initialize MPI in the Python interpreter but don't want LAMMPS to use it, how would we handle that?
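
On the last two questions: the lammps Python constructor also accepts a comm argument, so an mpi4py communicator can be handed to LAMMPS explicitly instead of letting it fall back to MPI_COMM_WORLD. A sketch under those assumptions (the split into two sub-communicators is purely illustrative; importing mpi4py.MPI normally initializes MPI in the interpreter):

from lammps import lammps
from mpi4py import MPI

world = MPI.COMM_WORLD

# Illustrative split: half the ranks run one simulation, half run another.
color = world.Get_rank() % 2
sub = world.Split(color, key=world.Get_rank())

lmp = lammps(comm=sub)              # LAMMPS parallelizes only over this sub-communicator
# lmp = lammps(comm=MPI.COMM_SELF)  # or keep this instance effectively serial
lmp.file("in.file")
lmp.close()

How (or whether) this interacts with RP Raptor's placement of tasks is part of what this issue needs to settle.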

Jianming-C commented 3 years ago

I used the following commands in Python to check the available hardware resources and run parallel LAMMPS:

import multiprocessing

# run_lmp launches one LAMMPS job for its task index; n is the number of tasks.
tasks = range(n)
cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)

pool.map(run_lmp, tasks)

In my experience, we need to determine the hardware resources in the LAMMPS input file. In a nutshell, I chose the number of cores per LAMMPS task in the input file, and then Python would detect the available resources and assign the tasks.

eirrgang commented 3 years ago

I used the following commands in Python to check the available hardware resources and run parallel LAMMPS:

tasks = range(n)
cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)

pool.map(run_lmp, tasks)

This example looks like it runs each LAMMPS task on just one CPU. What am I missing?

In my experience, we need to determine the hardware resources in the LAMMPS input file.

What would that look like?

In a nutshell, I chose the number of cores per LAMMPS task in the input file, and then Python would detect the available resources and assign the tasks.

Are your "cores per LAMMPS task" used as one thread-per-core or one MPI-rank-per-core?

Jianming-C commented 3 years ago

I need a correction here... The available resources are determined in the submission script for Penn State ACI, not in the LAMMPS input file. For Penn State ACI, it looks like:

#PBS -l nodes=1:ppn=10:stmem    # 1 CPU, 10 cores/CPU

This example looks like it runs each LAMMPS task on just one CPU. What am I missing?

The example is running each LAMMPS task on one core.

Are your "cores per LAMMPS task" used as one thread-per-core or one MPI-rank-per-core?

The command line for using multiple cores in LAMMPS is:

mpirun -np 4 lmp_mpi -in in.file

This example is using 4 cores for the 'in.file' task.
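
To connect that back to the Python-driven workflow above, a hedged sketch of launching a few such multi-core tasks from Python with subprocess instead of multiprocessing.Pool, assuming the same lmp_mpi binary and made-up per-task input script names:

import subprocess

RANKS_PER_TASK = 4          # matches the "-np 4" example above
inputs = ["in.0", "in.1"]   # hypothetical per-task input scripts

procs = []
for inp in inputs:
    # Each task gets its own mpirun invocation with RANKS_PER_TASK MPI ranks.
    cmd = ["mpirun", "-np", str(RANKS_PER_TASK), "lmp_mpi", "-in", inp, "-log", "log." + inp]
    procs.append(subprocess.Popen(cmd))

# Wait for all tasks; a real workflow layer would also track resources and failures.
for p in procs:
    p.wait()

This still leaves the accounting (how many concurrent mpirun calls fit in the allocation) to the batch system or to a workflow layer such as RP.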

eirrgang commented 3 years ago

I need a correction here... The available resources are determined in the submission script for Penn State ACI, not in the LAMMPS input file. For Penn State ACI, it looks like:

#PBS -l nodes=1:ppn=10:stmem    # 1 CPU, 10 cores/CPU

You mean "1 node, 10 processors per node," right? And it sounds like you are running one process per core, with one single-threaded LAMMPS simulation per process.

This example looks like it runs each LAMMPS task on just one CPU. What am I missing?

The example is running each LAMMPS task on one core.

Are your "cores per LAMMPS task" used as one thread-per-core or one MPI-rank-per-core?

The command line for using multiple cores in LAMMPS is:

mpirun -np 4 lmp_mpi -in in.file

This example is using 4 cores for the 'in.file' task.

Right.

Sorry if it wasn't clear, but I was specifically interested in discussing use cases of multiple cores per simulation task. (Or multiple simulation processes communicating at run time.)

It sounds like the work at Penn State may not be using any parallelization at the simulation level. Is that a fair assessment? Or is there any run-time communication between simulations?