NanoComp / meep

free finite-difference time-domain (FDTD) software for electromagnetic simulations
GNU General Public License v2.0

Failed to implement "epsilon_input_file" while using Openmpi #1001

Open vinneko opened 5 years ago

vinneko commented 5 years ago

Hello,

We need to use "epsilon_input_file" to import an HDF5 file describing the material. The code works fine when we run it on a single node, but we could not make it work on the cluster using Open MPI.

Here, I copy the error messages:

   [proxy:0:0@laser025] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
   [proxy:0:0@laser025] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:0@laser025] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
   [proxy:0:2@laser033] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
   [proxy:0:2@laser033] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:2@laser033] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
   [proxy:0:3@laser064] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
   [proxy:0:3@laser064] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:3@laser064] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
   [mpiexec@laser025] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
   [mpiexec@laser025] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
   [mpiexec@laser025] HYD_pmci_wait_for_completion (pm/pmiserv/pmiservpmci.c:218): launcher returned error waiting for completion
   [mpiexec@laser025] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

"laser025", "laser064", "laser033" are names of the nodes.

Please let us know if there are any obvious mistakes or possible solutions to this problem. Thank you so much for your help!

oskooi commented 5 years ago

Note that if you are building from source, the make check suite includes an explicit test for epsilon_input_file (see https://github.com/NanoComp/meep/blob/master/python/tests/simulation.py#L217-L235). You can verify this yourself by running it manually e.g. mpirun -n 2 python meep/python/tests/simulation.py.

stevengj commented 5 years ago

The code works fine when we run it on a single node, but we could not make it work on the cluster using Openmpi.

Note that for this to work, the file must be readable from every node in the cluster (i.e. be on some cluster-wide filesystem).
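
If it helps, here is a minimal sanity check (a sketch, not part of Meep) that tries to open the HDF5 file from every MPI rank and reports the hostname. It assumes mpi4py and h5py are installed, the filename below is only a placeholder for whatever you pass to epsilon_input_file, and it should be launched with the same mpirun command you use for the Meep job:

import socket
import h5py
from mpi4py import MPI

fname = "my_epsilon.h5"  # placeholder; substitute your epsilon_input_file

rank = MPI.COMM_WORLD.Get_rank()
host = socket.gethostname()

try:
    # open read-only on every rank and list the datasets it contains
    with h5py.File(fname, "r") as f:
        print("rank {} on {}: OK, datasets = {}".format(rank, host, list(f.keys())))
except OSError as err:
    print("rank {} on {}: FAILED to open {}: {}".format(rank, host, fname, err))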

vinneko commented 5 years ago

Thank you so much for all the comments. We will contact the cluster manager and discuss this right away.

vinneko commented 5 years ago

Hello,

We still couldn't run the simulation on the cluster. The cluster manager has assured us that the input file is indeed readable from every node, so we don't think that is the issue here.

We have also tried the simulation test suggested by oskooi, and it failed with the same error message.

Aside from the error message, we also noticed the following in the output:

   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
   PID 7030 RUNNING AT laser054
   EXIT CODE: 139
   CLEANING UP REMAINING PROCESSES
   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

   YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
   This typically refers to a problem with your application.

This seems to indicate that there might be something wrong with the code itself.

We have no clue what could be the issue here, so any help would be appreciated.

Thank you so much.

oskooi commented 5 years ago

I have seen this message several times before. I suspect that your Meep installation is fine but the simulation is crashing because of a memory overload (i.e., your simulation is attempting to consume more memory than is available on your cluster).

There are two possible workarounds:

  1. increase the number of nodes in the cluster (which will increase the amount of distributed memory available); also, be sure to actually use them all when launching the simulation
  2. reduce the size of your simulation (e.g., lower the resolution, shrink the cell_size, etc.); see the rough memory sketch below
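
For a rough sense of whether a given run fits in memory, here is a back-of-envelope sketch (an approximation only, a rule of thumb rather than an exact formula; actual usage depends on the materials, PML, and how Meep chunks the cell). It simply counts grid points and assumes on the order of a dozen double-precision arrays per point:

# rough, approximate memory estimate for a 3D run (a sketch, not an exact formula)
resolution = 10
cell_size = (5, 5, 5)           # same lengths passed to mp.Vector3(...) as cell_size

grid_points = 1
for s in cell_size:
    grid_points *= s * resolution

arrays = 12                     # assumed: roughly a dozen double-precision field/material arrays
estimated_gb = grid_points * arrays * 8 / 1e9
print("grid points: {:.3e}, estimated memory: {:.2f} GB".format(grid_points, estimated_gb))

If the estimate approaches the total memory of the nodes you request, workaround 1 or 2 above is probably needed.
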
vinneko commented 5 years ago

Hello,

After a long time of struggling, we finally found the main issue.

Somehow the procedure fails when we run "mpirun -n Number python my_meep_code.py" with "Number" bigger than 7. In other words, the procedure fails when we run the code with more than 7 cores. However, we don't have this issue if the Python code doesn't use "epsilon_input_file".

Is there any intrinsic problem when parallelizing an input epsilon file with too many cores?

Right now we run the job on two nodes but with only 3 cores on each, and it seems to work out fine. We will let you know if we encounter other issues.

Thank you for your help!

oskooi commented 5 years ago

epsilon_input_file should work with any number of chunks. Here is a simple test for debugging, which consists of two separate runs: (1) outputting the geometry containing a single Sphere object to an HDF5 file via output_epsilon, and (2) loading it back in via epsilon_input_file. The error in the Hy field between the two runs, at an arbitrary point in the cell after an arbitrary run time, is printed at the end and should be constant, independent of the number of chunks/processes. (The field values from the two runs will not be identical due to discretization effects in the geometry.)

(note: script filename is e.g. sphere_eps_h5.py)

import meep as mp

resolution = 10

cell_size = mp.Vector3(5,5,5)

geometry = [mp.Sphere(radius=1.1,
                      center=mp.Vector3(),
                      material=mp.Medium(index=3.5))]

sources = [mp.Source(mp.GaussianSource(1.0,fwidth=0.1),
                     center=mp.Vector3(),
                     component=mp.Ex)]

sim = mp.Simulation(resolution=resolution,
                    cell_size=cell_size,
                    k_point=mp.Vector3(),
                    sources=sources,
                    geometry=geometry)

# probe Hy at an arbitrary point in the cell
def get_field(sim):
  return sim.get_field_point(mp.Hy, mp.Vector3(-1.75,0.48,0.52))

# run 1: write the epsilon grid to an HDF5 file at the start of the run
sim.run(mp.at_beginning(mp.output_epsilon),
        until=59.41)

hy0 = get_field(sim)

if mp.am_master():
  print("hy0:, {:.8f}".format(hy0.real))

sim.reset_meep()

# run 2: rebuild the simulation, loading the geometry from the HDF5 file written by run 1
sim = mp.Simulation(resolution=resolution,
                    cell_size=cell_size,
                    k_point=mp.Vector3(),
                    sources=sources,
                    epsilon_input_file="sphere_eps_h5-eps-000000.00.h5")

sim.run(until=59.41)

hy1 = get_field(sim)

if mp.am_master():
  print("hy1:, {:.6f}".format(hy1.real))
  print("field-error:, {:.6}".format(abs(hy1-hy0)/abs(hy0)))

test 1: chunks = 3

$ mpirun -n 3 python3 sphere_eps_h5.py
Using MPI version 3.1, 3 processes
-----------
Initializing structure...
Splitting into 3 chunks evenly
time for choose_chunkdivision = 0.000219704 s
Working in 3D dimensions.
Computational cell is 5 x 5 x 5 with resolution 10
     sphere, center = (0,0,0)
          radius 1.1
          dielectric constant epsilon diagonal = (12.25,12.25,12.25)
time for set_epsilon = 0.21753 s
-----------
creating output file "./sphere_eps_h5-eps-000000.00.h5"...
run 0 finished at t = 59.45 (1189 timesteps)
hy0:, 0.14338685
-----------
Initializing structure...
read in 51x51x51 epsilon-input-file "sphere_eps_h5-eps-000000.00.h5"
Splitting into 3 chunks evenly
time for choose_chunkdivision = 5.6979e-05 s
Working in 3D dimensions.
Computational cell is 5 x 5 x 5 with resolution 10
time for set_epsilon = 0.0694618 s
-----------
run 0 finished at t = 59.45 (1189 timesteps)
hy1:, -0.046553
field-error:, 1.32467

test 2: chunks = 7 (note: only splits into 6 chunks)

$ mpirun -n 7 python3 sphere_eps_h5.py
-----------
Initializing structure...
Splitting into 6 chunks evenly
time for choose_chunkdivision = 0.000634751 s
Working in 3D dimensions.
Computational cell is 5 x 5 x 5 with resolution 10
     sphere, center = (0,0,0)
          radius 1.1
          dielectric constant epsilon diagonal = (12.25,12.25,12.25)
time for set_epsilon = 0.484033 s
-----------
creating output file "./sphere_eps_h5-eps-000000.00.h5"...
run 0 finished at t = 59.45 (1189 timesteps)
hy0:, 0.14338685
-----------
Initializing structure...
read in 51x51x51 epsilon-input-file "sphere_eps_h5-eps-000000.00.h5"
Splitting into 6 chunks evenly
time for choose_chunkdivision = 0.000194014 s
Working in 3D dimensions.
Computational cell is 5 x 5 x 5 with resolution 10
time for set_epsilon = 0.0415664 s
-----------
run 0 finished at t = 59.45 (1189 timesteps)
hy1:, -0.046553
field-error:, 1.32467

test 3: chunks = 16

$  mpirun -n 16 python3 sphere_eps_h5.py
Using MPI version 3.1, 16 processes
-----------
Initializing structure...
Splitting into 16 chunks evenly
time for choose_chunkdivision = 0.00228932 s
Working in 3D dimensions.
Computational cell is 5 x 5 x 5 with resolution 10
     sphere, center = (0,0,0)
          radius 1.1
          dielectric constant epsilon diagonal = (12.25,12.25,12.25)
time for set_epsilon = 0.0703754 s
-----------
creating output file "./sphere_eps_h5-eps-000000.00.h5"...
Meep progress: 51.050000000000004/59.41 = 85.9% done in 4.0s, 0.7s to go
run 0 finished at t = 59.45 (1189 timesteps)
hy0:, 0.14338685
-----------
Initializing structure...
read in 51x51x51 epsilon-input-file "sphere_eps_h5-eps-000000.00.h5"
Splitting into 16 chunks evenly
time for choose_chunkdivision = 0.00103814 s
Working in 3D dimensions.
Computational cell is 5 x 5 x 5 with resolution 10
time for set_epsilon = 0.0648689 s
-----------
run 0 finished at t = 59.45 (1189 timesteps)
hy1:, -0.046553
field-error:, 1.32467

For all three test cases, the field values are unchanged, indicating that epsilon_input_file is being read properly.

vinneko commented 4 years ago

Hello, everyone,

Just want to update you on the progress we have made. We have tried the test code (sphere_eps_h5.py), and it ran with no problems. We even increased the cell size and the resolution to make the input file as big as the one we are interested in, and it still worked fine. We tried the test code with 2 nodes and 64 cores on each (128 cores in total), and the simulation finished with no errors.

As for our own code, we finally ran the simulation successfully when we assigned the job to 3 nodes with 2 cores on each (6 cores in total). In other words, we could run the code when we requested enough memory. However, when we assigned the job to 8 or more cores in total, for example 4 nodes with 2 cores on each, the job ended with the same error we posted at the very beginning.

We are still investigating possible bugs, but please let us know if you have any ideas or suggestions.

Thank you so much!

vinneko commented 4 years ago

Hello,

I would like to report some observations from our side regarding the "epsilon_input_file" option.

Recently, I tested the sphere_eps_h5.py script given above with different parameters. Here is what I did:

First of all, I requested the resources from the cluster by running

srun --nodes=2 --ntasks=64 --mem=100000 --pty bash

I wanted to use interactive mode to test different cases.

Secondly, I edited the sphere_eps_h5.py parameters, increasing the resolution and expanding the cell size:

resolution = 16
cell_size = mp.Vector3(32,32,32)

Then I ran the simulation:

mpirun -n 64 -ppn 32 python sphere_eps_h5.py

The simulation with these parameters finished properly.

However, when I changed the resolution to 32 and executed the same command again, the simulation ran for a while and then failed with this error message:

   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
   PID 219110 RUNNING AT csk020
   EXIT CODE: 11
   CLEANING UP REMAINING PROCESSES
   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

   [proxy:0:0@csk016.cluster] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
   [proxy:0:0@csk016.cluster] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:0@csk016.cluster] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
   srun: error: csk016: task 0: Exited with exit code 7
   [mpiexec@csk016.cluster] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
   [mpiexec@csk016.cluster] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
   [mpiexec@csk016.cluster] HYD_pmci_wait_for_completion (pm/pmiserv/pmiservpmci.c:218): launcher returned error waiting for completion
   [mpiexec@csk016.cluster] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

The only thing I changed was the resolution, which enlarges the simulation grid, so one might think this is a memory issue.

But how can that be? I requested about 100 GB for the job, and the sphere_eps_h5-eps-000000.00.h5 file is only 8 GB in this case.

Please let me know if there is anything I did wrong during this test.

Thank you so much!

stevengj commented 4 years ago

Some process is crashing for some reason, but it's hard to diagnose the problem until we can get a stack trace that identifies where the crash is occurring.

Here is various info on attaching debuggers to MPI processes — running inside a debugger (with Meep compiled with -g, i.e. --enable-debug) will give you the stack traces: https://stackoverflow.com/questions/329259/how-do-i-debug-an-mpi-program

(Always easier to debug if you can figure out a way to reproduce locally rather than on an AWS cluster.)
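
As a lightweight, hypothetical first step (not a substitute for a real gdb stack trace), Python's standard-library faulthandler module can dump the Python-level traceback of every process when a fatal signal such as a segfault arrives, which at least narrows down which Python call triggered the crash. A sketch, placed at the very top of the failing script (my_meep_code.py is the name used earlier in this thread):

import faulthandler
faulthandler.enable()   # dump Python tracebacks on SIGSEGV, SIGFPE, SIGABRT, SIGBUS

import meep as mp
# ... rest of the script unchanged ...

Equivalently, setting the environment variable PYTHONFAULTHANDLER=1 enables the same behavior without editing the script (assuming your mpirun forwards the environment to all ranks).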