firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)
https://firedrakeproject.org

struct.error when calling project in parallel #228

Closed ctjacobs closed 10 years ago

ctjacobs commented 10 years ago

When running Firedrake in parallel on one 8-core node on CX1, I've had this error sporadically appear when project is called:

[ctj10@login-3 test]$ cat linc_z.e6930962 
[3] pyop2:WARNING *** Projecting output function to CG1
[7] pyop2:WARNING *** Projecting output function to CG1
[4] pyop2:WARNING *** Projecting output function to CG1
[1] pyop2:WARNING *** Projecting output function to CG1
[0] pyop2:WARNING *** Projecting output function to CG1
[6] pyop2:WARNING *** Projecting output function to CG1
[2] pyop2:WARNING *** Projecting output function to CG1
Traceback (most recent call last):
  File "/tmp/pbs.6930962.cx1b/test.py", line 11, in <module>
    File("test.pvd") << f
  File "/home/op2-devel/firedrake/master/firedrake/io.py", line 98, in __lshift__
    self._file << data
  File "/home/op2-devel/firedrake/master/firedrake/io.py", line 208, in __lshift__
    output = project(function, Vo, name=function.name())
  File "/home/op2-devel/firedrake/master/firedrake/projection.py", line 78, in project
    form_compiler_parameters=form_compiler_parameters)
  File "/home/op2-devel/firedrake/master/firedrake/solving.py", line 916, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/op2-devel/firedrake/master/firedrake/solving.py", line 945, in _solve_varproblem
    nullspace=nullspace)
  File "/home/op2-devel/firedrake/master/firedrake/solving.py", line 333, in __init__
    super(LinearVariationalSolver, self).__init__(*args, **kwargs)
  File "/home/op2-devel/firedrake/master/firedrake/solving.py", line 111, in __init__
    self._jac_tensor = assemble(self._problem.J_ufl, bcs=self._problem.bcs)
  File "/home/op2-devel/firedrake/master/firedrake/solving.py", line 364, in assemble
    return _assemble(f, tensor=tensor, bcs=_extract_bcs(bcs))
  File "/home/op2-devel/firedrake/master/firedrake/solving.py", line 383, in _assemble
    kernels = ffc_interface.compile_form(f, "form")
  File "/home/op2-devel/pyop2/master/pyop2/ffc_interface.py", line 124, in compile_form
[5] pyop2:WARNING *** Projecting output function to CG1
    return FFCKernel(form, name).kernels
  File "/home/op2-devel/pyop2/master/pyop2/caching.py", line 59, in __new__
    return cls._cache_lookup(key)
  File "/home/op2-devel/pyop2/master/pyop2/caching.py", line 121, in _cache_lookup
    return cls._cache.get(key) or cls._read_from_disk(key)
  File "/home/op2-devel/pyop2/master/pyop2/caching.py", line 128, in _read_from_disk
    val = cPickle.load(f)
  File "/apps/python/2.7.3/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/apps/python/2.7.3/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/apps/python/2.7.3/lib/python2.7/gzip.py", line 338, in _read_eof
    crc32 = read32(self.fileobj)
  File "/apps/python/2.7.3/lib/python2.7/gzip.py", line 25, in read32
    return struct.unpack("<I", input.read(4))[0]
struct.error: unpack requires a string argument of length 4
[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
ctjacobs commented 10 years ago

This has also happened at lines such as projected = project(self.solution_old.split()[dim], self.W.sub(dimension)).

kynan commented 10 years ago

I suspect this is a race condition calling FFC to compile the project kernel, which we're currently not protecting against. The solution is to do what we're currently doing for the PyOP2 Compiler. I'll push a fix for that on the split-forms branch.
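
For context, the kind of protection meant here is that a single rank compiles and publishes the cache entry while the others wait before reading; a minimal sketch with mpi4py (illustrative names, not the actual PyOP2 API):

import os
import tempfile
from mpi4py import MPI


def cached_compile(key, compile_fn, cachedir, comm=MPI.COMM_WORLD):
    # Rank 0 compiles and publishes the cache entry; everyone else waits
    # at the barrier, so no rank can read a partially written file.
    path = os.path.join(cachedir, key + ".pickle.gz")
    if comm.rank == 0 and not os.path.exists(path):
        blob = compile_fn()  # bytes, e.g. a gzipped pickle of the FFC kernel
        # Write to a temporary file and rename into place: rename is
        # atomic on POSIX, so readers never see a truncated cache entry.
        fd, tmp = tempfile.mkstemp(dir=cachedir)
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
        os.rename(tmp, path)
    comm.Barrier()
    with open(path, "rb") as f:
        return f.read()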

ctjacobs commented 10 years ago

Did this fix ever go in? I'm still getting gzip-related errors in parallel:

Traceback (most recent call last):
  File "../../models/shallow_water.py", line 535, in <module>
    sw.run()
  File "../../models/shallow_water.py", line 486, in run
    'ksp_rtol': 1.0e-7})
  File "/home/christian/firedrake/firedrake/solving.py", line 865, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/christian/firedrake/firedrake/solving.py", line 907, in _solve_varproblem
    nullspace=nullspace)
  File "/home/christian/firedrake/firedrake/solving.py", line 112, in __init__
    self._jac_tensor = assemble(self._problem.J_ufl, bcs=self._problem.bcs)
  File "/home/christian/firedrake/firedrake/solving.py", line 372, in assemble
    return _assemble(f, tensor=tensor, bcs=_extract_bcs(bcs))
  File "/home/christian/firedrake/firedrake/solving.py", line 398, in _assemble
    kernels = compile_form(f, "form")
  File "/home/christian/firedrake/firedrake/ffc_interface.py", line 175, in compile_form
    kernel, = FFCKernel(form, name + str(i) + str(j)).kernels
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 174, in __new__
    return cls._cache_lookup(key)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 236, in _cache_lookup
    return cls._cache.get(key) or cls._read_from_disk(key)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 243, in _read_from_disk
    val = cPickle.load(f)
  File "/usr/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 291, in _read
    self._read_gzip_header()
  File "/usr/lib/python2.7/gzip.py", line 185, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
: Not a gzipped file
Traceback (most recent call last):
  File "../../models/shallow_water.py", line 535, in <module>
    sw.run()
  File "../../models/shallow_water.py", line 486, in run
    'ksp_rtol': 1.0e-7})
  File "/home/christian/firedrake/firedrake/solving.py", line 865, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/christian/firedrake/firedrake/solving.py", line 907, in _solve_varproblem
    nullspace=nullspace)
  File "/home/christian/firedrake/firedrake/solving.py", line 112, in __init__
    self._jac_tensor = assemble(self._problem.J_ufl, bcs=self._problem.bcs)
  File "/home/christian/firedrake/firedrake/solving.py", line 372, in assemble
    return _assemble(f, tensor=tensor, bcs=_extract_bcs(bcs))
  File "/home/christian/firedrake/firedrake/solving.py", line 398, in _assemble
    kernels = compile_form(f, "form")
  File "/home/christian/firedrake/firedrake/ffc_interface.py", line 175, in compile_form
    kernel, = FFCKernel(form, name + str(i) + str(j)).kernels
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 174, in __new__
    return cls._cache_lookup(key)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 236, in _cache_lookup
    return cls._cache.get(key) or cls._read_from_disk(key)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 243, in _read_from_disk
    val = cPickle.load(f)
  File "/usr/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 291, in _read
    self._read_gzip_header()
  File "/usr/lib/python2.7/gzip.py", line 185, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 24875 on
node elevate exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[elevate:24874] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[elevate:24874] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
wence- commented 10 years ago

This has not been fixed. I'll have a think about how to do it.

wence- commented 10 years ago

Hopefully fixed in current PyOP2/Firedrake master.

ctjacobs commented 10 years ago

The previous errors no longer occur, but I'm now getting another cache-related error when running my shallow water model in parallel. Unfortunately I can't seem to reproduce this with the demos or a simple example. Sometimes the processes will just hang at 100% CPU usage after printing "Compiler stage 3 finished", while other times I get:

FFC finished in 0.126655 seconds.
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
[0] pyop2:INFO   Compiling wrapper...
[0] pyop2:INFO   Compiling wrapper...done
Traceback (most recent call last):
  File "../../../models/shallow_water.py", line 538, in <module>
    sw.run()
  File "../../../models/shallow_water.py", line 489, in run
    'snes_type': 'ksponly'})
  File "/home/christian/firedrake/firedrake/solving.py", line 865, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/christian/firedrake/firedrake/solving.py", line 909, in _solve_varproblem
    solver.solve()
  File "/home/christian/firedrake/firedrake/solving.py", line 268, in solve
    self.snes.solve(None, v)
  File "SNES.pyx", line 413, in petsc4py.PETSc.SNES.solve (src/petsc4py.PETSc.c:143714)
  File "petscsnes.pxi", line 225, in petsc4py.PETSc.SNES_Function (src/petsc4py.PETSc.c:29553)
  File "/home/christian/firedrake/firedrake/solving.py", line 193, in form_function
    with self._F_tensor.dat.vec_ro as v:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/petsc_base.py", line 172, in vecscatter
    with acc(d) as v:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/petsc_base.py", line 89, in vec_context
    self._force_evaluation()
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 1482, in _force_evaluation
    _trace.evaluate(reads, writes)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 150, in evaluate
    comp._run()
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 3456, in _run
    return self.compute()
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 3463, in compute
    self._compute(self.it_space.iterset.core_part)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/sequential.py", line 148, in _compute
[0] pyop2:INFO   Compiling wrapper...
    fun(*self._jit_args, argtypes=self._argtypes, restype=None)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/host.py", line 657, in __call__
    return self.compile(argtypes, restype)(*args)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/host.py", line 732, in compile
    restype=restype)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/compilation.py", line 200, in load
    dll = compiler.get_so(src)
  File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/compilation.py", line 143, in get_so
    return ctypes.CDLL(soname)
  File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /tmp/pyop2-cache-uid1000/93881e8b5cbe9c94f6383583bf4662c2.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 5207 on
node elevate exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------

even after deleting /tmp/pyop2-cache-uid1000 and /tmp/firedrake-* and trying again.

wence- commented 10 years ago

Is this on a system where not all processes can see the same temp filesystem?

ctjacobs commented 10 years ago

No, I think all processes can see the same /tmp - this is on my laptop running with mpiexec -n 2 on a dual-core processor.

wence- commented 10 years ago

aha, what do I need to run to attempt to reproduce?

ctjacobs commented 10 years ago

The error sporadically occurs when running the dvs_channel simulation in the firedrake-fluids branch: mpiexec -n 2 python models/shallow_water.py tests/dvs_channel/dvs_channel.swml

wence- commented 10 years ago

When I try running in parallel I get a ValueError: total size of new array must be unchanged error

wence- commented 10 years ago

when building the Mesh.

ctjacobs commented 10 years ago

That's issue #222, which was fixed by Michael (but the fix is currently in petsc-next, not petsc-master).

wence- commented 10 years ago

Aha, ok, I'll have a go tomorrow then when I've rebuilt petsc.

ctjacobs commented 10 years ago

You might also encounter issue #236 - I have been using Florian's short-term fix in firedrake/ffc_interface.py.

ctjacobs commented 10 years ago

Also, I've locally reverted this commit in UFL to work around issue #237.

ctjacobs commented 10 years ago

This is unfortunately still an issue. Any progress on this?

wence- commented 10 years ago

OK, my crystal ball suggests that the problem is due to boundary ids not being the same on all processes (and you therefore have a hang when compiling code). In particular the error

OSError: /tmp/pyop2-cache-uid1000/93881e8b5cbe9c94f6383583bf4662c2.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

is symptomatic of this. Rank 1 (which doesn't do code compilation) tried to open a shared library which wasn't there. This means that rank 0 didn't compile it.

Note that if you run in PyOP2 debug mode (export PYOP2_DEBUG=1), we try to catch this by checking that the generated code is the same on all processes before going into compilation.
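
Conceptually that debug check amounts to something like the following (a sketch with mpi4py, not the actual PyOP2 code); comparing digests rather than the full source keeps the allgather small, though it is still a collective operation:

import hashlib
from mpi4py import MPI


def assert_code_agrees(code, comm=MPI.COMM_WORLD):
    # Every rank hashes its generated code; if the digests differ, the
    # ranks are about to compile (or look up) different kernels.
    digest = hashlib.md5(code.encode()).hexdigest()
    digests = comm.allgather(digest)
    if len(set(digests)) != 1:
        raise RuntimeError("Generated code differs between ranks: %s" % digests)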

The other symptom (hanging at 100% CPU) suggests that process 0 got there first, compiled code for its kernel, but the two processes are no longer running in data parallel mode (because they didn't run down the same code path). So probably one of them is spinning in MPI_Waitall (waiting for halo exchanges) while the other is sitting at an MPI_Barrier.

With my fix branch, I can't reproduce this, so can you try that please?

ctjacobs commented 10 years ago

Thanks. The changes in the fix/parallel_ext_facets branch have fixed this issue.

wence- commented 10 years ago

#302 has landed, so closing.

kynan commented 10 years ago

FWIW, I have also sporadically seen the

cannot open shared object file: No such file or directory

error when running the Navier-Stokes benchmark in parallel on a single cx1 node even after #302 was merged.

kynan commented 10 years ago

It turns out the failure is not sporadic, but systematic and always happens for 4 and 12 processes. That suggests it somehow relates to the way the mesh is partitioned. Running with PYOP2_DEBUG confirms that the code differs on different ranks.

Should this check always be active? It's pretty expensive since it involves an allgather.

Regardless, we should probably also dump the generated code, since otherwise debugging such a failure on a cluster backend node is near impossible when the cache is cleared afterwards. In fact, if the exception is raised there is hardly any way to debug; without it you at least get the path to the expected .so and can inspect the generated code, if you (still) have access to the cache.
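
Dumping could be as simple as each rank writing its code to a per-rank file whose path is reported with the error, so the source survives even if the cache is cleared afterwards (an illustrative sketch, not the PyOP2 implementation):

import os
from mpi4py import MPI


def dump_generated_code(code, dumpdir, comm=MPI.COMM_WORLD):
    # One file per rank; return the path so it can be included in the
    # error message and inspected after the job has finished.
    try:
        os.makedirs(dumpdir)
    except OSError:
        pass  # directory already exists (perhaps created by another rank)
    path = os.path.join(dumpdir, "generated_code_rank%d.c" % comm.rank)
    with open(path, "w") as f:
        f.write(code)
    return path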

wence- commented 10 years ago

So I suspect this will be something to do with the BC application, but I can't immediately see anything in the code. Can you reproduce this somewhere where debugging is more feasible? The best thing to do is to run with debug enabled and drop into the Python debugger when the exception occurs; you can then see which par_loop call actually caused the problem. However, I agree it's probably useful to dump the generated code and point to it when there is a problem. I do not think the check should always be active.

kynan commented 10 years ago

I couldn't reproduce the failure on foraker, and now I can't seem to reproduce it on cx1 either.

I was, however, able to track down the par loop and dump the generated code on different ranks. It was indeed a BC application:

[bc.apply(A2, b2) for bc in bcp]

and specifically the function assign.

The code differed only on rank 0, where the function was set to 0, whereas on all other ranks another function was assigned (in this case the Constant p_in).

wence- commented 10 years ago

So that code, AIUI, doesn't actually do any par_loops in the case that A2 is a matrix and b2 a Function (it just adds a bc to the matrix). Oh, hold on. The Matrix object stores the bcs as a set, which doesn't have a guaranteed iteration order. So I bet what happens is that on rank zero you iterated through the bcs as [inflow, outflow] whereas on the other ranks you happened to do [outflow, inflow]. So I think the thing to do is to maintain an ordered list of unique bcs when calling add_bc on the matrix.

It's possible you can't reproduce now because the cache is populated.
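
For illustration, an order-preserving, de-duplicating container is enough to make every rank visit the bcs in the same order (a hypothetical sketch, not the actual Matrix API):

class OrderedBCs(object):
    """Boundary conditions kept unique but in insertion order, so the
    assembly code iterates over them identically on every rank."""

    def __init__(self):
        self._bcs = []

    def add(self, bc):
        # Preserve first-insertion order; ignore duplicates.
        if bc not in self._bcs:
            self._bcs.append(bc)

    def __iter__(self):
        return iter(self._bcs)

    def __len__(self):
        return len(self._bcs)

add_bc could then append to a container like this instead of inserting into a set, making the iteration order deterministic across ranks.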

kynan commented 10 years ago

Yes, I'm pretty sure you're right about set traversal order. I'll file a new issue since that needs fixing.