This has also happened at lines such as projected = project(self.solution_old.split()[dim], self.W.sub(dimension)).
I suspect this is a race condition calling FFC to compile the project kernel, which we're currently not protecting against. The solution is to do what we're already doing for the PyOP2 Compiler. I'll push a fix for that on the split-forms branch.
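For reference, the protection used for the PyOP2 Compiler amounts to: only one rank performs the expensive compile and writes the cache file, and every rank waits at a barrier before reading it back. A minimal sketch of that idea, assuming mpi4py and a plain pickle-on-disk cache (the function names and cache path are illustrative, not the actual PyOP2 code):

import os
import pickle
from mpi4py import MPI

def cached_compile(key, compile_fn, cache_dir="/tmp/ffc-cache"):
    # Rank 0 compiles and writes the cache entry; everyone else waits for it.
    comm = MPI.COMM_WORLD
    path = os.path.join(cache_dir, key + ".pickle")
    if comm.rank == 0 and not os.path.exists(path):
        result = compile_fn()              # the expensive FFC call
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(result, f)
        os.rename(tmp, path)               # atomic: readers never see a partial file
    comm.barrier()                         # no rank reads before rank 0 has written
    with open(path, "rb") as f:
        return pickle.load(f)

Writing to a temporary file and renaming it also means no rank can ever pick up a half-written (and therefore unreadable) cache entry.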
Did this fix ever go in? I'm still getting gzip-related errors in parallel:
Traceback (most recent call last):
File "../../models/shallow_water.py", line 535, in <module>
sw.run()
File "../../models/shallow_water.py", line 486, in run
'ksp_rtol': 1.0e-7})
File "/home/christian/firedrake/firedrake/solving.py", line 865, in solve
_solve_varproblem(*args, **kwargs)
File "/home/christian/firedrake/firedrake/solving.py", line 907, in _solve_varproblem
nullspace=nullspace)
File "/home/christian/firedrake/firedrake/solving.py", line 112, in __init__
self._jac_tensor = assemble(self._problem.J_ufl, bcs=self._problem.bcs)
File "/home/christian/firedrake/firedrake/solving.py", line 372, in assemble
return _assemble(f, tensor=tensor, bcs=_extract_bcs(bcs))
File "/home/christian/firedrake/firedrake/solving.py", line 398, in _assemble
kernels = compile_form(f, "form")
File "/home/christian/firedrake/firedrake/ffc_interface.py", line 175, in compile_form
kernel, = FFCKernel(form, name + str(i) + str(j)).kernels
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 174, in __new__
return cls._cache_lookup(key)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 236, in _cache_lookup
return cls._cache.get(key) or cls._read_from_disk(key)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 243, in _read_from_disk
val = cPickle.load(f)
File "/usr/lib/python2.7/gzip.py", line 450, in readline
c = self.read(readsize)
File "/usr/lib/python2.7/gzip.py", line 256, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 291, in _read
self._read_gzip_header()
File "/usr/lib/python2.7/gzip.py", line 185, in _read_gzip_header
raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Traceback (most recent call last):
File "../../models/shallow_water.py", line 535, in <module>
sw.run()
File "../../models/shallow_water.py", line 486, in run
'ksp_rtol': 1.0e-7})
File "/home/christian/firedrake/firedrake/solving.py", line 865, in solve
_solve_varproblem(*args, **kwargs)
File "/home/christian/firedrake/firedrake/solving.py", line 907, in _solve_varproblem
nullspace=nullspace)
File "/home/christian/firedrake/firedrake/solving.py", line 112, in __init__
self._jac_tensor = assemble(self._problem.J_ufl, bcs=self._problem.bcs)
File "/home/christian/firedrake/firedrake/solving.py", line 372, in assemble
return _assemble(f, tensor=tensor, bcs=_extract_bcs(bcs))
File "/home/christian/firedrake/firedrake/solving.py", line 398, in _assemble
kernels = compile_form(f, "form")
File "/home/christian/firedrake/firedrake/ffc_interface.py", line 175, in compile_form
kernel, = FFCKernel(form, name + str(i) + str(j)).kernels
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 174, in __new__
return cls._cache_lookup(key)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 236, in _cache_lookup
return cls._cache.get(key) or cls._read_from_disk(key)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/caching.py", line 243, in _read_from_disk
val = cPickle.load(f)
File "/usr/lib/python2.7/gzip.py", line 450, in readline
c = self.read(readsize)
File "/usr/lib/python2.7/gzip.py", line 256, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 291, in _read
self._read_gzip_header()
File "/usr/lib/python2.7/gzip.py", line 185, in _read_gzip_header
raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 24875 on
node elevate exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[elevate:24874] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[elevate:24874] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
This has not been fixed. I'll have a think about how to do it.
Hopefully fixed in current PyOP2/Firedrake master.
The previous errors no longer occur, but I'm now getting another cache-related error when running my shallow water model in parallel. Unfortunately I can't seem to reproduce this with the demos or a simple example. Sometimes the processes will just hang at 100% CPU usage after printing "Compiler stage 3 finished", while other times it will give:
FFC finished in 0.126655 seconds.
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
[0] pyop2:INFO Compiling wrapper...
[0] pyop2:INFO Compiling wrapper...done
Traceback (most recent call last):
File "../../../models/shallow_water.py", line 538, in <module>
sw.run()
File "../../../models/shallow_water.py", line 489, in run
'snes_type': 'ksponly'})
File "/home/christian/firedrake/firedrake/solving.py", line 865, in solve
_solve_varproblem(*args, **kwargs)
File "/home/christian/firedrake/firedrake/solving.py", line 909, in _solve_varproblem
solver.solve()
File "/home/christian/firedrake/firedrake/solving.py", line 268, in solve
self.snes.solve(None, v)
File "SNES.pyx", line 413, in petsc4py.PETSc.SNES.solve (src/petsc4py.PETSc.c:143714)
File "petscsnes.pxi", line 225, in petsc4py.PETSc.SNES_Function (src/petsc4py.PETSc.c:29553)
File "/home/christian/firedrake/firedrake/solving.py", line 193, in form_function
with self._F_tensor.dat.vec_ro as v:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/petsc_base.py", line 172, in vecscatter
with acc(d) as v:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/petsc_base.py", line 89, in vec_context
self._force_evaluation()
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 1482, in _force_evaluation
_trace.evaluate(reads, writes)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 150, in evaluate
comp._run()
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 3456, in _run
return self.compute()
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/base.py", line 3463, in compute
self._compute(self.it_space.iterset.core_part)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/sequential.py", line 148, in _compute
[0] pyop2:INFO Compiling wrapper...
fun(*self._jit_args, argtypes=self._argtypes, restype=None)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/host.py", line 657, in __call__
return self.compile(argtypes, restype)(*args)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/host.py", line 732, in compile
restype=restype)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/compilation.py", line 200, in load
dll = compiler.get_so(src)
File "/usr/local/lib/python2.7/dist-packages/PyOP2-0.10.0-py2.7-linux-x86_64.egg/pyop2/compilation.py", line 143, in get_so
return ctypes.CDLL(soname)
File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /tmp/pyop2-cache-uid1000/93881e8b5cbe9c94f6383583bf4662c2.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 5207 on
node elevate exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
This happens even after deleting /tmp/pyop2-cache-uid1000 and /tmp/firedrake-* and trying again.
Is this on a system where not all processes can see the same temp filesystem?
No, I think all processes can see the same /tmp - this is on my laptop running with mpiexec -n 2 on a dual-core processor.
Aha, what do I need to run to attempt to reproduce?
The error sporadically occurs when running the dvs_channel simulation in the firedrake-fluids branch: mpiexec -n 2 python models/shallow_water.py tests/dvs_channel/dvs_channel.swml
When I try running in parallel I get a "ValueError: total size of new array must be unchanged" error when building the Mesh.
That's issue #222, which was fixed by Michael (but the fix is currently in petsc-next, not petsc-master).
Aha, ok, I'll have a go tomorrow then when I've rebuilt petsc.
You might also encounter issue #236 - I have been using Florian's short-term fix in firedrake/ffc_interface.py.
Also, I've locally reverted this commit in UFL to work around issue #237.
This is unfortunately still an issue. Any progress on this?
OK, my crystal ball suggests that the problem is due to boundary ids not being the same on all processes (and you therefore have a hang when compiling code). In particular, the error
OSError: /tmp/pyop2-cache-uid1000/93881e8b5cbe9c94f6383583bf4662c2.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
is symptomatic of this. Rank 1 (which doesn't do code compilation) tried to open a shared library which wasn't there. This means that rank 0 didn't compile it.
Note that if you run in PyOP2 debug mode (export PYOP2_DEBUG=1), we try to catch this by checking whether the generated code is the same on all processes before going into compilation.
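Roughly, that check amounts to something like the following (an illustrative mpi4py sketch, not PyOP2's actual implementation):

import hashlib
from mpi4py import MPI

def assert_code_consistent(code, comm=MPI.COMM_WORLD):
    # Compare a digest of the generated code across all ranks before compiling.
    digest = hashlib.md5(code.encode()).hexdigest()
    digests = comm.allgather(digest)       # collective, hence only done in debug mode
    if len(set(digests)) != 1:
        raise RuntimeError("Generated code differs between ranks: %s" % digests)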
The other symptom (hanging at 100% CPU) suggests that process 0 got there first, compiled code for its kernel, but the two processes are no longer running in data parallel mode (because they didn't run down the same code path). So probably one of them is spinning in MPI_Waitall (waiting for halo exchanges) while the other is sitting at an MPI_Barrier.
With my fix branch, I can't reproduce this, so can you try that please?
Thanks. The changes in the fix/parallel_ext_facets branch have fixed this issue.
FWIW, I have also sporadically seen the "cannot open shared object file: No such file or directory" error when running the Navier-Stokes benchmark in parallel on a single cx1 node, even after #302 was merged.
It turns out the failure is not sporadic but systematic: it always happens for 4 and 12 processes. That suggests it somehow relates to the way the mesh is partitioned. Running with PYOP2_DEBUG confirms that the code differs on different ranks.
Should this check always be active? It's pretty expensive since it involves an allgather.
Regardless, we should probably also dump the generated code, since otherwise debugging such a failure on a cluster backend node is nearly impossible when the cache is cleared afterwards. In fact, if the exception is raised, there is hardly any way to debug at all. Otherwise you at least get the path to the expected .so and can inspect the generated code, if you (still) have access to the cache.
So I suspect this will be something to do with the BC application, but I can't immediately see anything in the code. Can you reproduce this somewhere where debugging is more feasible? The best thing to do is to run with debug enabled and drop into the Python debugger when the exception occurs; you can then see which par_loop call actually caused the problem. However, I agree it's probably useful to dump the generated code and point to it if it is a problem. I do not think the check should always be active.
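For what it's worth, dumping the code before raising could look something like this (the directory and naming scheme are made up for illustration; this is not existing PyOP2 functionality):

import os
from mpi4py import MPI

def dump_generated_code(code, key, dump_dir="/tmp/pyop2-dump"):
    # Write the generated code to a per-rank file that survives the abort.
    rank = MPI.COMM_WORLD.rank
    try:
        os.makedirs(dump_dir)
    except OSError:
        pass                               # directory may already exist
    path = os.path.join(dump_dir, "%s.rank%d.c" % (key, rank))
    with open(path, "w") as f:
        f.write(code)
    return path                            # report this path in the error message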
I couldn't reproduce the failure on foraker and now I can't seem to be able to reproduce it on cx1 either.
I was, however, able to track down the par_loop and dump the generated code on different ranks. It was indeed a BC application:
[bc.apply(A2, b2) for bc in bcp]
and specifically the function assign. The code differed only on rank 0, where the function is set to 0, whereas on all other ranks another function is assigned (in this case the Constant p_in).
So that code, AIUI, doesn't actually do any par_loops in the case that A2 is a matrix and b2 a Function (it just adds a bc to the matrix). Oh, hold on. The Matrix object stores the bcs as a set, which doesn't have a guaranteed iteration order. So I bet what happens is that on rank zero you iterated through the bcs as [inflow, outflow] whereas on the other ranks you happened to do [outflow, inflow]. So I think the thing to do is to maintain an ordered list of unique bcs when calling add_bc on the matrix.
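A sketch of what maintaining an ordered list of unique bcs might look like (a hypothetical stand-in, not the actual firedrake Matrix class):

class Matrix(object):
    def __init__(self):
        self._bcs = []                     # ordered list instead of a set

    def add_bc(self, bc):
        # Keep first-insertion order and drop duplicates, so every rank
        # iterates over the bcs in exactly the same order.
        if bc not in self._bcs:
            self._bcs.append(bc)

    @property
    def bcs(self):
        return list(self._bcs)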
It's possible you can't reproduce now because the cache is populated.
Yes, I'm pretty sure you're right about set traversal order. I'll file a new issue since that needs fixing.
When running Firedrake in parallel on one 8-core node on CX1, I've had this error sporadically appear when project is called: