MPI python demo tests fail via test.py when pytest-mpi plugin is installed

drew-parsons commented 2 years ago

Dolfinx (0.4.1) provides python demos in python/demo, together with test.py to run the tests with pytest.

On a debian build the demos run successfully on their own, e.g. mpiexec -np 3 python3 demo_poisson.py generates poisson.xdmf as expected in the out_poisson dir. We don't have pyvista, but the try: except block prevents that from triggering an error.

Running demo/test.py in serial mode succeeds. But trying to run the demos with pytest and mpi fails, e.g. (testing only the poisson demo for simplicity)

/projects/fenics-dolfinx/python/demo$ python3 -m pytest -v -m mpi --durations=20 -k poisson test.py --mpiexec=mpiexec --num-proc=3 --with-mpi
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/projects/fenics-dolfinx/python/demo/.hypothesis/examples')
rootdir: /projects/fenics-dolfinx/python/demo, configfile: pytest.ini
plugins: xvfb-2.0.0, arraydiff-0.5.0, cov-3.0.0, mock-3.7.0, hypothesis-6.36.0, flaky-3.7.0, astropy-header-0.2.1, remotedata-0.3.3, filter-subpackage-0.1.1, doctestplus-0.12.0, asyncio-0.18.3, mpi-0.6, openfiles-0.5.0, astropy-0.10.0
asyncio: mode=auto
collected 26 items / 25 deselected / 1 selected                                                                                                                                                                   

test.py::test_demos_mpi[path9-demo_poisson.py] FAILED                                                                                                                                                       [100%]

==================================================================================================== FAILURES =====================================================================================================
______________________________________________________________________________________ test_demos_mpi[path9-demo_poisson.py] ______________________________________________________________________________________

num_proc = '3', mpiexec = 'mpiexec', path = PosixPath('/projects/fenics-dolfinx/python/demo'), name = 'demo_poisson.py'

    @pytest.mark.mpi
    @pytest.mark.parametrize("path,name", demos)
    def test_demos_mpi(num_proc, mpiexec, path, name):
        cmd = [mpiexec, "-np", str(num_proc), sys.executable, name]
        print(cmd)
>       ret = subprocess.run(cmd, cwd=str(path), check=True)

test.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

input = None, capture_output = False, timeout = None, check = True, popenargs = (['mpiexec', '-np', '3', '/usr/bin/python3', 'demo_poisson.py'],)
kwargs = {'cwd': '/projects/fenics-dolfinx/python/demo'}, process = <Popen: returncode: 1 args: ['mpiexec', '-np', '3', '/usr/bin/python3', 'dem...>, stdout = None, stderr = None
retcode = 1

    def run(*popenargs,
            input=None, capture_output=False, timeout=None, check=False, **kwargs):
        """Run command with arguments and return a CompletedProcess instance.

        The returned instance will have attributes args, returncode, stdout and
        stderr. By default, stdout and stderr are not captured, and those attributes
        will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.

        If check is True and the exit code was non-zero, it raises a
        CalledProcessError. The CalledProcessError object will have the return code
        in the returncode attribute, and output & stderr attributes if those streams
        were captured.

        If timeout is given, and the process takes too long, a TimeoutExpired
        exception will be raised.

        There is an optional argument "input", allowing you to
        pass bytes or a string to the subprocess's stdin.  If you use this argument
        you may not also use the Popen constructor's "stdin" argument, as
        it will be used internally.

        By default, all communication is in bytes, and therefore any "input" should
        be bytes, and the stdout and stderr will be bytes. If in text mode, any
        "input" should be a string, and stdout and stderr will be strings decoded
        according to locale encoding, or by "encoding" if set. Text mode is
        triggered by setting any of text, encoding, errors or universal_newlines.

        The other arguments are the same as for the Popen constructor.
        """
        if input is not None:
            if kwargs.get('stdin') is not None:
                raise ValueError('stdin and input arguments may not both be used.')
            kwargs['stdin'] = PIPE

        if capture_output:
            if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
                raise ValueError('stdout and stderr arguments may not be used '
                                 'with capture_output.')
            kwargs['stdout'] = PIPE
            kwargs['stderr'] = PIPE

        with Popen(*popenargs, **kwargs) as process:
            try:
                stdout, stderr = process.communicate(input, timeout=timeout)
            except TimeoutExpired as exc:
                process.kill()
                if _mswindows:
                    # Windows accumulates the output in a single blocking
                    # read() call run on child threads, with the timeout
                    # being done in a join() on those threads.  communicate()
                    # _after_ kill() is required to collect that and add it
                    # to the exception.
                    exc.stdout, exc.stderr = process.communicate()
                else:
                    # POSIX _communicate already populated the output so
                    # far into the TimeoutExpired exception.
                    process.wait()
                raise
            except:  # Including KeyboardInterrupt, communicate handled that.
                process.kill()
                # We don't call process.wait() as .__exit__ does that for us.
                raise
            retcode = process.poll()
            if check and retcode:
>               raise CalledProcessError(retcode, process.args,
                                         output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command '['mpiexec', '-np', '3', '/usr/bin/python3', 'demo_poisson.py']' returned non-zero exit status 1.

/usr/lib/python3.10/subprocess.py:524: CalledProcessError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
['mpiexec', '-np', '3', '/usr/bin/python3', 'demo_poisson.py']
================================================================================================= MPI Information =================================================================================================
rank: 0
size: 1
MPI version: 3.1
MPI library version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
MPI vendor: Open MPI 4.1.3
mpi4py rc: 
mpi4py config:
 mpicc: /usr/bin/mpicc
 mpicxx: /usr/bin/mpicxx
============================================================================================== slowest 20 durations ===============================================================================================
3.55s setup    test.py::test_demos_mpi[path9-demo_poisson.py]
0.03s call     test.py::test_demos_mpi[path9-demo_poisson.py]

(1 durations < 0.005s hidden.  Use -vv to show these durations.)
============================================================================================= short test summary info =============================================================================================
FAILED test.py::test_demos_mpi[path9-demo_poisson.py] - subprocess.CalledProcessError: Command '['mpiexec', '-np', '3', '/usr/bin/python3', 'demo_poisson.py']' returned non-zero exit status 1.
======================================================================================== 1 failed, 25 deselected in 3.98s =========================================================================================

A full debian CI test log showing the same problem can be seen at https://ci.debian.net/data/autopkgtest/unstable/amd64/f/fenics-dolfinx/23085308/log.gz

So pytest says the subprocess is failing. But the command in question, mpiexec -np 3 /usr/bin/python3 demo_poisson.py runs perfectly fine on its own.

The problem must lie in the way pytest handles mpi jobs as subprocesses. But the pytest-mpi plugin is supposed to faciltate MPI tests.

Worth noting that the tests previously passed fine, https://ci.debian.net/data/autopkgtest/unstable/amd64/f/fenics-dolfinx/22512767/log.gz At that point the debian system had

dolfinx 0.3.0
pytest 6.2.5
pytest-mpi 0.6
openmpi 4.1.4
mpi4py 3.1.3
petsc 3.16.6

Now in the failing case the updated packages are

dolfinx 0.4.1
pytest 7.1.2
pytest-mpi 0.6

Since the dolfinx demo runs fine outside pytest, we can point the finger at the pytest upgrade. Maybe it means pytest-mpi needs to be patched to match the new pytest.

Does anyone know how to fix the problem?

drew-parsons commented 2 years ago

I note the github dolfinx CI tests here use pytest 7.1.2 and do not use pytest-mpi. The github test is python3 -m pytest -m mpi --num-proc=2 python/demo/test.py.

If I run that in verbose on my system, it tells me --with-mpi needs to be specified

/projects/fenics-dolfinx$ python3 -m pytest -m mpi -v --num-proc=2 python/demo/test.py
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/projects/fenics-dolfinx/.hypothesis/examples')
rootdir: /projects/fenics-dolfinx/python/demo, configfile: pytest.ini
plugins: xvfb-2.0.0, arraydiff-0.5.0, cov-3.0.0, mock-3.7.0, hypothesis-6.36.0, flaky-3.7.0, astropy-header-0.2.1, remotedata-0.3.3, filter-subpackage-0.1.1, doctestplus-0.12.0, asyncio-0.18.3, mpi-0.6, openfiles-0.5.0, astropy-0.10.0
asyncio: mode=auto
collected 26 items / 13 deselected / 13 selected                                                                                                                                                                  

python/demo/test.py::test_demos_mpi[path0-demo_elasticity.py] SKIPPED (need --with-mpi option to run)                                                                                                       [  7%]
python/demo/test.py::test_demos_mpi[path1-demo_gmsh.py] SKIPPED (need --with-mpi option to run)                                                                                                             [ 15%]
python/demo/test.py::test_demos_mpi[path2-demo_lagrange_variants.py] SKIPPED (need --with-mpi option to run)                                                                                                [ 23%]
python/demo/test.py::test_demos_mpi[path3-demo_types.py] SKIPPED (need --with-mpi option to run)                                                                                                            [ 30%]
python/demo/test.py::test_demos_mpi[path4-demo_pyvista.py] SKIPPED (need --with-mpi option to run)                                                                                                          [ 38%]
python/demo/test.py::test_demos_mpi[path5-demo_interpolation-io.py] SKIPPED (need --with-mpi option to run)                                                                                                 [ 46%]
python/demo/test.py::test_demos_mpi[path6-demo_static-condensation.py] SKIPPED (need --with-mpi option to run)                                                                                              [ 53%]
python/demo/test.py::test_demos_mpi[path7-demo_cahn-hilliard.py] SKIPPED (need --with-mpi option to run)                                                                                                    [ 61%]
python/demo/test.py::test_demos_mpi[path8-test.py] SKIPPED (need --with-mpi option to run)                                                                                                                  [ 69%]
python/demo/test.py::test_demos_mpi[path9-demo_poisson.py] SKIPPED (need --with-mpi option to run)                                                                                                          [ 76%]
python/demo/test.py::test_demos_mpi[path10-demo_helmholtz.py] SKIPPED (need --with-mpi option to run)                                                                                                       [ 84%]
python/demo/test.py::test_demos_mpi[path11-conftest.py] SKIPPED (need --with-mpi option to run)                                                                                                             [ 92%]
python/demo/test.py::test_demos_mpi[path12-demo_stokes.py] SKIPPED (need --with-mpi option to run)                                                                                                          [100%]

======================================================================================= 13 skipped, 13 deselected in 0.28s ========================================================================================

So there is a discrepancy between the github test environment and the debian environment, with pytest-mpi being the obvious discrepancy. It's not used in github CI, and if I remove it from my system then the MPI python demo tests proceed (and pass) in the same way as github CI.

So one solution is not to have pytest-mpi installed. It's a bigger (and known) problem with pytest that it loads all available plugins regardless or whether they're wanted, unless -p no:mpi is used. But I can't see any changes in pytest 7 that would cause pytest-mpi to start failing, but clearly there's a problem somewhere.

drew-parsons commented 2 years ago

So the problem is the interaction between the dolfinx python demo tests, pytest (v7 or otherwise) and the pytest-mpi plugin. I'll updating the title to identify pytest-mpi as the core problem.

In my case, I need pytest-mpi installed for the purpose of testing h5py, and the dolfinx test failure is collateral damage. So for me a sufficient work-around is to explicitly apply -p no:mpi to switch off the plugin, running tests with something like

python3 -m pytest -p no:mpi -m mpi --num-proc=2 python/demo/test.py

That combination "-p no:mpi -m mpi" looks a bit weird, the left hand is removing what the right hand is putting in place. But to run the dolfinx python demos in MPI mode it seems to be the only way to do it if the pytest-mpi plugin is installed.

dolfinx/python/demo/test.py creates its own custom marker mpi, i.e. the -m mpi in python3 -m pytest -m mpi is a pytest marker not a python module (the pytest API is confusing!).

Could it be useful for dolfinx to switch to using the pytest-mpi plugin instead?

garth-wells commented 2 years ago

I'm not familiar with pytest-mpi. If it works well and is maintained we could use it. It would allow us to remove some custom pytest decorators.

drew-parsons commented 2 years ago

I'll look into it further and see if it's easy to adapt to it

jorgensd commented 19 hours ago

@drew-parsons is this still an issue?

FEniCS / dolfinx

MPI python demo tests fail via test.py when pytest-mpi plugin is installed #2254