Possible race with parallel execution

thrix commented 1 month ago

Thanks for creating this awesome pytest plugin!

We are running into ocassionally errors like this when testing with pytest-container in parallel in the tmt project:

plugins: xdist-3.6.1, testinfra-10.1.1, pytest_container-0.4.2, anyio-3.7.1
created: 4/4 workers
4 workers [267 items]
....
/tmp/tmp.brJePL3Qbj/lib64/python3.12/site-packages/pytest_container/container.py:1074: in launch_container
    release_lock()
        lock       = <filelock._unix.UnixFileLock object at 0x7f4aa7d048c0>
        release_lock = <function ContainerLauncher.launch_container.<locals>.release_lock at 0x7f4aa7c3ec00>
        self       = ContainerLauncher(container=Container(url='localhost/tmt/tests/container/fedora/40/upstream:latest',
                                      container_id='',
                                      entry_point=<EntrypointSelection.AUTO: 1>,
                                      custom_entry_point=None,
                                      extra_launch_args=[],
                                      extra_entrypoint_args=[],
                                      healthcheck_timeout=None,
                                      extra_environment_variables=None,
                                      singleton=False,
                                      forwarded_ports=[],
                                      volume_mounts=[],
                                      _is_local=True),
                  container_runtime=PodmanRuntime(build_command=['buildah',
                                                                 'bud',
                                                                 '--layers',
                                                                 '--force-rm'],
                                                  runner_binary='podman',
                                                  _runtime_functional=True),
                  rootdir=PosixPath('/var/ARTIFACTS/work-extended-unit-testsbq0eoa5c/plans/features/extended-unit-tests/discover/default-0/tests'),
                  extra_build_args=[],
                  extra_run_args=['--label',
                                  'pytest_container.request=<SubRequest '
                                  "'container' for <Function "
                                  'test_install_nonexistent[localhost/tmt/tests/container/fedora/40/upstream:latest '
                                  '/ dnf]>>',
                                  '--label',
                                  'pytest_container.node.name=test_install_nonexistent[localhost/tmt/tests/container/fedora/40/upstream:latest '
                                  '/ dnf]',
                                  '--label',
                                  'pytest_container.scope=function',
                                  '--label',
                                  'pytest_container.path=/var/ARTIFACTS/work-extended-unit-testsbq0eoa5c/plans/features/extended-unit-tests/discover/default-0/tests/tests/unit/test_package_managers.py'],
                  container_name='',
                  _expose_ports=True,
                  _new_port_forwards=[],
                  _container_id=None,
                  _stack=<contextlib.ExitStack object at 0x7f4aa7d04d10>,
                  _cidfile='/tmp/9ff87fff-a83f-4d73-9eb8-0489bcbfc126')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def release_lock() -> None:
        _logger.debug("Releasing lock %s", lock.lock_file)
        lock.release()
>       os.unlink(lock.lock_file)
E       FileNotFoundError: [Errno 2] No such file or directory: '/tmp/15a4eab94d676c16ac9314f4e1e336d1.lock'

lock       = <filelock._unix.UnixFileLock object at 0x7f4aa7d048c0>

/tmp/tmp.brJePL3Qbj/lib64/python3.12/site-packages/pytest_container/container.py:1056: FileNotFoundError

Full logs here:

https://artifacts.dev.testing-farm.io/c003ad12-df6f-41df-abe7-55f857ca2d16/work-extended-unit-testsbq0eoa5c/plans/features/extended-unit-tests/execute/data/guest/default-0/tests/unit/with-system-packages/extended-1/output.txt

@dcermak aware of any issue around parallelization?

Anyway, we plan to debug it more, and hopefully provide a patch, wanted just to share our experience.

FTR the unit tests are here:

https://github.com/teemtee/tmt/blob/main/tests/unit/test_package_managers.py

dcermak commented 1 month ago

@dcermak aware of any issue around parallelization?

I am not aware of any issues myself. I can only imagine the following scenario: You launch multiple containers from the same Container instance that get prepared in parallel. The first one finishes preparing and unlocks the lockfile here: https://github.com/dcermak/pytest_container/blob/1e19e1254e4f058e7aa48c098f4eb8ff05ac6421/pytest_container/container.py#L1094

Then something pauses that thread and another container from the same Container instance is prepared and finishes completely, i.e. also including the lockfile removal in https://github.com/dcermak/pytest_container/blob/1e19e1254e4f058e7aa48c098f4eb8ff05ac6421/pytest_container/container.py#L1095

The previous thread gets unpaused, tries to remove the lockfile and dies :skull_and_crossbones:

I must confess that I find this imaginary scenario a bit unlikely but possible. We could change the code to the following and tolerate already deleted lockfiles by using pathlibs' Path.unlink:

        def release_lock() -> None:
            _logger.debug("Releasing lock %s", lock.lock_file)
            lock.release()
-            os.unlink(lock.lock_file)
+            Path(lock.lock_file).unlink(missing_ok=True)

I doubt that this could have any negative consequences (I know, famous last words)

thrix commented 1 month ago

@dcermak sounds like a plan, thanks for the quick follow up :)

thrix commented 1 month ago

@dcermak also sounds like maybe we could change something in the code to use multiple Container instances rather, to also avoid the problem?

Chiming in @happz

dcermak commented 1 month ago

@dcermak also sounds like maybe we could change something in the code to use multiple Container instances rather, to also avoid the problem?

Chiming in @happz

No, that won't avoid the problem as the lock file path is inferred from the container's attributes. Two instances with the same properties should have the same lock file path.

You might want to reconsider which fixture you are using though. It looks like you are performing mutating tests, but are using the container fixture. This fixture will use the same (podman/docker) container for all test functions. I.e. mutations from a previous test function can influence another. If you want to prevent this from happening, use the container_per_test fixture.

happz commented 1 month ago

@dcermak also sounds like maybe we could change something in the code to use multiple Container instances rather, to also avoid the problem? Chiming in @happz

No, that won't avoid the problem as the lock file path is inferred from the container's attributes. Two instances with the same properties should have the same lock file path.

You might want to reconsider which fixture you are using though. It looks like you are performing mutating tests, but are using the container fixture. This fixture will use the same (podman/docker) container for all test functions. I.e. mutations from a previous test function can influence another. If you want to prevent this from happening, use the container_per_test fixture.

I believe we sre using both: some of our tests do not modify the container, e.g. when we test an attempt to install nonexistent package. I mean, not modify in an interesting way - package may fetch repository metadata, but that’s fine for our use case. Amd test where we expect changes should be using dedicated containers per testcase.

But I will review it on Monday so we are sure of it. The concept is known to us, let’s make sure we use it as exected while we’re hunting down this kind of issues.

happz commented 4 weeks ago

AFAICT, we're using the fixtures correctly. We do have two pairs of them: guest + container, and guest_per_test + container_per_test. The first set is used for read-only-ish tests, the second set for tests that actually change the set of installed packages.

On the other hand, we try to rebuild images to inject repository metadata, which is usually the most costly part of running a package manager in a container, and therefore it shouldn't hurt us too much if we switch to container_per_test completely. I'm going to run tests a couple of times, measure their duration and we'll see.

dcermak / pytest_container

Possible race with parallel execution #232