conda-incubator / conda-store

Data science environments, for collaboration. ✨
https://conda.store
BSD 3-Clause "New" or "Revised" License
142 stars 46 forks source link

[BUG] - unit-test conda-store-server (ubuntu) hangs on CI #666

Open nkaretnikov opened 9 months ago

nkaretnikov commented 9 months ago

Describe the bug

Open a no-op PR against current main a5680f855fbf8317b8a9dbac43017ccc4d07927b, like in https://github.com/conda-incubator/conda-store/pull/665. unit-test conda-store-server (ubuntu) hangs.

Expected behavior

Test passes, no hang.

How to Reproduce the problem?

.

Output

No response

Versions and dependencies used.

No response

Anything else?

No response

nkaretnikov commented 9 months ago

Currently looking into this.

nkaretnikov commented 9 months ago

Reported upstream: https://github.com/conda-incubator/conda-docker/issues/23

nkaretnikov commented 9 months ago

When https://github.com/conda-incubator/conda-store/pull/653 was merged 3 days ago, all tests passed (not skipped). Will diff the test logs to see if anything is different now.

nkaretnikov commented 9 months ago

These messages are printed by PyInstaller:

WARNING: file already exists but should not:

See https://github.com/pyinstaller/pyinstaller/blob/554689c49e773d309a4fc7af61ad135464524c62/bootloader/src/pyi_utils.c#L639

You can set this env var to turn warnings into errors, however, conda-docker won't fail as is because no exit-code checking is done

PYINSTALLER_STRICT_UNPACK_MODE=1

See https://github.com/pyinstaller/pyinstaller/blob/554689c49e773d309a4fc7af61ad135464524c62/bootloader/src/pyi_utils.c#L552

nkaretnikov commented 9 months ago

The above warning repros on a CI builder after re-running the problematic test (to populate the test dir) and doing:

fakechroot chroot /tmp/test /_conda.exe --help

I changed tmpdir to /tmp/test in build_docker_environment_image in conda-docker.

The full command (not the help one) works fine without fakechroot.

Looking at strace, this keeps executing new processes, which is why it hangs:

strace -f --trace=vfork,execve fakechroot chroot /tmp/test /_conda.exe --help 2>&1 | grep -v WARN
[pid 43781] execve("/tmp/test/_conda.exe", ["/_conda.exe", "--help"], 0x56212379b090 /* 138 vars */) = 0
strace: Process 43785 attached
[pid 43785] execve("/tmp/test/_conda.exe", ["/_conda.exe", "--help"], 0x562482135e10 /* 138 vars */) = 0
strace: Process 43789 attached
[pid 43789] execve("/tmp/test/_conda.exe", ["/_conda.exe", "--help"], 0x55b5460f3e10 /* 138 vars */) = 0
strace: Process 43792 attached
...

You can also see this in the process tree.

nkaretnikov commented 9 months ago

I can reproduce locally outside of CI:

fakechroot chroot /home/nkaretnikov/.conda/envs/conda-store-dev-old/standalone_conda /conda.exe --version
...
[2174441] WARNING: file already exists but should not: /libmambapy-1.4.1.dist-info/direct_url.json
[2174441] WARNING: file already exists but should not: /libmambapy-1.4.1.dist-info/top_level.txt
conda_package_streaming/package_streaming.py:25: UserWarning: zstandard could not be imported. Running without .conda support.
conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
Error while loading conda entry point: conda-libmamba-solver (libmamba.so.2: cannot open shared object file: No such file or directory)
conda 23.7.2

So, as mentioned before, locally this fails due to libmamba.so.2 not being found, which means the process doesn't loop like on CI.

Still trying to figure out what causes the loop. I suspect it might have something to do with LD environment variables. fakechroot relies on LD_PRELOAD and LD_LIBRARY_PATH is also set (I copied env and grep to the chroot dir to make them accessible):

fakechroot -s chroot /tmp/test/ /sh -c '/env | /grep LD'
FAKECHROOT_LDLIBPATH=/usr/lib:/lib
LD_LIBRARY_PATH=/tmp/test/usr/lib:/tmp/test/lib:/usr/lib:/lib:/tmp/test/usr/lib:/tmp/test/lib
LD_PRELOAD=libfakechroot.so

PyInstaller, which is used by standalone conda, also uses LD_LIBRARY_PATH. People were having issues related to that in that past. Again, it could be that fakechroot just doesn't work with PyInstaller and we didn't notice because libmamba import was failing.

Could be that the versions matter:

I'll try with the same version locally and report back.

nkaretnikov commented 9 months ago

On CI, it doesn't loop if I do this (setting the variable to empty string after it was set by fakechroot):

fakechroot -s chroot /tmp/test/ /sh -c 'LD_PRELOAD='' /_conda.exe --help'

Not sure yet if this is the solution. Since the whole point of fakechroot is to hijack execution via LD_PRELOAD. Could it be that things we needed fakechroot for initially (chroot) are done executing by that point?

UPD: Nope, this cannot be a solution, when you do the proper install command it fails because it still needs to resolve directories according to a fake root, but without LD_PRELOAD it cannot do so.

nkaretnikov commented 9 months ago

I think this might be the cause:

Unlikely; with multiprocessing, the problem is that frameworks tend assume that sys.executable contains python interpreter, so they spawn sub-processes using it in order to run some python command or script. In frozen application, sys.executable points to your frozen executable, so you are spawning an additional copy of your program. Our hooks for multiprocessing monkey-patch multiprocessing.freeze() with custom function that detects such spawned subprocesses (based on sys.argv) and divert the program flow accordingly. That's why multiprocessing works (as long as you call freeze_support), but other frameworks (or custom subbprocess.Popen((sys.executable, ...), ...) do not (unless you explicitly handle the subprocess codepaths in your entry-point script).

https://github.com/pyinstaller/pyinstaller/issues/7165#issuecomment-1281418899

nkaretnikov commented 9 months ago

Yes, it's a PyInstaller/fakechroot-specific problem.

nkaretnikov commented 9 months ago

I've diffed the logs that passed before against the recent ones that fail. I think it might be due to a change in the conda-standalone package that's installed by GitHub Actions:

  + conda-standalone                    [-23.9.0-]{+23.10.0+}  ha770c72_0          conda-forge/linux-64      [-34MB-]{+32MB+}

Again, the cause is the PyInstaller/fakechroot thing. But the above change is what's likely highlighted the problem.

I'm going to try reproducing with an old version of conda-standalone.

nkaretnikov commented 9 months ago

On CI, I tried adding the following to conda-store-server/environment-dev.yaml:

- conda-standalone ==23.9.0

As expected, this makes the tests "pass", but only because:

  1. there's an error when executing conda-docker, which makes the subprocess exit
  2. no exit-code checking is done in conda-docker, so an early exit due to an error is indistinguishable from a success
  3. test_generate_conda_docker currently doesn't validate conda-docker's output in any way.

Here's an actual test result from CI after enabling error checking and enabling stdout/stderr printing:

tests/test_actions.py::test_generate_conda_docker[conda_prefix0] INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 48be4072fe58, initial schema
INFO  [alembic.runtime.migration] Running upgrade 48be4072fe58 -> 8d63a091aff8, Add Environment.description
INFO  [alembic.runtime.migration] Running upgrade 8d63a091aff8 -> 5ad723de2abd, Adding CONTAINER_REGISTRY value to enum
INFO  [alembic.runtime.migration] Running upgrade 5ad723de2abd -> 16f65805dc8f, split conda_package into conda_package and conda_package_build
INFO  [alembic.runtime.migration] Running upgrade 16f65805dc8f -> abd7248d5327, adding a settings table
INFO  [alembic.runtime.migration] Running upgrade abd7248d5327 -> b387747ca9b7, role mapping
INFO  [alembic.runtime.migration] Running upgrade b387747ca9b7 -> d78e9889566a, add status_info
WARNI [traitlets] registry https://registry-1.docker.io not configured using registry without authentication
[6138] Error loading Python lib '/_internal/libpython3.9.so.1.0': dlopen: /tmp/tmpp1ab61zw/_internal/libpython3.9.so.1.0: cannot open shared object file: No such file or directory
FAILED

Changes I had to make to get this:

diff --git a/conda-store-server/conda_store_server/action/base.py b/conda-store-server/conda_store_server/action/base.py
index e0c5e7a..41a2aae 100644
--- a/conda-store-server/conda_store_server/action/base.py
+++ b/conda-store-server/conda_store_server/action/base.py
@@ -16,10 +16,10 @@ def action(f: typing.Callable):
         action_context = ActionContext()
         with contextlib.ExitStack() as stack:
             # redirect stdout -> action_context.stdout
-            stack.enter_context(contextlib.redirect_stdout(action_context.stdout))
+            # stack.enter_context(contextlib.redirect_stdout(action_context.stdout))

             # redirect stderr -> action_context.stdout
-            stack.enter_context(contextlib.redirect_stderr(action_context.stdout))
+            # stack.enter_context(contextlib.redirect_stderr(action_context.stdout))

And changed the fakechroot command as follows in /usr/share/miniconda3/envs/test/lib/python3.10/site-packages/conda_docker/conda.py:

    subprocess.check_output(
        [
            "fakechroot",
            "chroot",
            new_root,
            "/_conda.exe",
            "install",
            "--offline",
            "--file",
            "/opt/conda/pkgs/env.txt",
            "-y",
            "--prefix",
            "/opt/conda",
        ],
        env=env,
        cwd=host_conda_opt,
        # stdout=subprocess.DEVNULL,
        # stderr=subprocess.DEVNULL,
    )
nkaretnikov commented 9 months ago
nkaretnikov commented 9 months ago

Jaime says conda-standalone (_conda.exe above from conda-docker) is not production-ready. It shouldn't be used. Jaime suggests to use micromamba instead.

jaimergp commented 9 months ago

xref https://github.com/conda-incubator/conda-docker/issues/23

nkaretnikov commented 8 months ago

Status update: #667 has landed, which disables conda-docker in conda-store, but a fix still needs to be implemented in conda-docker.

nkaretnikov commented 8 months ago

Marked as blocked because I need to focus on other tasks for now.

nkaretnikov commented 5 months ago

Tania said we need to discuss whether we will be supporting Docker artifact generation going forward.