DIAGNijmegen / rse-grand-challenge-forge

Generation of challenge packs

Mysterious permissions errors when running test_run.sh twice #59

Open chrisvanrun opened 4 days ago

chrisvanrun commented 4 days ago

During the workshop run, @koopmant ran into the problem that running the test script twice resulted in permission errors.

The test_run.sh has a secondary docker run that was introduced specifically to target these permission problems (root user vs. the user created at build time). Hence it is unclear why this occurred. It's being investigated.

koopmant commented 4 days ago

Tried again after a fresh reinstall of my system. I'm on Ubuntu 24.04.1 LTS. Steps I took:

Initial run works. Second run stops after:

=+= Cleaning up any earlier output

Removed the -f argument from chmod. Now we see:

chmod: changing permissions of '.../test/output': Operation not permitted

Tried adding my user to the docker group (sudo usermod -aG docker $USER) to run docker as non-root and see if that fixes it.

Did not fix it…

Also tried taking ownership using chown, but get the same permission issue.

chrisvanrun commented 3 days ago

Ok. So it fails on this part somewhere:

if [ -d "$OUTPUT_DIR" ]; then
  # Ensure permissions are setup correctly
  # This allows for the Docker user to write to this location
  rm -rf "${OUTPUT_DIR}"/*
  chmod -f o+rwx "$OUTPUT_DIR"
else
  mkdir --mode=o+rwx "$OUTPUT_DIR"
fi

What are the permissions/ownership of the output directory right when you clone the repo, and what are they after the first run? ls -alh IIRC
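
For example, something along these lines should show it (assuming GNU coreutils; the test/output path is taken from the script above):

# Owner, group and mode of the output directory itself (-d: don't list its contents)
ls -alhd test/output

# Or, more compactly:
stat -c '%U(%u):%G(%g) %A %n' test/output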

chrisvanrun commented 3 days ago

Oh and while we are at it, what does id -u and id -g return at your command line?

koopmant commented 3 days ago

Oh and while we are at it, what does id -u and id -g return at your command line?

both return 1000

output of id command:

uid=1000(thomas) gid=1000(thomas) groups=1000(thomas),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),100(users),114(lpadmin)

koopmant commented 3 days ago

After clone I'm not sure, will check. After mkdir I'm the owner (obviously, but I checked regardless):

drwxrwxrwx 2 thomas thomas 4096 Oct 18 14:37 output

After a successful run, the owner is 100999:

drwxrwxrwx 2 100999 100999 4.0K Oct 18 14:38 output

chrisvanrun commented 3 days ago

Thanks for testing. Huh, 100999 is odd! Given your id is 1000, I would expect that this (at the end of the test_run.sh) fixes things:

docker run --rm \
    --quiet \
    --env HOST_UID=`id -u` \
    --env HOST_GID=`id -g` \
    --volume "$OUTPUT_DIR":/output \
    alpine:latest \
    /bin/sh -c 'chown -R ${HOST_UID}:${HOST_GID} /output'

koopmant commented 3 days ago

Well... When I remove that section, it actually works just fine? If I don't run that section, the owner of the files in output/ is 100998, btw. But the owner of the output folder itself is still me.

koopmant commented 3 days ago

~~I think it has something to do with this; but I don't fully understand it yet https://docs.docker.com/engine/security/userns-remap/~~

https://docs.docker.com/desktop/faqs/linuxfaqs/#how-do-i-enable-file-sharing

In this scenario if a shared file is chowned inside a Docker Desktop container owned by a user with a UID of 1000, it shows up on the host as owned by a user with a UID of 100999. This has the unfortunate side effect of preventing easy access to such a file on the host. The problem is resolved by creating a group with the new GID and adding our user to it, or by setting a recursive ACL (see setfacl(1)) for folders shared with the Docker Desktop VM.
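
To check whether this kind of UID shifting is in play on a given machine, something like the following should help (a sketch, assuming Linux with rootless Docker / Docker Desktop; the 100000 offset comes from the subordinate ID ranges):

# Subordinate UID/GID ranges used for remapping; an entry like
# 'thomas:100000:65536' explains why UID 999 inside the container
# shows up as 100999 on the host.
cat /etc/subuid /etc/subgid

# Docker lists 'name=userns' here when user-namespace remapping is enabled.
docker info --format '{{.SecurityOptions}}'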

chrisvanrun commented 3 days ago

Current suspect: with set -e enabled, if a docker run has an error, the permission setter is not run. We should catch it in an EXIT trap so it always runs.
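
In a minimal sketch (not the actual script), the idea would be:

#!/usr/bin/env bash
set -e

cleanup() {
    # Runs on every exit, including exits caused by a failing command under set -e
    echo "=+=  Fixing output permissions ..."
}
trap cleanup EXIT

false  # stands in for a failing docker run; cleanup still runs on exit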

koopmant commented 3 days ago

Disabling set -e does not change anything on my side; the permission issue persists.

chrisvanrun commented 3 days ago

@koopmant could you try a test_run.sh with this:

#!/usr/bin/env bash

# Stop at first error
set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
DOCKER_TAG="example-algorithm"
DOCKER_NOOP_VOLUME="${DOCKER_TAG}-volume"

INPUT_DIR="${SCRIPT_DIR}/test/input"
OUTPUT_DIR="${SCRIPT_DIR}/test/output"

cleanup() {
    echo "=+=  Cleaning up ..."
    # Ensure permissions are set correctly on the output
    # This allows the host user (e.g. you) to access and handle these files
    docker run --rm \
      --quiet \
      --env HOST_UID=`id -u` \
      --env HOST_GID=`id -g` \
      --volume "$OUTPUT_DIR":/output \
      alpine:latest \
      /bin/sh -c 'chown -R ${HOST_UID}:${HOST_GID} /output'
}

trap cleanup EXIT

if [ -d "$OUTPUT_DIR" ]; then
  # Ensure permissions are setup correctly
  # This allows for the Docker user to write to this location
  cleanup

  rm -rf "${OUTPUT_DIR}"/*
  chmod -f o+rwx "$OUTPUT_DIR"
else
  mkdir -m o+rwx "$OUTPUT_DIR"
fi

echo "=+= (Re)build the container"
docker build "$SCRIPT_DIR" \
  --platform=linux/amd64 \
  --tag $DOCKER_TAG 2>&1

echo "=+= Doing a forward pass"
## Note the extra arguments that are passed here:
# '--network none'
#    entails there is no internet connection
# '--volume <NAME>:/tmp'
#   is added because on Grand Challenge this directory cannot be used to store permanent files
docker volume create "$DOCKER_NOOP_VOLUME" > /dev/null

docker run --rm \
    --platform=linux/amd64 \
    --network none \
    --volume "$INPUT_DIR":/input:ro \
    --volume "$OUTPUT_DIR":/output \
    --volume "$DOCKER_NOOP_VOLUME":/tmp \
    $DOCKER_TAG

docker volume rm "$DOCKER_NOOP_VOLUME" > /dev/null

echo "=+= Wrote results to ${OUTPUT_DIR}"

echo "=+= Save this image for uploading via ./save.sh"

koopmant commented 3 days ago

output on second run:

=+= Cleaning up ...
=+= Cleaning up ...

chrisvanrun commented 3 days ago

Was that all the output? I'm not sure from your comment whether it worked. Sorry.

koopmant commented 3 days ago

Sorry, that was not very clear. Yes, that's all the output; so it's not working. It runs cleanup first, then runs into the permission error again during chmod (leading to immediate exit) and then runs cleanup again on exit.

chrisvanrun commented 3 days ago

No problem! @amickan also ran into a similar-looking problem. However, the above solution worked for her. Now I think it's possible that hers had a different origin (the set -e one). Yours likely stems from a slightly different ACL setup on Ubuntu 24.

Albeit, if you temporarily fix it via sudo, forcing ownership to your own user, does the error return when running the test_run.sh twice?

I am hesitant to add calls like this, because it really kicks the interoperability of the script down a few notches.


# Set the ACL recursively (-R) for the specified user and group
setfacl -R -m u:$USER:rwx -m g:$GROUP:rwx $DIRECTORY

# Set default ACL for future files and directories (so new files inherit the ACL)
setfacl -R -d -m u:$USER:rwx -m g:$GROUP:rwx $DIRECTORY
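
A concrete invocation for the output directory of this pack would presumably look like this (the test/output path and the use of id here are assumptions on my part):

# Grant the current user and group rwx on the directory and everything in it
setfacl -R -m "u:$(id -un):rwx" -m "g:$(id -gn):rwx" test/output

# Default ACL on the directory itself, so files created later inherit it
setfacl -d -m "u:$(id -un):rwx" -m "g:$(id -gn):rwx" test/output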

koopmant commented 3 days ago

Albeit, if you temporarily fix it via sudo, forcing ownership to your own user, does the error return when running the test_run.sh twice?

Sorry, I don't quite understand what you mean.

FWIW, if you change this line: /bin/sh -c 'chown -R ${HOST_UID}:${HOST_GID} /output' to /bin/sh -c 'chown -R ${HOST_UID}:${HOST_GID} /output/*', the issue of running the script again is fixed. On my system, the files are then still owned by another user id, so I cannot open them without providing root authentication; but this should still work on other systems.

The line chmod o+rwx "$OUTPUT_DIR" is then unnecessary, because the directory then never changed ownership. (But it doesn't raise an issue.)
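
Spelled out, the suggested change to the final docker run would look roughly like this (a sketch of the idea above, not a tested patch):

docker run --rm \
    --quiet \
    --env HOST_UID=$(id -u) \
    --env HOST_GID=$(id -g) \
    --volume "$OUTPUT_DIR":/output \
    alpine:latest \
    /bin/sh -c 'chown -R ${HOST_UID}:${HOST_GID} /output/*'  # only the contents; the /output mount point keeps its host owner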

chrisvanrun commented 3 days ago

Sorry, let me clarify: if you fix ownership once, does the proposed test_run.sh then keep working, even if run multiple times?

The initial chmod is meant to ensure that Docker's internal user can actually write to the output directory in the first place. The second chown is to ensure that the host user has full access to the files and is allowed to change any and all permissions on them.

koopmant commented 1 day ago

Sorry, let me clarify: if you fix ownership once, does the proposed test_run.sh then keep working, even if run multiple times?

Yes, that sort of works. The chown attempted in the last docker command then fails with a permission error, so I remain the owner of the directory and the rest of the script is successful. Still, that means the current user does not become the owner of the files in the output directory. But at least the script runs and creates output.