containers / crun

A fast and lightweight fully featured OCI runtime and C library for running containers
GNU General Public License v2.0

Difference in RuntimeDefault seccompProfile.type between crun and runc #1600

Closed NoamNakash closed 3 weeks ago

NoamNakash commented 3 weeks ago

Hello

We are trying to replace runc with crun in our k8s clusters, and in our tests we encountered a permission issue with our products' internal backup and restore process.

The specific error encountered was with a sidecar container attempting to untar an archive onto shared storage between the main and sidecar containers during restore: 'tar: data: Cannot change mode to rwxrwsr-x: Operation not permitted'

This seems to come from running the following command inside the sidecar: tar -p --use-compress-program="gzip -d" -xvf 20241028102632_LOCAL_ccas_ccas-apache-0_volume.tar.gz -C untar_dir

We have confirmed that the issue is not related to SELinux, that the data directory already had rwxrwsr-x permissions, and that the runAsUser and runAsGroup values are the same (818) in both pods.

We are not running rootless containers, and we confirmed run.oci.keep_original_groups=1 does not resolve the issue.

After further investigation, we found that after replacing the seccompProfile.type value from RuntimeDefault to Unconfined, the restore process completes successfully.

Is there an intended difference in RuntimeDefault seccompProfile.type between crun and runc? Is this expected behavior in that case?

Here is the relevant statefulset

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: ccas-apache
    meta.helm.sh/release-namespace: ccas2
  creationTimestamp: "2024-11-06T08:26:18Z"
  generation: 3
  labels:
    app: ccas-apache
    app.kubernetes.io/component: cassandra
    app.kubernetes.io/instance: ccas-apache
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ccas-apache
    app.kubernetes.io/version: ccas-apache-2.0-3_cassandra-4.1.6-0
    helm.sh/chart: ccas-apache-8.2.2
    release: ccas-apache
    sidecar.istio.io/inject: "false"
  name: ccas-apache
  namespace: ccas2
  resourceVersion: "6044718"
  uid: 56245b9b-097f-48b2-91be-41a41fe67bd3
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: ccas-apache
      release: ccas-apache
  serviceName: ccas-apache
  template:
    metadata:
      annotations:
        checksum/config: |
          26c538f25ff80907a331902b2ba667f28c57b1b8a32f36fe127f49519783e2a9
          90dd30c1f87db281d5e2b9d19491977f100f7037b79cfb7433b6aeec82492c60
      creationTimestamp: null
      labels:
        app: ccas-apache
        app.kubernetes.io/component: cassandra
        app.kubernetes.io/instance: ccas-apache
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: ccas-apache
        app.kubernetes.io/version: ccas-apache-2.0-3_cassandra-4.1.6-0
        helm.sh/chart: ccas-apache-8.2.2
        release: ccas-apache
        sidecar.istio.io/inject: "false"
        type: server
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: type
                  operator: In
                  values:
                  - server
                matchLabels:
                  app: ccas-apache
                  release: ccas-apache
              topologyKey: kubernetes.io/hostname
            weight: 1
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ccas-apache
                - key: type
                  operator: In
                  values:
                  - server
              topologyKey: topology.kubernetes.io/zone
            weight: 50
      automountServiceAccountToken: false
      containers:
      - args:
        - -c
        - |
          echo "Encrypting secrets"
          mkdir -p /CASSANDRA_DD/cassandra/.ccas
          mkdir -p /home/cassandra-home/.cassandra/gpg
          chmod 770 /home/cassandra-home/.cassandra/gpg
          source /secrets/k8s-restricted-config.sh

          encrypt_secrets VWUzMVFDQVk3bDR6cGtHanc1d0E=
          echo "Secrets encrypted"
          [ ! -e /tmp/secret.fifo ] && mkfifo /tmp/secret.fifo
          ( echo VWUzMVFDQVk3bDR6cGtHanc1d0E= > /tmp/secret.fifo )&
          /usr/bin/harmonize_log --service=cassandra --addcommand=/opt/cass-tools/scripts/cassandra-monitor --log-file=${CASSANDRA_LOG_DIR}/system.log  --log-file=${CASSANDRA_LOG_DIR}/audit.log /lib/cassandra/deploy.sh
        command:
        - bash
        env:
        - name: CONFIG_CASSANDRA_SEEDS
          value: ccas-apache-0.ccas-apache.ccas2.svc.cluster.local,ccas-apache-1.ccas-apache.ccas2.svc.cluster.local
        - name: SS_CASSANDRA_CLUSTER_NAME
          value: MyCluster
        - name: SS_CASSANDRA_DATA_CENTER
          value: MyCenter
        - name: SS_CASSANDRA_SETUP_FIREWALL_RULES
          value: "false"
        - name: MAX_HEAP_SIZE
          value: 2g
        - name: HEAP_NEWSIZE
          value: 512m
        - name: SS_CASSANDRA_DATADIR_LV_NAME
          value: CASSANDRA_DD
        - name: SS_CASSANDRA_COMMITLOG_LV_NAME
          value: CASSANDRA_DD
        - name: SS_CASSANDRA_BACKUP_LV_NAME
          value: CASSANDRA_DD
        - name: SS_DEPLOYMENT_TYPE
          value: kube
        - name: CASSANDRA_MIN_CONFIG
          value: "yes"
        - name: REDUCED_MAX_HEAP_SIZE
          value: "128"
        - name: REDUCED_HEAP_NEWSIZE
          value: "12"
        - name: RPC_SERVER_TYPE
          value: hsha
        - name: CONCURRENT_READS
          value: "4"
        - name: CONCURRENT_WRITES
          value: "4"
        - name: COMPACTION_THROUGHPUT
          value: 0MiB/s
        - name: CONCURRENT_COMPACTORS
          value: "1"
        - name: RPC_MIN_THREADS
          value: "4"
        - name: RPC_MAX_THREADS
          value: "4"
        - name: KEY_CACHE_SIZE
          value: 32MiB
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: METRICS_AGENT_NAME
          value: datastax-mcac-agent
        - name: METRICS_AGENT_VERSION
          value: 0.3.5-4.1-beta1
        - name: TZ
          value: UTC
        envFrom:
        - configMapRef:
            name: ccas-apache-env
        image: csf-docker-delivered.repo.cci.nokia.net/ccas-apache:4.1.6-0.2487-rocky8
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - bash
              - -c
              - |
                /opt/cass-tools/scripts/cassandra.sh --stop
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              curl --fail "http://localhost:8080/api/v0/probes/liveness" && [[ ! -f ${SS_CASSANDRA_DATADIR}/../.alarm_3000714 ]]
          failureThreshold: 3
          initialDelaySeconds: 100
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: ccas-apache
        ports:
        - containerPort: 7000
          name: tcp-intra-node
          protocol: TCP
        - containerPort: 7001
          name: tls-intra-node
          protocol: TCP
        - containerPort: 7199
          name: tcp-jmx
          protocol: TCP
        - containerPort: 9042
          name: tcp-cql
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              curl --fail "http://localhost:8080/api/v0/probes/readiness"
          failureThreshold: 3
          initialDelaySeconds: 100
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            ephemeral-storage: 1Gi
            memory: 2Gi
          requests:
            cpu: "1"
            ephemeral-storage: 1Gi
            memory: 1Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 818
          runAsUser: 818
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              curl --fail "http://localhost:8080/api/v0/probes/liveness"
          failureThreshold: 60
          initialDelaySeconds: 60
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 15
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/unified-logging/cpp-api
          name: uniflogging-conf
        - mountPath: /config
          name: cassandra-conf
        - mountPath: /CASSANDRA_DD
          name: data
        - mountPath: /tmp
          name: tmpdir
        - mountPath: /var/run/supervisor
          name: supervisorrundir
        - mountPath: /var/log/supervisor
          name: supervisorlogdir
        - mountPath: /var/log/cassandra
          name: cassandralogdir
        - mountPath: /var/lib/cassandra
          name: cassandralibdir
        - mountPath: /home/cassandra-home
          name: cassandrahomedir
        - mountPath: /secrets
          name: secrets
      - env:
        - name: TZ
          value: UTC
        image: csf-docker-delivered.repo.cci.nokia.net/cbur/cbur-agent:1.3.1-alpine-1444
        imagePullPolicy: IfNotPresent
        name: cbura-sidecar
        resources:
          limits:
            ephemeral-storage: 64Mi
            memory: 256Mi
          requests:
            cpu: 100m
            ephemeral-storage: 64Mi
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 818
          runAsUser: 818
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/unified-logging/cpp-api
          name: uniflogging-conf
        - mountPath: data
          name: data
          subPath: cassandra/backup
        - mountPath: /tmp
          name: cburtmp
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - -c
        - |
          echo "Copying config files to RW volume"
          cp /tmp/config/* /config/
        command:
        - bash
        env:
        - name: TZ
          value: UTC
        image: csf-docker-delivered.repo.cci.nokia.net/tools/kubectl:1.28.13-rocky8-nano-20240828
        imagePullPolicy: IfNotPresent
        name: ccas-apache-config
        resources:
          limits:
            ephemeral-storage: 64Mi
            memory: 64Mi
          requests:
            cpu: 100m
            ephemeral-storage: 64Mi
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 818
          runAsUser: 818
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/unified-logging/cpp-api
          name: uniflogging-conf
        - mountPath: /tmp/config
          name: tmp-cassandra-conf
        - mountPath: /config
          name: cassandra-conf
      - args:
        - -c
        - |
          echo "Copying secrets to RW volume"
          mkdir -p /secrets/ccas
          cp /import/ccas/SS_CASSANDRA_SUPERUSER_PASS /secrets/ccas/SS_CASSANDRA_SUPERUSER_PASS
          cp /import/ccas/SS_CASSANDRA_SUPERUSER_NAME /secrets/ccas/SS_CASSANDRA_SUPERUSER_NAME
          cp /import/k8s-restricted-config.sh /secrets/k8s-restricted-config.sh
          echo "Setting permissions for restricted config script"
          chmod 770 /secrets/k8s-restricted-config.sh
        command:
        - bash
        env:
        - name: TZ
          value: UTC
        image: csf-docker-delivered.repo.cci.nokia.net/tools/kubectl:1.28.13-rocky8-nano-20240828
        imagePullPolicy: IfNotPresent
        name: ccas-apache-initializer
        resources:
          limits:
            ephemeral-storage: 64Mi
            memory: 64Mi
          requests:
            cpu: 100m
            ephemeral-storage: 64Mi
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 818
          runAsUser: 818
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/unified-logging/cpp-api
          name: uniflogging-conf
        - mountPath: /import/ccas/SS_CASSANDRA_SUPERUSER_PASS
          name: ccas-password
          subPath: SS_CASSANDRA_SUPERUSER_PASS
        - mountPath: /import/ccas/SS_CASSANDRA_SUPERUSER_NAME
          name: ccas-username
          subPath: SS_CASSANDRA_SUPERUSER_NAME
        - mountPath: /import/ccas/CLIENT_KEYSTORE_PASSWORD
          name: client-keystore-password
          subPath: CLIENT_KEYSTORE_PASSWORD
        - mountPath: /import/ccas/CLIENT_TRUSTSTORE_PASSWORD
          name: client-truststore-password
          subPath: CLIENT_TRUSTSTORE_PASSWORD
        - mountPath: /import/ccas/SERVER_KEYSTORE_PASSWORD
          name: server-keystore-password
          subPath: SERVER_KEYSTORE_PASSWORD
        - mountPath: /import/ccas/SERVER_TRUSTSTORE_PASSWORD
          name: server-truststore-password
          subPath: SERVER_TRUSTSTORE_PASSWORD
        - mountPath: /import/k8s-restricted-config.sh
          name: k8s-restricted-config
          subPath: k8s-restricted-config.sh
        - mountPath: /secrets
          name: secrets
      - command:
        - sh
        - -c
        - mkdir -p /CASSANDRA_DD/cassandra; chmod -R g+rwX /CASSANDRA_DD/cassandra;
          exit 0
        env:
        - name: TZ
          value: UTC
        image: csf-docker-delivered.repo.cci.nokia.net/tools/kubectl:1.28.13-rocky8-nano-20240828
        imagePullPolicy: IfNotPresent
        name: ccas-apache-init-mountdir
        resources:
          limits:
            ephemeral-storage: 64Mi
            memory: 64Mi
          requests:
            cpu: 100m
            ephemeral-storage: 64Mi
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 818
          runAsUser: 818
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/unified-logging/cpp-api
          name: uniflogging-conf
        - mountPath: /CASSANDRA_DD
          name: data
        - mountPath: /tmp
          name: tmpdir
      - command:
        - sh
        - -c
        - touch /CASSANDRA_DD/file1; ln /CASSANDRA_DD/file1 /CASSANDRA_DD/link1; rm
          -f /CASSANDRA_DD/file1; rm /CASSANDRA_DD/link1
        image: csf-docker-delivered.repo.cci.nokia.net/tools/kubectl:1.28.13-rocky8-nano-20240828
        imagePullPolicy: IfNotPresent
        name: ccas-apache-init-checks
        resources:
          limits:
            ephemeral-storage: 64Mi
            memory: 64Mi
          requests:
            cpu: 100m
            ephemeral-storage: 64Mi
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 818
          runAsUser: 818
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/unified-logging/cpp-api
          name: uniflogging-conf
        - mountPath: /CASSANDRA_DD
          name: data
      restartPolicy: Always
      runtimeClassName: crun
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 818
        fsGroupChangePolicy: OnRootMismatch
        runAsGroup: 818
        runAsNonRoot: true
        runAsUser: 818
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: ccas-apache
      serviceAccountName: ccas-apache
      terminationGracePeriodSeconds: 60
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: log4cxxproperty
            path: log4cxx.property
          name: ccas-apache-log4cxx-server
        name: uniflogging-conf
      - emptyDir: {}
        name: cburtmp
      - configMap:
          defaultMode: 436
          name: ccas-apache
        name: tmp-cassandra-conf
      - emptyDir: {}
        name: cassandra-conf
      - emptyDir: {}
        name: tmpdir
      - emptyDir: {}
        name: supervisorrundir
      - emptyDir: {}
        name: supervisorlogdir
      - emptyDir: {}
        name: cassandralogdir
      - emptyDir: {}
        name: cassandralibdir
      - emptyDir: {}
        name: cassandrahomedir
      - configMap:
          defaultMode: 493
          name: ccas-apache-k8s-restricted-config
        name: k8s-restricted-config
      - name: ccas-password
        secret:
          defaultMode: 420
          items:
          - key: cassandra_superpass
            path: SS_CASSANDRA_SUPERUSER_PASS
          secretName: ccas-apache
      - emptyDir: {}
        name: secrets
      - name: ccas-username
        secret:
          defaultMode: 420
          items:
          - key: cassandra_superuser
            path: SS_CASSANDRA_SUPERUSER_NAME
          secretName: ccas-apache
      - name: client-keystore-password
        secret:
          defaultMode: 420
          items:
          - key: client_keystore_password
            path: CLIENT_KEYSTORE_PASSWORD
          secretName: ccas-apache-cert-key
      - name: client-truststore-password
        secret:
          defaultMode: 420
          items:
          - key: client_truststore_password
            path: CLIENT_TRUSTSTORE_PASSWORD
          secretName: ccas-apache-cert-key
      - name: server-keystore-password
        secret:
          defaultMode: 420
          items:
          - key: server_keystore_password
            path: SERVER_KEYSTORE_PASSWORD
          secretName: ccas-apache-cert-key
      - name: server-truststore-password
        secret:
          defaultMode: 420
          items:
          - key: server_truststore_password
            path: SERVER_TRUSTSTORE_PASSWORD
          secretName: ccas-apache-cert-key
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app: ccas-apache
        heritage: Helm
        release: ccas-apache
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 2
  collisionCount: 0
  currentReplicas: 2
  currentRevision: ccas-apache-6c7ccdb484
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updateRevision: ccas-apache-6c7ccdb484
  updatedReplicas: 2
giuseppe commented 3 weeks ago

could you use strace to find out what syscall is failing?

giuseppe commented 3 weeks ago

There should not be any difference: the seccomp configuration comes from the container engine, and the OCI runtime (crun or runc) just applies it.
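To make that division of labor concrete, here is a minimal sketch of how one could look up the action a seccomp profile assigns to a given syscall name. The `SECCOMP` dict is a hypothetical excerpt shaped like the `linux.seccomp` section of an OCI `config.json`; its syscall list is illustrative and is not the actual containerd default profile. The first-match lookup is a simplification that ignores per-argument conditions.

```python
# Hypothetical excerpt of the linux.seccomp section of an OCI config.json.
# The rule list is illustrative only, not a real RuntimeDefault profile.
SECCOMP = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {"names": ["chmod", "fchmod", "fchmodat"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["read", "write", "openat"], "action": "SCMP_ACT_ALLOW"},
    ],
}


def action_for(seccomp: dict, syscall: str) -> str:
    """Return the action of the first rule naming `syscall`, else the default.

    Simplification: rules that carry `args` conditions are treated as if
    they always match.
    """
    for rule in seccomp.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return seccomp["defaultAction"]


for name in ("fchmodat", "mount"):
    print(f"{name}: {action_for(SECCOMP, name)}")
```

Any syscall that no rule names falls through to `defaultAction`, so running the same lookup against the profile dumped for each runtime would show whether the engine really handed both of them identical rules.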

NoamNakash commented 3 weeks ago

Sure, here is the strace:

[pid 2029049] writev(1, [{iov_base="-rw-r--r-- 818/818          82 2"..., iov_len=69}, {iov_base="\n", iov_len=1}], 2) = 70
[pid 2029049] openat(4, "data/last_archive.log", O_WRONLY|O_CREAT|O_EXCL|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_CLOEXEC, 0644) = 5
[pid 2029049] write(5, "/CASSANDRA_DD/cassandra/data/arc"..., 82) = 82
[pid 2029049] fstat(5, {st_mode=S_IFREG|0644, st_size=82, ...}) = 0
[pid 2029049] utimensat(5, NULL, [{tv_sec=1730906704, tv_nsec=342102079} /* 2024-11-06T10:25:04.342102079-0500 */, {tv_sec=1730869218, tv_nsec=0} /* 2024-11-06T00:00:18-0500 */], 0) = 0
[pid 2029049] close(5)                  = 0
[pid 2029049] writev(1, [{iov_base="-rw-r--r-- 818/818           0 2"..., iov_len=88}, {iov_base="\n", iov_len=1}], 2) = 89
[pid 2029049] openat(4, "data/MyCenter__1730869213546__SCHEMA.cql", O_WRONLY|O_CREAT|O_EXCL|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_CLOEXEC, 0644) = 5
[pid 2029049] fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 2029049] utimensat(5, NULL, [{tv_sec=1730906704, tv_nsec=342102079} /* 2024-11-06T10:25:04.342102079-0500 */, {tv_sec=1730869220, tv_nsec=0} /* 2024-11-06T00:00:20-0500 */], 0) = 0
[pid 2029049] close(5)                  = 0
[pid 2029049] close(3)                  = 0
[pid 2029049] wait4(222, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 222
[pid 2029049] munmap(0x7f7e2b51c000, 16384) = 0
[pid 2029049] utimensat(4, "data", [UTIME_OMIT, {tv_sec=1730869220, tv_nsec=0} /* 2024-11-06T00:00:20-0500 */], AT_SYMLINK_NOFOLLOW) = 0
[pid 2029049] syscall_0x1c4(0x4, 0x7f7e2b528b40, 0x5fd, 0x100, 0x100, 0x7f7e2b528b40) = -1 EPERM (Operation not permitted)
[pid 2029049] fcntl(1, F_GETFL)         = 0x8002 (flags O_RDWR|O_LARGEFILE)
[pid 2029049] writev(2, [{iov_base="tar: ", iov_len=5}, {iov_base=NULL, iov_len=0}], 2) = 5
[pid 2029049] writev(2, [{iov_base="data: Cannot change mode to rwxr"..., iov_len=37}, {iov_base=NULL, iov_len=0}], 2) = 37
[pid 2029049] writev(2, [{iov_base=": Operation not permitted", iov_len=25}, {iov_base=NULL, iov_len=0}], 2) = 25
[pid 2029049] writev(2, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
[pid 2029049] fcntl(1, F_GETFL)         = 0x8002 (flags O_RDWR|O_LARGEFILE)
[pid 2029049] writev(2, [{iov_base="tar: ", iov_len=5}, {iov_base=NULL, iov_len=0}], 2) = 5
[pid 2029049] writev(2, [{iov_base="Exiting with failure status due "..., iov_len=50}, {iov_base=NULL, iov_len=0}], 2) = 50
[pid 2029049] writev(2, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
[pid 2029049] close(1)                  = 0
[pid 2029049] close(2)                  = 0
[pid 2029049] exit_group(2)             = ?

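The failing call seems to be [pid 2029049] syscall_0x1c4(0x4, 0x7f7e2b528b40, 0x5fd, 0x100, 0x100, 0x7f7e2b528b40) = -1 EPERM (Operation not permitted). strace prints only the raw number because it does not know this syscall; 0x1c4 is 452, which is fchmodat2 on x86_64, so this is a chmod-type call rather than a read.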
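The raw arguments on that strace line can be decoded with a few lines of Python. This is a minimal sketch: the syscall table is a hand-copied excerpt of the x86_64 numbers for the chmod family only, and `AT_SYMLINK_NOFOLLOW` is hardcoded to its Linux value so the snippet does not depend on the host platform.

```python
import stat

# Raw values copied from the failing strace line.
SYSCALL_NR = 0x1C4
MODE = 0x5FD
FLAGS = 0x100

# Assumption: hand-copied excerpt of the x86_64 syscall table,
# limited to the chmod family. Extend it for other numbers.
X86_64_SYSCALLS = {90: "chmod", 91: "fchmod", 268: "fchmodat", 452: "fchmodat2"}

# Linux value of AT_SYMLINK_NOFOLLOW from <fcntl.h>.
AT_SYMLINK_NOFOLLOW = 0x100


def describe(nr: int, mode: int, flags: int) -> str:
    """Render a chmod-family call in a human-readable form."""
    name = X86_64_SYSCALLS.get(nr, f"unknown({nr})")
    # S_IFDIR is added only so stat.filemode accepts the value; the
    # leading file-type character is then dropped.
    perms = stat.filemode(stat.S_IFDIR | mode)[1:]
    flag_text = "AT_SYMLINK_NOFOLLOW" if flags == AT_SYMLINK_NOFOLLOW else hex(flags)
    return f"{name}(mode={oct(mode)} {perms}, flags={flag_text})"


print(describe(SYSCALL_NR, MODE, FLAGS))
```

The decoded mode, rwxrwsr-x, is the same one tar names in its "Cannot change mode" message, which ties the EPERM to that specific call.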

Using crictl inspect, we confirmed that the container has the same seccomp profile under both runtimes. Even so, the restore fails with this error under crun unless we change the seccompProfile type, which differs from the behavior we have had with runc until now.

hajnalmt commented 3 weeks ago

Hello @giuseppe, Thank you for checking the issue!

I think crun mounts the volume slightly differently than runc, and that this interacts with tar in a way that breaks our general backup and restore procedures. (The same is probably true for the rootfs, since even the shell prompt differs between the two containers, which I find quite surprising. That is not a problem in itself, but these are two containers with the same image and config, and the only difference is the runtime binary. Anyway, let's move on.)
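One way to test the mount hypothesis is to diff the two `mount` listings mechanically instead of by eye. The sketch below parses lines of the form `SRC on TARGET type FSTYPE (OPTS)` and reports targets whose filesystem type or options differ. It ignores the source device and the SELinux `context=` option, which are expected to vary per pod. The sample lines are copied from the two listings in this comment; the helper names are hypothetical.

```python
import re

LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)$")
# Split options on commas that are not inside double quotes
# (context="...c841,c892" contains a comma).
OPTION = re.compile(r'(?:[^,"]|"[^"]*")+')


def parse_mounts(text: str) -> dict:
    """Map mount target -> (fstype, frozenset of options)."""
    mounts = {}
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        _src, target, fstype, opts = m.groups()
        options = {o for o in OPTION.findall(opts) if not o.startswith("context=")}
        mounts[target] = (fstype, frozenset(options))
    return mounts


def diff_mounts(a: dict, b: dict) -> dict:
    """Return targets whose (fstype, options) differ between a and b."""
    return {
        t: (a.get(t), b.get(t))
        for t in sorted(set(a) | set(b))
        if a.get(t) != b.get(t)
    }


CRUN = """
/dev/vdn on /data type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,relatime,size=65536k,inode64)
"""
RUNC = """
/dev/vdm on /data type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,inode64)
"""

for target, (crun, runc) in diff_mounts(parse_mounts(CRUN), parse_mounts(RUNC)).items():
    print(target, "crun:", crun, "runc:", runc)
```

On these sample lines only /dev/shm is reported. The /data volume carries identical options under both runtimes, so the mount options alone would not explain the tar failure.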

I also managed to reproduce this with an empty directory, and captured an strace of the tar and untar processes for both the runc and the crun cases. Please find the outputs below:

First the setup:

uname -a
Linux gate-fi607-03-controller-01.tesla.com 5.14.0-427.33.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 16 10:56:24 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="9.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.4 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
crun --version
crun version 1.14.3
commit: 1961d211ba98f532ea52d2e80f4c20359f241a98
rundir: /run/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.18.7
libseccomp: 2.5.2

The container running in ccas1 namespace uses crun, the other one in ccas2 namespace uses runc.

The full output from both containers follows. Information from the container in the crun - ccas1 case:

kubectl exec -it -n ccas1 ccas-apache-0 -c cbura-sidecar -- sh
/ $ 
/ $ tar --version
tar (GNU tar) 1.35
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by John Gilmore and Jay Fenlason.
/ $ pwd
/

/ $ ls -la
total 76
drwxr-xr-x    1 root     root          4096 Nov  6 12:54 .
drwxr-xr-x    1 root     root          4096 Nov  6 12:54 ..
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 bin
drwxrwsr-x    2 root     818           4096 Nov  7 09:20 data
drwxr-xr-x    5 root     root           360 Nov  6 12:54 dev
drwxr-xr-x    1 root     root          4096 Nov  6 12:54 etc
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 home
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 lib
drwxr-xr-x    5 root     root          4096 Jul 22 14:34 media
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 mnt
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 opt
dr-xr-xr-x  709 root     root             0 Nov  6 12:54 proc
drwx------    2 root     root          4096 Jul 22 14:34 root
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 run
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 sbin
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 srv
dr-xr-xr-x   13 root     root             0 Nov  3 11:40 sys
drwxrwsrwx    3 root     818             71 Nov  7 11:10 tmp
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 usr
drwxr-xr-x    1 root     root          4096 Jul 22 14:34 var
/ $ id
uid=818 gid=818 groups=818
/ $ cd /tmp/
/tmp $ ls -lah /data
total 12K    
drwxrwsr-x    2 root     818         4.0K Nov  7 09:20 .
drwxr-xr-x    1 root     root        4.0K Nov  6 12:54 ..
/tmp $ mount 
overlay on / type overlay (ro,context="system_u:object_r:container_file_t:s0:c841,c892",relatime,lowerdir=/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/234/fs:/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/233/fs,upperdir=/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/385/fs,workdir=/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/385/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,context="system_u:object_r:container_file_t:s0:c841,c892",nosuid,size=65536k,mode=755,inode64)
devpts on /dev/pts type devpts (rw,context="system_u:object_r:container_file_t:s0:c841,c892",nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,seclabel,nosuid,nodev,noexec,relatime)
cgroup2 on /sys/fs/cgroup type cgroup2 (ro,seclabel,nosuid,nodev,noexec,relatime)
/dev/vdn on /data type ext4 (rw,seclabel,relatime)
/dev/vda1 on /tmp type xfs (rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vdb on /etc/resolv.conf type ext4 (ro,seclabel,relatime)
/dev/vda1 on /etc/hosts type xfs (rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vda1 on /dev/termination-log type xfs (rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vdb on /etc/hostname type ext4 (ro,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,relatime,size=65536k,inode64)
/dev/vda1 on /etc/unified-logging/cpp-api type xfs (ro,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
tmpfs on /proc/acpi type tmpfs (ro,context="system_u:object_r:container_file_t:s0:c841,c892",relatime,size=0k,inode64)
devtmpfs on /proc/kcore type devtmpfs (ro,seclabel,size=4096k,nr_inodes=4099235,mode=755,inode64)
devtmpfs on /proc/keys type devtmpfs (ro,seclabel,size=4096k,nr_inodes=4099235,mode=755,inode64)
devtmpfs on /proc/timer_list type devtmpfs (ro,seclabel,size=4096k,nr_inodes=4099235,mode=755,inode64)
tmpfs on /proc/scsi type tmpfs (ro,context="system_u:object_r:container_file_t:s0:c841,c892",relatime,size=0k,inode64)
tmpfs on /sys/firmware type tmpfs (ro,context="system_u:object_r:container_file_t:s0:c841,c892",relatime,size=0k,inode64)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)

Information from the container in the runc - ccas2 case:

kubectl exec -it -n ccas2 ccas-apache-0 -c cbura-sidecar -- sh
~ $ 
~ $ tar --version
tar (GNU tar) 1.35
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by John Gilmore and Jay Fenlason.
~ $ ls -la 
total 76
drwxr-xr-x    1 root     root          4096 Nov  5 12:29 .
drwxr-xr-x    1 root     root          4096 Nov  5 12:29 ..
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 bin
drwxrwsr-x    2 root     818           4096 Nov  7 09:19 data
drwxr-xr-x    5 root     root           360 Nov  5 12:29 dev
drwxr-xr-x    1 root     root          4096 Nov  5 12:29 etc
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 home
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 lib
drwxr-xr-x    5 root     root          4096 Jul 22 14:34 media
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 mnt
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 opt
dr-xr-xr-x  713 root     root             0 Nov  5 12:29 proc
drwx------    2 root     root          4096 Jul 22 14:34 root
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 run
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 sbin
drwxr-xr-x    2 root     root          4096 Jul 22 14:34 srv
dr-xr-xr-x   13 root     root             0 Nov  3 11:40 sys
drwxrwsrwx    3 root     818             71 Nov  7 11:09 tmp
drwxr-xr-x    1 root     root          4096 Aug  9 21:52 usr
drwxr-xr-x    1 root     root          4096 Jul 22 14:34 var
~ $ id 
uid=818 gid=818 groups=818
~ $ 
/ $ cd /tmp/
/tmp $ ls -lah /data
total 12K    
drwxrwsr-x    2 root     818         4.0K Nov  7 09:20 .
drwxr-xr-x    1 root     root        4.0K Nov  6 12:54 ..
/tmp $ mount 
overlay on / type overlay (ro,context="system_u:object_r:container_file_t:s0:c255,c548",relatime,lowerdir=/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/234/fs:/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/233/fs,upperdir=/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/358/fs,workdir=/data0/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/358/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,context="system_u:object_r:container_file_t:s0:c255,c548",nosuid,size=65536k,mode=755,inode64)
devpts on /dev/pts type devpts (rw,context="system_u:object_r:container_file_t:s0:c255,c548",nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,seclabel,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup type cgroup2 (ro,seclabel,nosuid,nodev,noexec,relatime)
/dev/vdm on /data type ext4 (rw,seclabel,relatime)
/dev/vda1 on /tmp type xfs (rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vdb on /etc/resolv.conf type ext4 (ro,seclabel,relatime)
/dev/vda1 on /etc/hosts type xfs (rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vda1 on /dev/termination-log type xfs (rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vdb on /etc/hostname type ext4 (ro,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,inode64)
/dev/vda1 on /etc/unified-logging/cpp-api type xfs (ro,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,context="system_u:object_r:container_file_t:s0:c255,c548",relatime,inode64)
tmpfs on /proc/kcore type tmpfs (rw,context="system_u:object_r:container_file_t:s0:c255,c548",nosuid,size=65536k,mode=755,inode64)
tmpfs on /proc/keys type tmpfs (rw,context="system_u:object_r:container_file_t:s0:c255,c548",nosuid,size=65536k,mode=755,inode64)
tmpfs on /proc/timer_list type tmpfs (rw,context="system_u:object_r:container_file_t:s0:c255,c548",nosuid,size=65536k,mode=755,inode64)
tmpfs on /proc/scsi type tmpfs (ro,context="system_u:object_r:container_file_t:s0:c255,c548",relatime,inode64)
tmpfs on /sys/firmware type tmpfs (ro,context="system_u:object_r:container_file_t:s0:c255,c548",relatime,inode64)

The interesting directory is /data with the permissions:

drwxrwsr-x    2 root     818           4096 Nov  7 09:19 data

It's empty for both of the tests.

The tar and untar output in the two test cases:

crun:

/tmp $ tar -p --use-compress-program="gzip -6" -cvf example-backup.tar.gz  --exclude=/data/lost+found /data
tar: Removing leading `/' from member names
/data/
/tmp $ mkdir untar_dir
/tmp $ ls -la
total 16
drwxrwsrwx    4 root     818             64 Nov  8 08:45 .
drwxr-xr-x    1 root     root          4096 Nov  6 12:54 ..
-rw-r--r--    1 818      818            110 Nov  7 13:00 example-backup.tar.gz
drwxr-sr-x    3 818      818             18 Nov  7 13:17 untar_dir
/tmp $ tar  -p --use-compress-program="gzip -d" -xvf example-backup.tar.gz -C untar_dir
data/
tar: data: Cannot change mode to rwxrwsr-x: Operation not permitted
tar: Exiting with failure status due to previous errors
/tmp $ cd untar_dir
/tmp/untar_dir $ ls -la
total 0
drwxr-sr-x    3 818      818             18 Nov  7 13:17 .
drwxrwsrwx    4 root     818             64 Nov  7 13:00 ..
drwx--S---    2 818      818              6 Nov  7 09:20 data

runc:

/tmp $ ls -la
total 12
drwxrwsrwx    4 root     818             64 Nov  7 13:07 .
drwxr-xr-x    1 root     root          4096 Nov  5 12:29 ..
drwxrwsr-x    3 818      818            142 Nov  7 12:24 cbur
-rw-r--r--    1 818      818            110 Nov  7 13:07 example-backup.tar.gz
drwxr-sr-x    2 818      818              6 Nov  7 12:51 untar_dir
/tmp $ tar  -p --use-compress-program="gzip -d" -xvf example-backup.tar.gz -C untar_dir
data/
/tmp $ cd untar_dir
/tmp/untar_dir $ ls -lah
total 0      
drwxr-sr-x    3 818      818           18 Nov  7 13:13 .
drwxrwsrwx    4 root     818           64 Nov  7 13:07 ..
drwxrwsr-x    2 818      818            6 Nov  7 09:19 data
/tmp/untar_dir $

The output directory's permissions became

drwx--S---    2 818      818              6 Nov  7 09:20 data

in the crun case, and the tar command failed. Perhaps tar archived the directory with the wrong permissions in the first place. What should I check next? Do you have any suggestions?

Attached strace outputs: crun-tar-output.txt crun-untar-output.txt runc-tar-output.txt runc-untar-output.txt

hajnalmt commented 3 weeks ago

Oh, I realized that the strace output is quite telling.

The syscall 0x1c4 returns different results in the two containers:

The strace output for the runc case:

791322 mkdirat(4, "data", 0700)         = 0
791322 close(3)                         = 0
791322 wait4(293, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 293
791322 munmap(0x7f0e693db000, 16384)    = 0
791322 utimensat(4, "data", [UTIME_OMIT, {tv_sec=1730971192, tv_nsec=0} /* 2024-11-07T04:19:52-0500 */], AT_SYMLINK_NOFOLLOW) = 0
791322 syscall_0x1c4(0x4, 0x7f0e693e7b40, 0x5fd, 0x100, 0x100, 0x7f0e693e7b40) = -1 ENOSYS (Function not implemented)
791322 newfstatat(4, "data", {st_mode=S_IFDIR|S_ISGID|0700, st_size=6, ...}, AT_SYMLINK_NOFOLLOW) = 0
791322 openat(4, "data", O_RDONLY|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 3
791322 stat("/proc/self/fd/3", {st_mode=S_IFDIR|S_ISGID|0700, st_size=6, ...}) = 0
791322 fchmodat(AT_FDCWD, "/proc/self/fd/3", 02775) = 0
791322 close(3)                         = 0
791322 close(1)                         = 0
791322 close(2)                         = 0
791322 exit_group(0)                    = ?

The crun one:

801373 mkdirat(4, "data", 0700)         = 0
801373 close(3)                         = 0
801373 wait4(342, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 342
801373 munmap(0x7f2b4fc9b000, 16384)    = 0
801373 utimensat(4, "data", [UTIME_OMIT, {tv_sec=1730971240, tv_nsec=0} /* 2024-11-07T04:20:40-0500 */], AT_SYMLINK_NOFOLLOW) = 0
801373 syscall_0x1c4(0x4, 0x7f2b4fca7b40, 0x5fd, 0x100, 0x100, 0x7f2b4fca7b40) = -1 EPERM (Operation not permitted)
801373 fcntl(1, F_GETFL)                = 0x8002 (flags O_RDWR|O_LARGEFILE)
801373 writev(2, [{iov_base="tar: ", iov_len=5}, {iov_base=NULL, iov_len=0}], 2) = 5
801373 writev(2, [{iov_base="data: Cannot change mode to rwxr"..., iov_len=37}, {iov_base=NULL, iov_len=0}], 2) = 37
801373 writev(2, [{iov_base=": Operation not permitted", iov_len=25}, {iov_base=NULL, iov_len=0}], 2) = 25
801373 writev(2, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
801373 fcntl(1, F_GETFL)                = 0x8002 (flags O_RDWR|O_LARGEFILE)
801373 writev(2, [{iov_base="tar: ", iov_len=5}, {iov_base=NULL, iov_len=0}], 2) = 5
801373 writev(2, [{iov_base="Exiting with failure status due "..., iov_len=50}, {iov_base=NULL, iov_len=0}], 2) = 50
801373 writev(2, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
801373 close(1)                         = 0
801373 close(2)                         = 0
801373 exit_group(2)    

I also realized that this syscall is not known on the system:

ausyscall --dump
Using x86_64 syscall table:
0   read
1   write
...
450 set_mempolicy_home_node
451 cachestat

ausyscall expects the number in decimal, and 0x1c4 is 452 in decimal, which is missing from the table above.
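The conversion above is easy to double-check. A minimal sketch in Python follows; the name lookup is an assumption taken from the upstream x86_64 syscall table (where 452 is `fchmodat2`, added in Linux 6.6), not something read from this host, and `KNOWN_NEW_SYSCALLS` is a hypothetical helper table for illustration only.

```python
# strace prints syscalls it cannot name as syscall_0x<hex>;
# ausyscall lists them by decimal number.
syscall_hex = "0x1c4"
syscall_nr = int(syscall_hex, 16)
print(syscall_nr)  # 452

# Assumption: names copied from the upstream x86_64 syscall table,
# not queried from the node where the trace was taken.
KNOWN_NEW_SYSCALLS = {451: "cachestat", 452: "fchmodat2"}
print(KNOWN_NEW_SYSCALLS.get(syscall_nr, "unknown"))
```

If that mapping holds here, the tar binary in the image is trying a newer chmod syscall than the one the local `ausyscall` table knows about.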

This also explains why the Unconfined seccomp profile solved it... But why does this differ between the two container runtimes?

giuseppe commented 3 weeks ago

Ah, I think that is because runc "monkey patches" the seccomp profile to return ENOSYS by default, while crun expects the profile to be correct.

I assume you are using containerd for your cluster? Because CRI-O would just specify ENOSYS as the default action for the seccomp profile: https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json#L4
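Why the default errno matters can be sketched as a small model. This is a simplified illustration based on the two strace outputs above, not the actual tar or libc code; `chmod_with_fallback` and its `blocked_errno` parameter are hypothetical names.

```python
import errno

ENOSYS = errno.ENOSYS  # "Function not implemented"
EPERM = errno.EPERM    # "Operation not permitted"


def chmod_with_fallback(blocked_errno: int) -> str:
    """Model a caller that tries a new syscall and falls back only on ENOSYS.

    blocked_errno stands for the errno the seccomp filter returns for a
    syscall that is not on its allow list (hypothetical parameter).
    """
    # The new syscall (syscall_0x1c4 in the traces) is rejected by the filter.
    if blocked_errno == ENOSYS:
        # The caller concludes the kernel lacks the syscall and retries
        # with the older fchmodat path, as in the runc trace.
        return "fallback to fchmodat: ok"
    # Any other errno looks like a genuine failure, as in the crun trace.
    return "error: " + errno.errorcode[blocked_errno]


print(chmod_with_fallback(ENOSYS))  # runc-like outcome
print(chmod_with_fallback(EPERM))   # outcome with an EPERM default action
```

With ENOSYS the caller retries through the older path and succeeds; with EPERM it reports the failure straight away, which matches the `Cannot change mode` error seen under crun.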

hajnalmt commented 3 weeks ago

Yes, we are using containerd!

hajnalmt commented 3 weeks ago

Thank you @giuseppe, I think we can close this one then. Do you happen to know whether we can configure this for containerd too? I found that the default action is errno: https://github.com/containerd/containerd/blob/f0a32c66dad1e9de716c9960af806105d691cd78/contrib/seccomp/seccomp_default.go#L456 but I could not find a setting for it in any of the configs.

giuseppe commented 3 weeks ago

Yes, I think we can close it, as it is a known difference. I disagree with the way runc does it, and I'd rather not change crun, which expects the correct configuration to be passed in.
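For reference, a "correct configuration" in OCI runtime-spec terms would presumably carry the errno in the seccomp section itself, through the `defaultErrnoRet` field. The sketch below only shows the shape of such a profile. The trimmed `example` profile and the `with_enosys_default` helper are hypothetical, and whether containerd passes such a field through from a custom profile is an assumption that was not confirmed in this thread.

```python
import json

ENOSYS_LINUX = 38  # Linux errno value for ENOSYS


def with_enosys_default(profile: dict) -> dict:
    """Return a copy of an OCI-style seccomp profile whose default errno is ENOSYS."""
    patched = dict(profile)
    # Only meaningful when the default action returns an errno at all.
    if patched.get("defaultAction") == "SCMP_ACT_ERRNO":
        patched["defaultErrnoRet"] = ENOSYS_LINUX
    return patched


# Hypothetical, heavily trimmed profile; a real one lists many more syscalls.
example = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["fchmodat"], "action": "SCMP_ACT_ALLOW"}],
}
print(json.dumps(with_enosys_default(example), indent=2))
```

The helper leaves the input untouched and only adds the field when the default action is `SCMP_ACT_ERRNO`, so profiles with other default actions come back unchanged.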