edgelesssys / marblerun

MarbleRun is the control plane for confidential computing. Deploy, scale, and verify your confidential microservices on vanilla Kubernetes. 100% Go, 100% cloud native, 100% confidential.
https://marblerun.sh

Edgeless Runtime Container deployment fails #228

Closed ratnadeepb closed 3 years ago

ratnadeepb commented 3 years ago

Issue description

Deploying the Edgeless runtime container. I was trying to deploy my code in a container and it was continuously restarting, so I attempted to deploy ghcr.io/edgelesssys/edgelessrt-deploy:latest by itself. It exhibits the same behavior.

To reproduce

Steps to reproduce the behavior:

  1. Pod YAML:
    apiVersion: v1
    kind: Pod
    metadata:
      name: static-web
    spec:
      containers:
      - name: web
        image: ghcr.io/edgelesssys/edgelessrt-deploy:latest
  2. kubectl apply -f pod-test.yaml
  3. kubectl describe pods static-web
    Name:         static-web
    Namespace:    default
    Priority:     0
    Node:         <node-name>
    Start Time:   Fri, 06 Aug 2021 21:10:32 +0000
    Labels:       <none>
    Annotations:  <none>
    Status:       Running
    IP:           <ip>
    IPs:
      IP:  <ip>
    Containers:
      web:
        Container ID:   containerd://f6d96345df7803b1725ad40b19dd5aa66b7628c5fe37bb247ad4557c28c428da
        Image:          ghcr.io/edgelesssys/edgelessrt-deploy:latest
        Image ID:       ghcr.io/edgelesssys/edgelessrt-deploy@sha256:d622febf6c92c7a0062fea1dee20f5d0a35a386167888a39936129df87466cf3
        Port:           80/TCP
        Host Port:      0/TCP
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Fri, 06 Aug 2021 21:16:19 +0000
          Finished:     Fri, 06 Aug 2021 21:16:19 +0000
        Ready:          False
        Restart Count:  6
        Environment:    <none>
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-tkhr8 (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      default-token-tkhr8:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  default-token-tkhr8
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                     node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
    Type     Reason     Age                    From               Message
    ----     ------     ----                   ----               -------
    Normal   Scheduled  6m10s                  default-scheduler  Successfully assigned default/static-web to <node-name>
    Normal   Pulled     6m9s                   kubelet            Successfully pulled image "ghcr.io/edgelesssys/edgelessrt-deploy:latest" in 520.117841ms
    Normal   Pulled     6m8s                   kubelet            Successfully pulled image "ghcr.io/edgelesssys/edgelessrt-deploy:latest" in 204.802328ms
    Normal   Pulled     5m53s                  kubelet            Successfully pulled image "ghcr.io/edgelesssys/edgelessrt-deploy:latest" in 236.375905ms
    Normal   Created    5m24s (x4 over 6m9s)   kubelet            Created container web
    Normal   Started    5m24s (x4 over 6m9s)   kubelet            Started container web
    Normal   Pulled     5m24s                  kubelet            Successfully pulled image "ghcr.io/edgelesssys/edgelessrt-deploy:latest" in 206.955528ms
    Normal   Pulling    4m34s (x5 over 6m10s)  kubelet            Pulling image "ghcr.io/edgelesssys/edgelessrt-deploy:latest"
    Warning  BackOff    58s (x25 over 6m7s)    kubelet            Back-off restarting failed container


Nirusu commented 3 years ago

Could you please provide logs with kubectl logs static-web --previous?

A wild guess from my side would be that your Pod configuration lacks the required MarbleRun information.

See the Deploy your service with Kubernetes guide we have, or the general Kubernetes integration one on how to integrate your application into the Kubernetes cluster properly.
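For orientation, the MarbleRun-specific additions those guides describe boil down to a couple of labels on the Pod. A minimal sketch (the marble type "test" is a placeholder that must match a Marble in your MarbleRun manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    # lets the marble-injector webhook inject the required environment
    marblerun/inject: enabled
    # must match a Marble defined in your MarbleRun manifest
    marblerun/marbletype: test
spec:
  containers:
    - name: web
      image: ghcr.io/edgelesssys/edgelessrt-deploy:latest
```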

But yeah, to be sure, logs would be great :)

ratnadeepb commented 3 years ago

The output is blank:

$ kubectl logs static-web --previous
$

I had already run:

marblerun namespace add default

Regarding the manifest, based on this discussion, my understanding is that I need to define the Makefile, the manifest.template, and the Dockerfile. Once the image is built, I can deploy the container and extract the enclave details to write the manifest.json. I ran into the same issue trying to deploy my container, which is when I attempted to deploy the runtime container by itself.

I attempted the same with a named namespace with the same result.

Nirusu commented 3 years ago

Oh, so this is a follow-up from the old discussion.

The Edgeless RT deploy container is not suitable for running a Graphene application by default. The changes we made in the Redis sample used the Edgeless RT deploy container as the base for the Dockerfile, but for the Kubernetes deployment we ultimately use an image in which we build Graphene, build Redis, and then define graphene-sgx pointing to Redis with our MarbleRun LibOS premain process as the entrypoint.

What I would recommend you to do, step-by-step, is to:

  1. Build your application with Graphene; leave MarbleRun out of focus at first
  2. Build a Docker container which installs/builds Graphene, builds your project, and can be used to directly launch your Graphene application via docker run. You can use plain Ubuntu as a base for your Dockerfile if you like; in the end it just needs the required SGX and Graphene components for building and running. You can use our graphene-redis Dockerfile as an orientation for this step (though we use the edgelessrt-deploy container as base, which already contains a bunch of the SGX dependencies).
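A rough skeleton of such a Dockerfile, assuming the edgelessrt-deploy base mentioned above (the package list, commit pinning, and paths are illustrative, not taken from the sample):

```dockerfile
# Sketch only: base image with most SGX runtime libraries preinstalled
FROM ghcr.io/edgelesssys/edgelessrt-deploy:latest

# Illustrative subset of Graphene's build dependencies
RUN apt-get update && apt-get install -y \
    git build-essential autoconf gawk bison python3 meson ninja-build

# Build Graphene with SGX support (pin a known-good commit in practice)
RUN git clone https://github.com/oscarlab/graphene.git /graphene
WORKDIR /graphene
RUN make SGX=1

# Build your application plus its Graphene manifest here, then launch it directly
WORKDIR /app
ENTRYPOINT ["graphene-sgx", "your-app"]
```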

These two steps have nothing to do with MarbleRun yet. They focus solely on Graphene and getting your application to run without any MarbleRun additions on a local machine, without any cluster or Kubernetes involved so far. Basically, we are recreating what GSC does internally.

When you get this running, you can already check whether you can extract or retrieve the SGX signature values from your application within the container. If you completely automate the build process within the container (ideally with a signing key imported externally as a Docker secret), you might even get them at build time of your Docker container. If you want to retrieve them after you build the container, you can refer to @m1ghtym0's comments on copying the .sig file and using graphene-sgx-get-token to achieve this afterwards.
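If you go the build-time route, the signing step inside the Dockerfile could look roughly like this (file names and the key path are placeholders; flag names follow the Graphene tooling of that era):

```dockerfile
# Sketch: sign the manifest with an externally provided key (BuildKit secret),
# then derive the token so the signature values exist at build time
RUN --mount=type=secret,id=signingkey,dst=/keys/enclave-key.pem,required=true \
    graphene-sgx-sign --manifest your-app.manifest \
                      --output your-app.manifest.sgx \
                      --key /keys/enclave-key.pem \
 && graphene-sgx-get-token --sig your-app.sig --output your-app.token
```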

If these steps worked for you and you were able to retrieve the SGX signature values, then I would recommend you to continue adapting this application to run with MarbleRun & Kubernetes.

  1. Change your Graphene application build process to include the MarbleRun LibOS premain. The documentation includes the changes you need to make to Graphene's manifest.template. Alternatively, you can use the same MarbleRun CLI tool which you used for adding the namespace to perform these changes (unless Graphene's recent changes have broken something...)
  2. Rebuild your Docker image with these changes and extract the SGX signature values again. Test if you can launch your container (it should fail within the LibOS premain process, but that's fine at this point! At least it launches from Graphene!).
  3. Define a MarbleRun manifest.json with the retrieved SGX signature values, and deploy it to the MarbleRun coordinator with the CLI tool.
  4. Define your pod YAML with your Docker image (not the Edgeless RT deploy one!) and make sure our marble-injector can inject the required host environment variables (e.g.: by specifying the label marblerun/marbletype).
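For step 3, a minimal manifest.json might be sketched as follows (all IDs and names are placeholders; check the MarbleRun manifest documentation for the exact schema):

```json
{
  "Packages": {
    "myapp": {
      "SignerID": "<MRSIGNER value retrieved from the .sig file>",
      "ProductID": 1,
      "SecurityVersion": 1
    }
  },
  "Marbles": {
    "test": {
      "Package": "myapp"
    }
  }
}
```

The marble name used here ("test") is what the marblerun/marbletype label in step 4 would reference.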

Hopefully, this is somewhat correct; I just wrote it down from how I would do it. I recommend you do this step-by-step, and if you get stuck somewhere, please tell us at which point exactly it fails (e.g. your application runs with Graphene, you build a Docker image which launches your Graphene application, but you cannot get it working as a Marble).

Otherwise, with a lack of logs and a lack of source code, it's a bit hard to give exact advice on what is going wrong. The more details you can provide, the better we can try to help!

ratnadeepb commented 3 years ago

I am actually slightly confused now. I used the Dockerfile in the redis example to build from. So would that not work?

If I use Ubuntu instead of edgelessrt-deploy, I'd have to build SGX capabilities inside the container. Do you have any pointers on how to do that? Should I just follow Intel's linux-sgx repo documentation for a docker build?

daniel-weisse commented 3 years ago

Could you share with us the Dockerfile you are using to build your image? This would be very helpful in identifying the issue.

Regarding edgelessrt-deploy: This image is based on Ubuntu 18.04 with Edgeless RT and most of the Intel SGX libraries preinstalled. You can find the source code here: https://github.com/edgelesssys/edgelessrt/blob/master/Dockerfile. If you want to run a program using Graphene, you will need to install all of your program's and Graphene's dependencies.

Nirusu commented 3 years ago

Actually, I do not think it makes much of a difference whether you use edgelessrt-deploy or plain Ubuntu as the base. You might need to install a couple more libsgx packages, but apart from that you should be able to adapt most of the steps from lines 13-32 to set up Graphene:

https://github.com/edgelesssys/marblerun/blob/dd50a0c27bfc4d4b901d35558f63371276188230/samples/graphene-redis/Dockerfile

You might also give GSC a shot, though none of us has really tested it in practice because it has lagged behind in development for most of the time... However, I do not see why it should not work.

ratnadeepb commented 3 years ago

I tried using gsc for the build. However, the signing fails. Couldn't figure out why! I raised the issue with graphene: https://github.com/oscarlab/graphene/issues/2636

ratnadeepb commented 3 years ago

In the meantime, I tried to build it from this Dockerfile:

FROM alpine/git:latest AS pull
RUN git clone https://github.com/edgelesssys/marblerun.git /premain

FROM ghcr.io/edgelesssys/edgelessrt-dev AS build-premain
COPY --from=pull /premain /premain
WORKDIR /premain/build
RUN cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
RUN make premain-libos

FROM ghcr.io/edgelesssys/edgelessrt-deploy:latest AS release
RUN apt-get update && apt-get install -y git meson build-essential autoconf gawk bison wget python3 libcurl4-openssl-dev \
    python3-protobuf libprotobuf-c-dev protobuf-c-compiler python3-pip software-properties-common python3-click python3-jinja2
RUN wget -qO- https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add
RUN add-apt-repository 'deb [arch=amd64] https://download.01.org/intel-sgx/sgx_repo/ubuntu bionic main'
RUN apt-get install -y libsgx-quote-ex-dev libsgx-aesm-launch-plugin
RUN python3 -m pip install "toml>=0.10"
RUN python3 -m pip install --upgrade tensorflow

ENV TZ=America/New_York

RUN apt-get update && apt-get install -y \
        libsm6 \
        libxext6 \
        libxrender-dev

RUN pip3 install \
        keras==2.2.4 \
        pillow \
        matplotlib \
        pandas \
        xlrd \
        openpyxl \
        xlsxwriter \
        imageio

RUN git clone https://github.com/intel/SGXDataCenterAttestationPrimitives.git /SGXDriver
WORKDIR /SGXDriver
RUN git reset --hard a93785f7d66527aa3bd331ba77b7993f3f9c729b

RUN git clone https://github.com/oscarlab/graphene.git /graphene
WORKDIR /graphene
RUN git reset --hard b37ac75efec0c1183fd42340ce2d3e04dcfb3388
RUN make ISGX_DRIVER_PATH=/SGXDriver/driver/linux/ SGX=1
RUN meson build -Ddirect=disabled -Dsgx=enabled
RUN ninja -C build
RUN ninja -C build install


RUN mkdir -p /graphene/Examples/training
COPY dist_mnist.py /graphene/Examples/training
COPY Makefile /graphene/Examples/training
COPY dist_mnist.manifest.template /graphene/Examples/training
COPY --from=build-premain /premain/build/premain-libos /graphene/Examples/training

# RUN apt install libnss-mdns python3-numpy python3-scipy9

# Note: `RUN cd` does not persist across layers; WORKDIR is needed so the make below runs in the right directory
WORKDIR /graphene/Examples/training
ENV BUILD_TLS yes
RUN --mount=type=secret,id=signingkey,dst=/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem,required=true \
    make clean && make SGX=1 PYTHONVERSION=python3.6 PYTHONDISTHOME=/usr/local/lib/python3.6/dist-packages/

ENTRYPOINT ["graphene-sgx", "/graphene/Examples/training/dist_mnist.py"]

It is built with DOCKER_BUILDKIT=1 docker build -t dist_mnist_manual -f Dockerfile-new --secret id=signingkey,src=enclave-key.pem .

Finally the deployment yaml:

apiVersion: v1
kind: Pod
metadata:
  name: test
  namespace: kubedep
  labels:
    app.kubernetes.io/name: test
    app.kubernetes.io/part-of: test
    app.kubernetes.io/version: v1
    marblerun/inject: enabled
    marblerun/marbletype: test
spec:
  containers:
    - name: web
      image: <acr>/dist_mnist_manual
      ports:
        - name: static-web
          containerPort: 80
          protocol: TCP

The pod fails to start:

~$ kubectl get pods -n kubedep
NAME   READY   STATUS             RESTARTS   AGE
test   0/1     CrashLoopBackOff   3          77s
~$ kubectl logs -n kubedep test
Invalid application path specified (/graphene/Examples/training/dist_mnist.py.manifest.sgx does not exist).
The path should point to application configuration files, so that they can be
found after appending corresponding extensions.
~$ kubectl logs -n kubedep test --previous
Invalid application path specified (/graphene/Examples/training/dist_mnist.py.manifest.sgx does not exist).
The path should point to application configuration files, so that they can be
found after appending corresponding extensions.

Nirusu commented 3 years ago

Well, if you call graphene-sgx /graphene/Examples/training/dist_mnist.py to launch your application, Graphene tries to find the manifest under the same name + ".manifest.sgx".

However, your dist_mnist.manifest.template is missing the .py suffix in the middle. So maybe rename it and make sure your Makefile does something like this:

graphene-sgx-sign --output dist_mnist.py.manifest.sgx --manifest dist_mnist.py.manifest.template --key /graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem
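As a Makefile rule, that could be sketched like this (the key path matches the one used above; treat it as an assumption for your setup, and note the recipe line must be tab-indented):

```make
# Sketch: produce the signed manifest Graphene derives from the entry point name
dist_mnist.py.manifest.sgx: dist_mnist.py.manifest.template
	graphene-sgx-sign \
		--manifest $< \
		--output $@ \
		--key /graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem
```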

Additionally, I believe you cannot directly launch a Python file with graphene-sgx. I do not know what your manifest contains, or whether you already call the MarbleRun premain-libos. Whether you can actually use the Python file directly as the (post-premain) entry point, I have not tested yet; I guess that depends on Graphene. Usually, though, doing this directly without the MarbleRun premain is not supported by Graphene AFAIK.

ratnadeepb commented 3 years ago

Ok. Any suggestions on how I can rewrite the Dockerfile?

Nirusu commented 3 years ago

Just rename the output of your graphene-sgx-sign call in your Makefile to match what's shown as expected in the error. If you don't have one, you should probably add one.

Regarding the entry point / Python binary: not sure, try it out. Graphene has a Python example... probably best to start with that one.

Note that you likely also need to install the Python modules inside Graphene's environment (or pass them through, though I am not sure that would be a good idea).

ratnadeepb commented 3 years ago

From the manifest file:

# MARBLERUN: entrypoint must be premain-libos
libos.entrypoint = "premain-libos"
loader.argv0_override = "dist_mnist.py"
loader.insecure__use_host_env = 1

Nirusu commented 3 years ago

The content is not the issue (so far, at least); it's just that Graphene cannot find the signed manifest file.

You just need to rename the (signed) signature file first, which happens in your Makefile. I don't have it, so I can't tell you exactly what to change... But I noted above what you likely should change (or include in your Makefile, in case you don't have one).

ratnadeepb commented 3 years ago

That didn't work. As a further test, I tried running the Python interpreter by changing the last line of the Dockerfile like so: ENTRYPOINT ["graphene-sgx", "python", "-c \"print('Hello World')\""]. The Makefile I am using is https://github.com/oscarlab/graphene/blob/master/Examples/python/Makefile and I am getting similar errors:

~$ kubectl logs -n kubedep test
Invalid application path specified (python.manifest.sgx does not exist).
The path should point to application configuration files, so that they can be
found after appending corresponding extensions.

Nirusu commented 3 years ago

Well... Graphene cannot find python.manifest.sgx in the current working directory of the Docker environment.

It's probably better to use an absolute path, or to specify WORKDIR before defining ENTRYPOINT.

Then if you do this, make sure WORKDIR actually contains python.manifest.sgx, which should be generated from python.manifest.template after calling graphene-sgx-sign onto it (which the Makefile you linked actually does).
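In Dockerfile terms, that advice amounts to something like this (the directory is illustrative):

```dockerfile
# Sketch: make sure the working directory is the one containing python.manifest.sgx
WORKDIR /graphene/Examples/python
ENTRYPOINT ["graphene-sgx", "python"]
```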

Honestly, these are pretty basic mistakes. Graphene just cannot find the manifest file derived from the name of your entry point.

Just to give you an idea on where Graphene searches for the manifest file:

$ mkdir emptydir && cd emptydir
$ graphene-sgx python
Invalid application path specified (python.manifest.sgx does not exist).
The path should point to application configuration files, so that they can be
found after appending corresponding extensions.

$ touch python.manifest.sgx
$ graphene-sgx python
error: Enclave size not a power of two (an SGX-imposed requirement)
error: Parsing manifest failed
error: load_enclave() failed with error -22

I really recommend going through this step-by-step on a local or virtual machine before throwing it into a Dockerfile. This helps you evaluate whether your application actually works with Graphene, how the folder layout needs to look, what to put into the manifest, what to use as ENTRYPOINT when eventually defining the Dockerfile, etc. Right now you are tweaking too many things at once without even getting anything to launch. It might be painful to go this way, so please do it step-by-step as I listed above.

If you actually get something to launch with Graphene, whether it fails or not, that would be a step forward in understanding what you are doing and getting your project running. So please, don't tweak too many things at once :)

ratnadeepb commented 3 years ago

Seems like the issue was with how I had written the Graphene manifest template and, overall, how I was running things. I corrected the manifest. But now I am building for Python 3 and defining the entrypoint as:

WORKDIR /graphene/Examples/training
ENTRYPOINT ["graphene-sgx", "python", "dist_mnist.py"]

The infrastructure is a Kubernetes cluster on SGX-enabled servers on Azure. Trying to run sudo graphene-sgx python dist_mnist.py throws:

$ sudo graphene-sgx python dist_mnist.py
error: ECREATE failed in enclave creation ioctl (errno = -22)
error: Creating enclave failed: -22
error: load_enclave() failed with error -22

Deploying the Docker image:

$ kubectl logs -n kubedep test
error: Cannot open device /dev/sgx/enclave. Please make sure the Intel SGX kernel module is loaded.
error: load_enclave() failed with error -2

ratnadeepb commented 3 years ago

$ ~/graphene/Pal/src/host/Linux-SGX/tools/is-sgx-available/is_sgx_available
SGX supported by CPU: true
SGX1 (ECREATE, EENTER, ...): true
SGX2 (EAUG, EACCEPT, EMODPR, ...): false
Flexible Launch Control (IA32_SGXPUBKEYHASH{0..3} MSRs): true
SGX extensions for virtualizers (EINCVIRTCHILD, EDECVIRTCHILD, ESETCONTEXT): false
Extensions for concurrent memory management (ETRACKC, ELDBC, ELDUC, ERDINFO): false
CET enclave attributes support (See Table 37-5 in the SDM): false
Key separation and sharing (KSS) support (CONFIGID, CONFIGSVN, ISVEXTPRODID, ISVFAMILYID report fields): false
Max enclave size (32-bit): 0x80000000
Max enclave size (64-bit): 0x1000000000
EPC size: 0x3800000
SGX driver loaded: true
AESMD installed: true
SGX PSW/libsgx installed: false

daniel-weisse commented 3 years ago

The infrastructure is a Kubernetes cluster service on SGX enabled servers on Azure. Trying to run sudo graphene-sgx python dist_mnist.py throws:

$ sudo graphene-sgx python dist_mnist.py
error: ECREATE failed in enclave creation ioctl (errno = -22)
error: Creating enclave failed: -22
error: load_enclave() failed with error -22

I assume this step is run in a Docker container? If so, was the image built on an SGX-capable machine? In my experience this issue occurs when you try to build a Docker image with Graphene-SGX on a non-SGX-capable machine.

Deploying the Docker image:

$ kubectl logs -n kubedep test
error: Cannot open device /dev/sgx/enclave. Please make sure the Intel SGX kernel module is loaded.
error: load_enclave() failed with error -2

Does your cluster have an SGX device plugin installed? If it does, does your pod have the necessary resource request to make use of the plugin? E.g. if you are using the Intel SGX Plugin your pod will need something similar to:

resources:
  limits:
    sgx.intel.com/enclave: 1
    sgx.intel.com/epc: 10Mi
    sgx.intel.com/provision: 1

ratnadeepb commented 3 years ago

I assume this step is run in a docker container?
If so, was the image built on an SGX capable machine?
In my experience this issue occurs when you try to build a docker image with Graphene-SGX on a non SGX capable machine.

This step was on one of the SGX enabled AKS nodes. Built on the same one too.

I am using the Azure ones instead of Intel: https://github.com/Azure/aks-engine/blob/master/docs/topics/sgx.md#deploying-the-sgx-device-plugin.

The kernel on the node is 5.4.

daniel-weisse commented 3 years ago

I am using the Azure ones instead of Intel: https://github.com/Azure/aks-engine/blob/master/docs/topics/sgx.md#deploying-the-sgx-device-plugin.

In that case your pods should request EPC using Azure's plugin:

apiVersion: v1
kind: Pod
metadata:
  name: <pod_name>
spec:
  containers:
    - name: <container_name>
      image: <your_image>
      resources:
        limits:
          kubernetes.azure.com/sgx_epc_mem_in_MiB: 10

As for Graphene not working, have you tried running your code outside of the Docker environment? I.e., have you installed Graphene on your machine directly and managed to get any of their examples, or your code, running? If not, something is wrong with your installation or setup, and I would suggest raising an issue over at Graphene directly, as they are much more experienced with the project and can probably provide much better help.

ratnadeepb commented 3 years ago

good point. thanks for that. I should have thought of that myself. anyhow, seems the issue was with the manifest file. running the graphene examples pointed me in the right direction. thanks for all the help @daniel-weisse and @Nirusu!

rguikers commented 2 years ago

@ratnadeepb , can you share a working manifest for your case. I'm experiencing the same problems.. Thanks!

ratnadeepb commented 2 years ago

hey @rguikers, my apologies for the late reply. I was doing this during an internship over the summer. I don't have access to that anymore. I am so sorry about that.

rguikers commented 2 years ago

hey @rguikers, my apologies for the late reply. I was doing this during an internship over the summer. I don't have access to that anymore. I am so sorry about that.

No problem, thanks..