edgelesssys / tf-training-sgx


[Marblerun] Replication of the reported bug #1

Open veenasai2 opened 2 years ago

veenasai2 commented 2 years ago

Hi,

I am creating this issue to communicate with the relevant team at Edgeless on behalf of Intel (the Gramine team).

I was trying to reproduce this issue. Here are a few observations:

  1. d-paddles-training-chief-0 and d-paddles-training-worker-0 pods terminate every time.
  2. Only the d-paddles-training-ps-0 pod does not terminate, and its STATUS changes from "Running" to "Error".
  3. The d-paddles-training-ps-0 pod logs say "[PreMain] environment variable not set: EDG_MARBLE_TYPE". After providing the environment variable it proceeds a little further, but the STATUS still shows Error.

So, I wanted to check whether this is the same issue you have reported.

I have attached the logs for the d-paddles-training-ps-0 pod here.

Thanks, Veena d-paddles-training-ps-0_logs_after_env_var.txt d-paddles-training-ps-0_logs_before_env_var.txt

daniel-weisse commented 2 years ago

Hi @veenasai2, there seems to be something missing in your deployment. Is the MarbleRun Coordinator deployed to the cluster? The complete workflow for a running cluster should look like the following:

# Install MarbleRun
marblerun install
# Wait for MarbleRun to install
marblerun check
# Port-forward the client api to localhost
kubectl -n marblerun port-forward svc/coordinator-client-api 4433:4433 --address localhost >/dev/null &
# Upload the manifest
marblerun manifest set manifest.json localhost:4433
# Deploy the application
kubectl create namespace threads
kubectl apply -f deployment.yaml

The Pod should start in the threads namespace and successfully finish the premain process.
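If it helps, a quick way to verify this is sketched below; the pod name is a placeholder, so substitute whatever name your deployment generates:

# List the pods in the threads namespace and check their STATUS
kubectl -n threads get pods
# Inspect the premain output of a specific pod
kubectl -n threads logs <pod-name>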

veenasai2 commented 2 years ago

Hi @daniel-weisse

Thanks for the reply. Earlier, I was following the steps mentioned here: https://github.com/edgelesssys/tf-training-sgx/blob/master/marblerun/README.md and directly began with "kubectl apply -f deployment.yaml" in the AKS cluster.

Now I followed all the steps that you listed above.

Here is the corresponding output:

1. marblerun install

Setting up MarbleRun Webhook... Done
MarbleRun installed successfully

2. marblerun check

marble-injector pods ready: 1/1
marblerun-coordinator pods ready: 1/1

3. kubectl -n marblerun port-forward svc/coordinator-client-api 4433:4433 --address localhost >/dev/null &

The output was the process number of the backgrounded port-forward.

4. marblerun manifest set manifest.json localhost:4433

No era config file specified, getting config from https://github.com/edgelesssys/marblerun/releases/download/v0.5.1/coordinator-era.json
Got latest config
2022-04-04T17:23:24+0530.703512Z [(H)ERROR] tid(0x7f538ce39700) | sgxquoteprovider: libdcap_quoteprov.so libdcap_quoteprov.so: cannot open shared object file: No such file or directory
[/__w/edgelessrt/edgelessrt/build/3rdparty/openenclave/openenclave-src/host/sgx/linux/sgxquoteproviderloader.c:oe_load_quote_provider:81]
2022-04-04T17:23:24+0530.703553Z [(H)ERROR] tid(0x7f538ce39700) | oe_initialize_quote_provider failed (oe_result_t=OE_QUOTE_PROVIDER_LOAD_ERROR) [/__w/edgelessrt/edgelessrt/build/3rdparty/openenclave/openenclave-src/host/sgx/sgxquoteprovider.c:oe_initialize_quote_provider:48]
2022-04-04T17:23:24+0530.703577Z [(H)ERROR] tid(0x7f538ce39700) | :OE_QUOTE_PROVIDER_LOAD_ERROR [/__w/edgelessrt/edgelessrt/build/3rdparty/openenclave/openenclave-src/host/sgx/hostverify_report.c:oe_verify_remote_report:32]
Error: OE_QUOTE_PROVIDER_LOAD_ERROR

As you can see, I got this error in Step 4. I have checked that the sgx-quote-helper plugin is installed in the cluster and running under the kube-system namespace. I am assuming this is the manifest.json (https://github.com/edgelesssys/tf-training-sgx/blob/master/marblerun/manifest.json) that is being uploaded in Step 4.

Is there any particular dependency you think I missed installing?

Thanks, Veena

daniel-weisse commented 2 years ago

You are missing a quote provider library on your system. It is used to verify the attestation sent by the MarbleRun coordinator. The az-dcap-client package includes a quote provider library for Azure systems. Install it by running the following:

sudo apt-key adv --fetch-keys https://packages.microsoft.com/keys/microsoft.asc
sudo apt-add-repository 'https://packages.microsoft.com/ubuntu/20.04/prod main'
sudo apt update && sudo apt install az-dcap-client
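As an optional sanity check, the quote provider library should now be resolvable by the dynamic linker:

# az-dcap-client installs libdcap_quoteprov.so; it should show up in the linker cache
ldconfig -p | grep dcap_quoteprov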

Alternatively you can skip verification by setting the --insecure flag for the marblerun manifest set command.
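For example, assuming the same manifest and endpoint as above (note that this skips attestation of the Coordinator, so it should only be used for debugging):

marblerun manifest set manifest.json localhost:4433 --insecure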

veenasai2 commented 2 years ago

Hi @daniel-weisse,

Thanks, I am able to replicate the issue now.

bodzhang commented 2 years ago

@veenasai2, do you have any insight into why the threads seem to be idle or keep calling futex and getting ETIMEDOUT?

veenasai2 commented 2 years ago

Hi @bodzhang, currently I am seeing two suspicious messages in the logs: "Unsupported system call 435" and "return from shim_futex(...) = -110".

I am working on the issue, and will get back by next week with more details.

veenasai2 commented 2 years ago

Hi @daniel-weisse,

Based on my understanding, MarbleRun runs inside a Kubernetes cluster. Still, if there is any documentation I can refer to for installing MarbleRun on my local server and replicating the issue without Kubernetes (just the way we have the Gramine CI-Examples), that would be even more helpful for debugging the issue.

Thanks, Veena

daniel-weisse commented 2 years ago

Hi @veenasai2, the easiest way to run the Coordinator locally is using Docker. If you are on an Azure SGX VM, you can simply run:

docker run -it --rm \
   --network host \
   --device /dev/sgx_enclave \
   --device /dev/sgx_provision \
   -v /dev/sgx:/dev/sgx \
   ghcr.io/edgelesssys/coordinator

If your server is NOT running in Azure, you will need to configure the container to use Intel's quote provider library and mount your PCCS configuration to the container:

docker run -it --rm \
   --network host \
   --device /dev/sgx_enclave \
   --device /dev/sgx_provision \
   -v /dev/sgx:/dev/sgx \
   -v /etc/sgx_default_qcnl.conf:/etc/sgx_default_qcnl.conf \
   -e DCAP_LIBRARY=intel \
   ghcr.io/edgelesssys/coordinator

This will start the MarbleRun Coordinator in Docker. The Coordinator will be reachable by the applications, curl, or the MarbleRun CLI on localhost:4433.
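As a rough sanity check (assuming the client API's /status endpoint described in the MarbleRun docs), you can confirm the Coordinator is reachable before deploying anything else:

# -k skips TLS verification because the Coordinator uses a self-signed certificate
curl -k https://localhost:4433/status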

You can also directly run the Coordinator binary. However, this will require you to install additional tooling. Details can be found here: https://docs.edgeless.systems/marblerun/#/deployment/standalone

veenasai2 commented 2 years ago

Hi @daniel-weisse,

Thanks for sharing these steps. Using them, I am currently running the Coordinator in an Azure VM (Standard_DC8s_v3).

Also, I started the python-threads workload in the same VM using the command below:

docker run -e SGX_AESM_ADDR=1 \
   -e EDG_MARBLE_UUID_FILE=uuid \
   -e EDG_MARBLE_TYPE=Threads \
   -it --rm --network host \
   --device /dev/sgx/enclave \
   --device /dev/sgx/provision \
   -v /dev/sgx:/dev/sgx \
   -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
   ghcr.io/edgelesssys/threads-test:latest

Here, I get the error "FailedPrecondition desc = cannot accept marbles in current state".

I have attached both the Coordinator logs and the python-threads logs. Any idea what could possibly be going wrong here?

Coordinator_logs.txt python-threads_logs.txt

Thanks.

daniel-weisse commented 2 years ago

Hi @veenasai2,

You will need to set a manifest for MarbleRun.

marblerun manifest set ./python-threads/manifest.json localhost:4433

I noticed the MarbleRun setup procedure (how to install MarbleRun, set the manifest, etc.) was missing from the READMEs; that should be fixed now.

veenasai2 commented 2 years ago

Thanks @daniel-weisse, I am able to proceed now.

Also, I am now seeing the same deadlock logs for the python-threads example in the Azure VM as well.

I will try these steps on our local systems too, just to check whether the error is related to Gramine or something specific to Azure.

One quick check here: since this repo was created two months ago, have you been seeing these deadlock messages only this year, or did you also see similar behavior with earlier commits (i.e., last year's commits)?

Thanks

daniel-weisse commented 2 years ago

I have only seen this behavior this year, but I also didn't test any similar code prior to this. The issue should be reproducible with the binary releases of Gramine in versions v1.0 and v1.1.

veenasai2 commented 2 years ago

Thanks @daniel-weisse.

veenasai2 commented 2 years ago

Hi @daniel-weisse,

I was trying to build a Docker image for the python-threads example (in an Azure VM) using the steps from here.

While doing docker run for the image built above, I am facing the errors shown below, whereas the same docker run command works fine for the ghcr.io/edgelesssys/threads-test:latest image.

error: Initializing enclave failed: -1 error: load_enclave() failed with error -1

Is there any step you think is missing in the README? Currently, only two steps are mentioned for Docker image creation.

Thanks

daniel-weisse commented 2 years ago

What command are you using to run the container? I tried rebuilding the image on an Azure Standard_DC8s_v3 VM, and both the one uploaded to the container registry and the one I have locally are able to initialize the enclave successfully.

I'm using the following to start the container:

docker run -it --rm \
    --network host \
    --device /dev/sgx_enclave \
    --device /dev/sgx_provision \
    -v /dev/sgx:/dev/sgx \
    -v /var/run/aesmd:/var/run/aesmd \
    ${IMAGE_NAME_HERE}

veenasai2 commented 2 years ago

Hi @daniel-weisse ,

I cloned the latest copy of the tf-training-sgx repo (on an Azure Standard_DC8s_v3 VM) and used the commands below to build the image:

cd python-threads
openssl genrsa -out signing_key.pem 3072
DOCKER_BUILDKIT=1 docker build --secret id=signingkey,src=signing_key.pem -t ${IMAGE_NAME} .

And the following command to run the image:

docker run -e SGX_AESM_ADDR=1 -e EDG_MARBLE_UUID_FILE=uuid -e EDG_MARBLE_TYPE=Threads \
  -it --rm --network host \
  --device /dev/sgx/enclave \
  --device /dev/sgx/provision \
  -v /dev/sgx:/dev/sgx \
  -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
  ${IMAGE_NAME}

I am facing the "Initializing enclave failed" error only with the image that I built using the above steps. If I run the one uploaded to the container registry, I don't face any issue.

Thanks

daniel-weisse commented 2 years ago

I regenerated my private key and was able to reproduce this.

I also found the source of the problem and a solution. From the Gramine docs: "SGX requires RSA 3072 keys with public exponent equal to 3."

That specification got lost somewhere when writing the READMEs for this repo. The command should be:

openssl genrsa -3 -out signing_key.pem 3072
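To double-check that the generated key has the required exponent, the following should print "publicExponent: 3 (0x3)":

# Inspect the key parameters and filter for the public exponent
openssl rsa -in signing_key.pem -text -noout | grep publicExponent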
veenasai2 commented 2 years ago

Hi @daniel-weisse,

Thanks for the above steps; they are helping me proceed further.

But now I am getting an error "rpc error: code = Unauthenticated desc = invalid quote: PackageProperties not compliant:" while running the python-threads image.

Steps to Reproduce:

  1. Download fresh copy of tf-training-sgx in Azure VM

  2. cd python-threads

  3. openssl genrsa -3 -out signing_key.pem 3072

  4. DOCKER_BUILDKIT=1 docker build --secret id=signingkey,src=signing_key.pem -t localhost/threads .

  5. Start the Coordinator using the command below: docker run -it --rm --network host --device /dev/sgx/enclave --device /dev/sgx/provision -v /dev/sgx:/dev/sgx ghcr.io/edgelesssys/coordinator

  6. marblerun manifest set manifest.json localhost:4433

  7. docker run -e SGX_AESM_ADDR=1 -e EDG_MARBLE_UUID_FILE=uuid -e EDG_MARBLE_TYPE=Threads -it --rm --network host --device /dev/sgx/enclave --device /dev/sgx/provision -v /dev/sgx:/dev/sgx -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket localhost/threads

  8. Got error "rpc error: code = Unauthenticated desc = invalid quote: PackageProperties not compliant:"

Please find the error logs attached here: python-threads_logs.txt

Please note that I am getting this error only while running the self-built image; with the registry image I am not facing any issues (using the same docker run command).

Thanks

daniel-weisse commented 2 years ago

Since you are probably not using the same signing key I used to create the images, you will need to update Packages.threads.SignerID in manifest.json. SignerID is the MRSIGNER value of your signing key/enclave.

You can get it, for example, using gramine-sgx-get-token:

$ docker run -it --rm --entrypoint bash localhost/threads
$ gramine-sgx-get-token -s python.sig -o /dev/null
Attributes:
    mr_enclave:  9f5533109fda2570ae38c9a3b92a5e1f59045965d220ce78471fe9ebf8437f8a
    mr_signer:   43361affedeb75affee9baec7e054a5e14883213e5a121b67d74a0e12e9d2b7a  # <----- This is the value we want as SignerID
    isv_prod_id: 3
    isv_svn:     1
    attr.flags:  0000000000000004
    attr.xfrm:   00000000000000e7
    mask.flags:  ffffffffffffffff
    mask.xfrm:   fffffffffff9ff1b
    misc_select: 00000000
    misc_mask:   ffffffff
    modulus:     7d5cce28920bec0d8200ef4d54d25bec...
    exponent:    3
    signature:   c676298a54c5940bd8660c89df530654...
    date:        2022-04-01
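For reference, one way to plug the value into the manifest is sketched below, assuming jq is available; the SignerID value is a placeholder for the mr_signer printed for your own key:

# Hypothetical helper: write your MRSIGNER value into Packages.threads.SignerID
jq '.Packages.threads.SignerID = "<your mr_signer value>"' manifest.json > manifest.updated.json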
veenasai2 commented 2 years ago

Hi @daniel-weisse,

Just to update you, we have created a GSC (Gramine Shielded Containers) image for the python-threads example. The image gives output similar to the ghcr.io/edgelesssys/threads-test:latest image. However, we are not seeing the stuck-in-termination-state issue in the AKS environment with the GSC image.

So, if it can unblock you, we can share the GSC image with you. Alternatively, you can also build your own using this link.

Thanks

veenasai2 commented 2 years ago

Hi @daniel-weisse,

This is the GSC image for python-threads: docker.io/intelnonprodmages/gsc-python-threads-gramine-v1.1-may17

You can put this image name at https://github.com/edgelesssys/tf-training-sgx/blob/master/python-threads/deployment.yaml#L13

Also, the SignerID (MRSIGNER value) for this image is "a29e6967afca7c54811c99d742fd4e60d59501630e8b0ae84c36b6faa4793ca2".

Note: This image is just for testing purposes, with all logs enabled. Please do not use it in production.

I have tested this image in the AKS environment and did not face the deadlock issue. One additional note: while deploying this image, I added "cpu: 6" in the resources section.

Thanks

daniel-weisse commented 2 years ago

Thanks! The pods are terminating as expected. Will try to build my own image and see if this works with a more complex example.

veenasai2 commented 2 years ago

Hi @daniel-weisse, thanks for confirming. Glad it worked :)