OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

InstantOn with Azure #23516

Open tjwatson opened 1 year ago

tjwatson commented 1 year ago

There are two services we are targeting: Azure Kubernetes Service (AKS) and Azure Container Apps (ACA).

jgawor commented 1 year ago

AKS

I created a cluster with the default selections, except that I chose Kubernetes version v1.24 to ensure the containerd container engine is used.

Node information:

$ kubectl get node -o wide
NAME                                STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-agentpool-11140021-vmss000000   Ready    agent   10m   v1.24.6   10.224.0.4    <none>        Ubuntu 18.04.6 LTS   5.4.0-1094-azure   containerd://1.6.4+azure-4

Deployed deployment.yaml and noticed the following restore failure:

$ kubectl logs open-liberty-instanton-64cbb855db-j4xp8
CRIU needs to have the CAP_SYS_ADMIN or the CAP_CHECKPOINT_RESTORE capability: 
setcap cap_checkpoint_restore+eip criu
(00.000000) Effective capability 40 missing
(00.000000) Effective capability 21 missing
CWWKE0957I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
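
The two numbers in the CRIU output above are Linux capability bit positions: 21 is CAP_SYS_ADMIN and 40 is CAP_CHECKPOINT_RESTORE. As a rough sketch, the effective capability mask from the `CapEff` line of `/proc/self/status` can be checked for those bits from a shell inside the container. The helper names `has_cap` and `report_restore_caps` are hypothetical and are not part of CRIU or Open Liberty.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit BIT is set in the hex capability mask MASK.
# Hypothetical helper for illustration; MASK is the value from a CapEff line.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Report on the two capabilities CRIU complained about.
# Bit numbers are from linux/capability.h.
report_restore_caps() {
    mask=$1
    for entry in 21:CAP_SYS_ADMIN 40:CAP_CHECKPOINT_RESTORE; do
        bit=${entry%%:*}
        name=${entry#*:}
        if has_cap "$mask" "$bit"; then
            echo "$name present"
        else
            echo "$name missing"
        fi
    done
}

# Usage inside the container, assuming /proc/self/status is readable:
# report_restore_caps "$(awk '/^CapEff:/ {print $2}' /proc/self/status)"
```

If both capabilities are reported missing, the CRIU messages above are what one would expect to see.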

Status: The nodes run kernel 5.4, which is too old for unprivileged restore. Kernel 5.9 or higher is needed, because that is the release that introduced CAP_CHECKPOINT_RESTORE. Privileged restore does work.
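
A minimal sketch of checking the kernel requirement from a shell, assuming the release string has the usual `major.minor...` form shown in the KERNEL-VERSION column above. The helper name `kernel_at_least` is hypothetical. Distribution kernels may backport the capability, so treat this as a heuristic rather than a definitive test.

```shell
#!/bin/sh
# kernel_at_least VERSION MAJOR MINOR: succeed if the kernel release string
# VERSION (as printed by `uname -r`) is at least MAJOR.MINOR.
# Hypothetical helper for illustration.
kernel_at_least() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}  # leading digits of the remainder
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Usage on a node or in a container:
# kernel_at_least "$(uname -r)" 5 9 && echo "kernel is 5.9 or newer"
```

With the node above, `kernel_at_least 5.4.0-1094-azure 5 9` fails, which matches the observed restore failure.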

ACA

Status: The documentation for configuring the containers to deploy does not mention passing any additional Linux capabilities, so there is no way (that I can find) to make unprivileged restore work. Privileged containers are also not supported by ACA.

The restore operation failed with the following in ACA (as expected):

2022-12-07T05:17:48.595178852Z /opt/ol/wlp/bin/server: line 1407: /usr/sbin/criu: Operation not permitted
2022-12-07T05:17:50.185483507Z CWWKE0957I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.

vijaysun-omr commented 1 year ago

Do we expect ACA to add the ability to pass in Linux capabilities? Otherwise, it seems even an OS upgrade won't be enough in that environment. Maybe you have already asked them that question.

jgawor commented 1 year ago

@vijaysun-omr We are trying to engage with the ACA folks to add that ability.