SSU-DCN / podmigration-operator


Failed to migrate pod after installation #6

Closed: fjibj closed this issue 1 year ago

fjibj commented 2 years ago

Hi, I followed all the steps in init-cluster-containerd-CRIU.md, rebuilt kubeadm, kubelet, and kubectl-checkpoint/-migrate, and deployed Kubernetes and podmigration-operator on 3 nodes: k8s-master01, k8s-node01, k8s-node02. All nodes run CentOS Linux release 7.4.1708 with kernel 5.4.179.

When I run the example in podmigration-operator/config/samples/migration-example,

kubectl apply -f 1.yaml # ( kubernetes.io/hostname: k8s-node02)

kubectl get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
simple   1/1     Running   0          11h   10.244.1.2   k8s-node02

curl --request POST 'localhost:5000/Podmigrations' --header 'Content-Type: application/json' --data '{"name":"test01", "replicas":1, "action":"live-migration", "sourcePod":"simple", "destHost":"k8s-node01"}'

{ "name": "test01", "destHost": "k8s-node01", "replicas": 1, "selector": { "matchLabels": { "podmig": "dcn" } }, "action": "live-migration", "snapshotPath": "", "sourcePod": "simple", "template": { "metadata": { "creationTimestamp": null }, "spec": { "containers": null } }, "status": { "state": "", "currentRevision": "", "activePod": "" } }

api-server output:

&{test1 k8s-node01 1 &LabelSelector{MatchLabels:map[string]string{podmig: dcn,},MatchExpressions:[]LabelSelectorRequirement{},} live-migration simple {{ 0 0001-01-01 00:00:00 +0000 UTC map[] map[] [] [] []} {[] [] [] [] map[] false false false nil [] nil [] [] nil [] map[] [] }} } simple

But no migration happened; the simple pod still runs on k8s-node02.

When I use the kubectl checkpoint command, an error occurs:

kubectl checkpoint simple /var/lib/kubelet/migration/simple

panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11b909a]

goroutine 1 [running]:
k8s.io/client-go/kubernetes.NewForConfig(0x0, 0x0, 0x144c17d, 0x58)
        /mnt/disk01/fangjin/projects/containerd/kubernetes/staging/src/k8s.io/client-go/kubernetes/clientset.go:371 +0x3a
main.(*MigrateArgs).Run(0xc000383290, 0xc00037ea00, 0xc00037c7e0)
        /mnt/disk01/fangjin/projects/containerd/podmigration-operator/kubectl-plugin/checkpoint-command/checkpoint_command.go:88 +0x73
main.NewPluginCmd.func1(0xc00037ea00, 0xc00037c7e0, 0x2, 0x2)
        /mnt/disk01/fangjin/projects/containerd/podmigration-operator/kubectl-plugin/checkpoint-command/checkpoint_command.go:61 +0xd8
github.com/spf13/cobra.(*Command).execute(0xc00037ea00, 0xc000114160, 0x2, 0x2, 0xc00037ea00, 0xc000114160)
        /root/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00037ea00, 0x0, 0xffffffff, 0xc000100058)
        /root/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
        /root/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
main.main()
        /mnt/disk01/fangjin/projects/containerd/podmigration-operator/kubectl-plugin/checkpoint-command/checkpoint_command.go:130 +0x2a

kubectl migrate simple k8s-node01

response Status: 200 OK
{
  "name": "simple-migration-controller-71",
  "destHost": "k8s-node01",
  "replicas": 0,
  "selector": { "matchLabels": { "podmig": "dcn" } },
  "action": "live-migration",
  "snapshotPath": "",
  "sourcePod": "simple",
  "template": { "metadata": { "creationTimestamp": null }, "spec": { "containers": null } },
  "status": { "state": "", "currentRevision": "", "activePod": "" }
}

But again nothing really happened:

kubectl get pods -o wide

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
simple   1/1     Running   0          11h   10.244.1.2   k8s-node02

I need your help, thanks!

vutuong commented 2 years ago

Did you run the podmigration-controller with the $ make run command? Could you please send the podmigration-controller log? In addition, could you please check whether the folder /var/lib/kubelet/migration is configured as an NFS shared folder across the 3 nodes, as in Step 9?
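
(For reference, a minimal sketch of the kind of NFS sharing Step 9 asks for; the server address and export options below are placeholders, not the guide's exact values.)

# On the NFS server node: export the migration folder
# /etc/exports
/var/lib/kubelet/migration *(rw,sync,no_root_squash)

# reload the exports table
exportfs -ra

# On each of the other nodes: mount the share (replace <nfs-server-ip>)
mount -t nfs <nfs-server-ip>:/var/lib/kubelet/migration /var/lib/kubelet/migration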

fjibj commented 2 years ago

Thanks for your answer.

  1. NFS is good. Every node can read and write the NFS share.

    cd /var/lib/kubelet/migration/

    [root@k8s-node02 migration]# ll
    total 4
    -rw-r--r-- 1 root root 2 Feb 16 08:59 aaa.txt
    drw------- 2 nfsnobody nfsnobody 6 Feb 16 13:10 kkk
    drw------- 2 nfsnobody nfsnobody 6 Feb 16 14:09 simple

  2. I changed checkpoint_command.go (see the sketch below):

     //config, _ := clientcmd.BuildConfigFromFlags("", "/home/dcn/fault-detection/docs/anisble-playbook/kubernetes-the-hard-way/admin.kubeconfig")
     config, _ := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")

kubectl checkpoint simple /var/lib/kubelet/migration/simple

It hangs for 2 hours... It only creates the /var/lib/kubelet/migration/simple directory, and there is nothing in it.
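
(For reference, a minimal sketch, not the plugin's actual code, of how the kubeconfig loading in checkpoint_command.go can surface the error instead of discarding it. Passing a nil config into kubernetes.NewForConfig is what produced the nil-pointer panic shown earlier.)

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClientset builds a clientset from an explicit kubeconfig path and
// returns an error instead of silently passing a nil config onwards.
func newClientset() (*kubernetes.Clientset, error) {
	kubeconfig := os.Getenv("KUBECONFIG")
	if kubeconfig == "" {
		home, err := os.UserHomeDir()
		if err != nil {
			return nil, err
		}
		kubeconfig = filepath.Join(home, ".kube", "config")
	}
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		// with `config, _ :=` this error is lost and NewForConfig(nil) panics
		return nil, fmt.Errorf("load kubeconfig %s: %w", kubeconfig, err)
	}
	return kubernetes.NewForConfig(config)
}

func main() {
	if _, err := newClientset(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}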

  3. The podmigration-controller log:

    make run

    which: no controller-gen in (/mnt/disk01/kube/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/var/lib/snapd/snap/bin:/root/bin:/usr/lib/golang/bin:/root/bin)
    go: creating new go.mod: module tmp
    go get: added sigs.k8s.io/controller-tools v0.2.5
    /root/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
    go fmt ./...
    go vet ./...
    /root/go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
    go run ./main.go
    2022-02-16T13:10:32.144+0800 INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8081"}
    2022-02-16T13:10:32.145+0800 INFO setup starting manager
    2022-02-16T13:10:32.245+0800 INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
    2022-02-16T13:10:32.245+0800 INFO controller Starting EventSource {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "source": "kind source: /, Kind="}
    2022-02-16T13:10:32.346+0800 INFO controller Starting Controller {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration"}
    2022-02-16T13:10:32.346+0800 INFO controller Starting workers {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "worker count": 1}
    2022-02-16T13:10:32.347+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "print test": {"replicas":1,"sourcePod":"simple","destHost":"k8s-node01","selector":{"matchLabels":{"podmig":"dcn"}},"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":[]}},"action":"live-migration"}}
    2022-02-16T13:10:32.349+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "annotations ": ""}
    2022-02-16T13:10:32.349+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "number of existing pod ": 0}
    2022-02-16T13:10:32.349+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "desired pod ": {"namespace": "default", "name": ""}}
    2022-02-16T13:10:32.349+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "number of desired pod ": 1}
    2022-02-16T13:10:32.350+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "number of actual running pod ": 0}
    2022-02-16T13:10:32.380+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "Live-migration": "Step 1 - Check source pod is exist or not - completed"}
    2022-02-16T13:10:32.380+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "sourcePod ok ": {"apiVersion": "v1", "kind": "Pod", "namespace": "default", "name": "simple"}}
    2022-02-16T13:10:32.380+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "sourcePod status ": "Running"}
    2022-02-16T13:10:32.387+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "Live-migration": "Step 2 - checkpoint source Pod - completed"}
    2022-02-16T13:10:32.387+0800 INFO controllers.Podmigration {"podmigration": "default/test1", "live-migration pod": "count"}

The log only prints the above when I first send the "curl --request POST 'localhost:5000/Podmigrations'..." request; nothing else appears, even if I exec the curl command again.

vutuong commented 2 years ago

Well, it sounds like there is a problem with CRIU running on CentOS 7. My setup was running on Ubuntu 18.04, so maybe you need to check whether CRIU works well on CentOS. Could you please try to install a non-CentOS CRIU as in https://github.com/checkpoint-restore/criu/issues/559?

fjibj commented 2 years ago

I installed a non-CentOS CRIU on all 3 nodes, but it does not look good.

criu check --all

Warn  (criu/cr-check.c:861): Dirty tracking is OFF. Memory snapshot will not work.
Warn  (criu/cr-check.c:1279): clone3() with set_tid not supported
Error (criu/cr-check.c:1321): Time namespaces are not supported
Error (criu/cr-check.c:1331): IFLA_NEW_IFINDEX isn't supported
Warn  (criu/cr-check.c:1353): Pidfd store requires pidfd_getfd syscall which is not supported
Warn  (criu/cr-check.c:1374): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.

And kubectl checkpoint still does not work... How can I get more logs from the checkpoint command or CRIU?

vutuong commented 2 years ago

You can check the kubelet log on the source worker node (in your case, k8s-node02) by using journalctl -xfu kubelet. This log should give information about the checkpoint process. Anyway, please try to swap the source node and destination node; I mean, could you please check whether you can migrate from node 1 to node 2?

fjibj commented 2 years ago

Thank you very much! It was exactly a runc problem: I had a mistake in runtime_engine in /etc/containerd/config.toml. I changed it, and kubectl checkpoint succeeded.

But there are new questions:

  1. After the checkpoint command, I found the state of pod/simple is Pending. Then I exec the migrate command:

     [root@k8s-master01 ~]# kubectl migrate simp k8s-node01
     response Status: 400 Bad Request
     { "title": "Bad Request", "details": "Could not find sourcePod for migration" }

     Why can it not find sourcePod? How do I restore or migrate a checkpointed pod? I also exec the command to migrate another pod, simple2, which was not checkpointed, and got the same "not find sourcePod" response.

  2. But "curl --request POST 'localhost:5000/Podmigrations' ..." looks good (not real),

    kubectl get pod -o wide

     NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
     simple2 1/1 Running 0 43m 10.244.1.5 k8s-node02
     simple2-migration-39 0/1 ContainerCreating 0 38m k8s-node01

The controller log:

2022-02-17T17:28:37.582+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "print test": {"replicas":1,"sourcePod":"simple2","destHost":"k8s-node01","selector":{"matchLabels":{"podmig":"dcn"}},"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":[]}},"action":"live-migration"}}
2022-02-17T17:28:37.583+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "annotations ": ""}
2022-02-17T17:28:37.583+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "number of existing pod ": 0}
2022-02-17T17:28:37.583+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "desired pod ": {"namespace": "default", "name": ""}}
2022-02-17T17:28:37.583+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "number of desired pod ": 1}
2022-02-17T17:28:37.583+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "number of actual running pod ": 0}
2022-02-17T17:28:37.618+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "Live-migration": "Step 1 - Check source pod is exist or not - completed"}
2022-02-17T17:28:37.618+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "sourcePod ok ": {"apiVersion": "v1", "kind": "Pod", "namespace": "default", "name": "simple2"}}
2022-02-17T17:28:37.618+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "sourcePod status ": "Running"}
2022-02-17T17:28:37.625+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "Live-migration": "Step 2 - checkpoint source Pod - completed"}
2022-02-17T17:28:37.625+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "live-migration pod": "count"}
2022-02-17T17:28:38.027+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "Live-migration": "checkpointPath/var/lib/kubelet/migration/kkk/simple2"}
2022-02-17T17:28:38.027+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "Live-migration": "Step 3 - Wait until checkpoint info are created - completed"}
2022-02-17T17:28:38.039+0800 INFO controllers.Podmigration {"podmigration": "default/test3", "Live-migration": "Step 4 - Restore destPod from sourcePod's checkpointed info - completed"}

journalctl -xfu kubelet

Feb 17 17:31:25 k8s-node01 kubelet[2973]: E0217 17:31:25.766840 2973 kuberuntime_manager.go:732] createPodSandbox for pod "simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/26c4468a7316906541f9f23b153b03bf2bea56f81caabae508db72c05ec69f3b/log.json: no such file or directory): fork/exec /etc/containerd/config.toml: permission denied: unknown
Feb 17 17:31:25 k8s-node01 kubelet[2973]: E0217 17:31:25.766906 2973 pod_workers.go:191] Error syncing pod db87e73f-c6b5-4108-b256-e2ae276f0fbc ("simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)"), skipping: failed to "CreatePodSandbox" for "simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)" with CreatePodSandboxError: "CreatePodSandbox for pod \"simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)\" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/26c4468a7316906541f9f23b153b03bf2bea56f81caabae508db72c05ec69f3b/log.json: no such file or directory): fork/exec /etc/containerd/config.toml: permission denied: unknown"
Feb 17 17:31:38 k8s-node01 kubelet[2973]: E0217 17:31:38.765195 2973 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/60f408b8f75a93f1de487f7da3f716d1eb9b20c1f87124a7b20690698e23af82/log.json: no such file or directory): fork/exec /etc/containerd/config.toml: permission denied: unknown
Feb 17 17:31:38 k8s-node01 kubelet[2973]: E0217 17:31:38.765266 2973 kuberuntime_sandbox.go:69] CreatePodSandbox for pod "simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/60f408b8f75a93f1de487f7da3f716d1eb9b20c1f87124a7b20690698e23af82/log.json: no such file or directory): fork/exec /etc/containerd/config.toml: permission denied: unknown
Feb 17 17:31:38 k8s-node01 kubelet[2973]: E0217 17:31:38.765298 2973 kuberuntime_manager.go:732] createPodSandbox for pod "simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/60f408b8f75a93f1de487f7da3f716d1eb9b20c1f87124a7b20690698e23af82/log.json: no such file or directory): fork/exec /etc/containerd/config.toml: permission denied: unknown
Feb 17 17:31:38 k8s-node01 kubelet[2973]: E0217 17:31:38.765372 2973 pod_workers.go:191] Error syncing pod db87e73f-c6b5-4108-b256-e2ae276f0fbc ("simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)"), skipping: failed to "CreatePodSandbox" for "simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)" with CreatePodSandboxError: "CreatePodSandbox for pod \"simple2-migration-39_default(db87e73f-c6b5-4108-b256-e2ae276f0fbc)\" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/k8s.io/60f408b8f75a93f1de487f7da3f716d1eb9b20c1f87124a7b20690698e23af82/log.json: no such file or directory): fork/exec /etc/containerd/config.toml: permission denied: unknown

My /etc/containerd/config.toml:

...
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  default_runtime_name = "runc"
  no_pivot = false
  disable_snapshot_annotations = false
  discard_unpacked_layers = false
  [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
    runtime_engine = "/etc/containerd/config.toml"
    runtime_root = ""
    privileged_without_host_devices = false
    base_runtime_spec = ""
  [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
    runtime_type = ""
    runtime_engine = ""
...

Is something wrong in config.toml?
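
(For reference: the kubelet error above, "fork/exec /etc/containerd/config.toml: permission denied", points at the same runtime_engine mistake on k8s-node01: containerd is told to execute the config file itself as the OCI runtime. A hedged sketch of the likely fix, assuming runc is installed at /usr/local/bin/runc as shown in a log later in this thread:)

[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
  runtime_type = "io.containerd.runtime.v1.linux"
  # must point at the runc binary, not at the containerd config file
  runtime_engine = "/usr/local/bin/runc"
  runtime_root = ""

Then restart containerd on that node (systemctl restart containerd).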

vutuong commented 2 years ago

Well, in my implementation, the migrate process includes both the checkpoint and the restore process. So if you want to migrate a pod, in your case simp, you first need a pod running with the name simp; then you can simply run the command kubectl migrate simp k8s-node01. The pod will be checkpointed and restored on k8s-node01. You don't need to checkpoint before restore. As I see it, your pod name is simple2, so the log says there is no pod named simp running in your cluster. Since you said your checkpoint process succeeded, the checkpointed data were created in /var/lib/kubelet/migration/, so I think the migrate process should work too.

P.S.: In my scope, I created the checkpoint command because I need to save the checkpoint info as an image. Then, based on this image, I can start another pod that runs the application from the checkpointed point without starting the app from scratch.

I didn't create a kubectl restore command at the time I wrote this document, but you are welcome to contribute one.
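
(A minimal sketch of the flow described above, reusing the pod and node names from the earlier examples; output abbreviated.)

# a source pod named "simple" must already be running
kubectl get pods -o wide          # simple   Running   ...   k8s-node02

# checkpoint and restore in one step
kubectl migrate simple k8s-node01

# shortly afterwards the restored pod runs on the destination node
kubectl get pods -o wide          # simple                Terminating ... k8s-node02
                                  # simple-migration-<N>  Running     ... k8s-node01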

fjibj commented 2 years ago

[root@k8s-master01 ~]# kubectl get pod -o wide
NAME                  READY   STATUS        RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
simple                1/1     Terminating   0          21h     10.244.1.6     k8s-node02
simple-migration-29   1/1     Running       0          8m27s   10.244.2.111   k8s-node01

OK!

fjibj commented 2 years ago

Thanks a lot. What you describe ("...the checkpoint info as an image. Then, based on this image I can start the other pod that runs the application from the checkpointed point without starting the app from scratch.") is exactly what I want: an image that includes the checkpoint info and can be started in another pod, restoring from the checkpoint, at any time and anywhere.

vutuong commented 2 years ago

OK. But the restore process needs a template to init a pod, and then loads the checkpointed info into it to restore the app. So restore is not just a single command like checkpoint; it needs a template to init a new pod first. That is why I didn't create a kubectl restore command. However, you can use the sample template in https://github.com/SSU-DCN/podmigration-operator/blob/main/config/samples/podmig_v1_restore.yaml to start applications from a checkpoint image, with the path of the checkpoint data, called snapshotPath, defined inside.
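
(A rough sketch of what such a template might look like, pieced together from the fields visible in the API responses earlier in this thread; the apiVersion, the action value, and the image are assumptions, so please check the linked podmig_v1_restore.yaml for the exact schema.)

apiVersion: podmig.dcn.ssu.ac.kr/v1          # API group taken from the controller log; version assumed
kind: Podmigration
metadata:
  name: restore-example
spec:
  replicas: 1
  action: restore                            # assumed value; see the sample file for the real one
  snapshotPath: /var/lib/kubelet/migration/kkk/simple2   # path to existing checkpointed data
  template:
    metadata:
      labels:
        podmig: dcn
    spec:
      containers:
      - name: simple2
        image: nginx                         # placeholder; use the original application image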

fjibj commented 2 years ago

when I exec "kubectl checkpoint" to a Centos7 pod, it is fail.

dump.log:

(80.152763) mnt: Inspecting sharing on 403 shared_id 153 master_id 0 (@./proc)
(80.152766) mnt: Inspecting sharing on 402 shared_id 152 master_id -1 (@./)
(80.152772) Error (criu/mount.c:627): mnt: FS mnt ./sys/kernel/config dev 0x27 root / unsupported id 386
(80.152791) Unlock network
(80.152795) Running network-unlock scripts
(80.152798)     RPC
(80.157163) Unfreezing tasks into 1
(80.157178)     Unseizing 31321 into 1
(80.157193)     Unseizing 31355 into 1
(80.157203)     Unseizing 31377 into 1
(80.157212)     Unseizing 31382 into 1
(80.157225)     Unseizing 31383 into 1
(80.157231)     Unseizing 31405 into 1
(80.157292)     Unseizing 31428 into 1
(80.157307)     Unseizing 31429 into 1
(80.157358) Error (criu/cr-dump.c:1788): Dumping FAILED.

Do you know how to fix it?

vutuong commented 2 years ago

Did our document work with a simple example pod? If it did, maybe there is a problem with the application in your pod; I mean, CRIU cannot handle checkpointing your pod. Could you please give information about how you run your pod, and your pod yaml file?

fjibj commented 2 years ago

In detail, I run a pod that supports remote access, whose image is pulled from a private Harbor registry:

kubectl apply -f rdp.yaml

rdp.yaml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    name: centosrdp
    rdpserver: centosrdp
  name: centosrdp
spec:
  replicas: 1
  selector:
    matchLabels:
      appname: centosrdp
  serviceName: centosrdp-inner
  template:
    metadata:
      labels:
        appname: centosrdp
        rdpserver: centosrdp
      annotations:
        cni.projectcalico.org/ipAddrs: "[\"172.20.200.204\"]"
    spec:
      containers:
      - image: 172.32.150.15/nlsxpt_raw/centos-xrdp-pinyin:7.6.1810.20220216
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        name: centosrdp
        resources:
          limits:
            cpu: 1500m
            memory: 3000Mi
          requests:
            cpu: 500m
            memory: 500Mi
        ports:
        - containerPort: 3389
          protocol: TCP
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        volumeMounts:
          - mountPath: /media
            name: installpkg
            readOnly: true
      restartPolicy: Always
      securityContext: {}
      volumes:
      - hostPath:
          path: /home
        name: installpkg
---
apiVersion: v1
kind: Service
metadata:
  name: centosrdp
spec:
  ports:
  - name: rdp
    nodePort: 31004
    port: 3389
    protocol: TCP
    targetPort: 3389
  - name: ssh
    nodePort: 32004
    port: 22
    protocol: TCP
    targetPort: 22
  selector:
    rdpserver: centosrdp
  sessionAffinity: None
  type: NodePort

# kubectl get pod -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
centosrdp-0   1/1     Running   0          15h   10.244.2.118   k8s-node01   <none>           <none>

It's OK; I can access it with the Win10 remote desktop.

But when I execute the checkpoint command:

# kubectl checkpoint centosrdp-0 /var/lib/kubelet/migration/
Operation cannot be fulfilled on pods "centosrdp-0": the object has been modified; please apply your changes to the latest version and try again

In k8s-node01's terminal:

# journalctl -xfu kubelet
-- Logs begin at Sat 2022-02-12 09:09:04 CST. --
Feb 23 08:38:39 k8s-node01 kubelet[479]: I0223 08:38:39.417202     479 kuberuntime_manager.go:841] Should we migrate?Runningfalse
......
Feb 23 08:45:18 k8s-node01 kubelet[479]: I0223 08:45:18.661708     479 kubelet.go:1505] Checkpoint the firstime running pod to use for other scale without booting from scratch: %+vcentosrdp-0
Feb 23 08:45:18 k8s-node01 kubelet[479]: E0223 08:45:18.732759     479 remote_runtime.go:289] CheckpointContainer "78e59f6d3b57068f525943f73e68687201a54aead6a4eb4d00adbd3ed763c659" from runtime service failed: rpc error: code = Unknown desc = failed to checkpoint container: /usr/local/bin/runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/k8s.io/78e59f6d3b57068f525943f73e68687201a54aead6a4eb4d00adbd3ed763c659/criu-dump.log: unknown
Feb 23 08:45:18 k8s-node01 kubelet[479]: I0223 08:45:18.733495     479 kuberuntime_manager.go:841] Should we migrate?Runningfalse
Feb 23 08:45:27 k8s-node01 kubelet[479]: I0223 08:45:27.417173     479 kuberuntime_manager.go:841] Should we migrate?Runningfalse

Then I get

# cat /run/containerd/io.containerd.runtime.v1.linux/k8s.io/78e59f6d3b57068f525943f73e68687201a54aead6a4eb4d00adbd3ed763c659/criu-dump.log
......
(00.012507) mnt: Inspecting sharing on 402 shared_id 152 master_id -1 (@./)
(00.012513) Error (criu/mount.c:627): mnt: FS mnt ./sys/kernel/config dev 0x27 root / unsupported id 386
(00.012532) Unlock network
(00.012536) Running network-unlock scripts
(00.012539)     RPC
(00.016382) Unfreezing tasks into 1
(00.016401)     Unseizing 31321 into 1
(00.016419)     Unseizing 31355 into 1
(00.016426)     Unseizing 31377 into 1
(00.016435)     Unseizing 31382 into 1
(00.016465)     Unseizing 31383 into 1
(00.016471)     Unseizing 31405 into 1
(00.016520)     Unseizing 31428 into 1
(00.016530)     Unseizing 31429 into 1
(00.016582) Error (criu/cr-dump.c:1788): Dumping FAILED.

It seems like CRIU fails on a mount under / (the ./sys/kernel/config mount).

vutuong commented 2 years ago

I'm not sure, but I think the problem is related to the mounts you set in the pod. Maybe CRIU cannot handle it if you set the volumeMounts: to readOnly: true. Please try again without it. And please note that in our implementation, if you want to restore the pod with its mount info, you have to move the mount info along with the pod itself: if you try to restore the pod on another node, the new node must have the same mount path info as the source node.

vutuong commented 2 years ago

Hi @fjibj, did my response answer your question, and can I close this issue? Thanks

120L020314 commented 9 months ago

@fjibj Hello, sorry to disturb you. I have the same error: when I run kubectl checkpoint or migrate, nothing happens. How did you solve this problem? Can you teach me? Thank you very much!