SSU-DCN / podmigration-operator

MIT License

OCI runtime restore failed: criu failed: type RESTORE errno 0 #15

Closed: Paper-Dragon closed this issue 1 year ago

Paper-Dragon commented 1 year ago

Hi @vutuong:

Problem Description

When I execute the command kubectl migrate simple k8s-node2, the new pod ends up with status StartError. I'm sure NFS is properly configured and has read/write permissions. I think the checkpoint part of the migration was successful, but the restore does not work. I badly need your help. And Merry Christmas to you too!

When I check the NFS path, it looks like this:

root@k8s-master:~# ls -lh /var/lib/kubelet/migration/kkk/simple/
total 4.0K
drwxr-xr-x 2 root root 4.0K Dec 29 09:42 count
root@k8s-master:~# ls -lh /var/lib/kubelet/migration/kkk/simple/count/
total 8.0K
-rw------- 1 root root 47 Dec 29 09:42 descriptors.json
-rw-r--r-- 1 root root 12 Dec 29 09:42 seccomp.img
root@k8s-master:~# cat /var/lib/kubelet/migration/kkk/simple/count/descriptors.json
["/dev/null","pipe:[5431577]","pipe:[5431578]"]
root@k8s-master:~# cat /var/lib/kubelet/migration/kkk/simple/count/seccomp.img
CVTI0Ad
root@k8s-master:~#
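Since the failure could also come from the checkpoint directory itself, a small helper can confirm that a path is actually readable and writable from a given node. This is a generic sketch, not part of the operator; it works on any directory, with the NFS migration path above being the obvious argument:

```shell
#!/bin/sh
# check_rw DIR: verify that DIR allows creating a file, reading it back,
# and removing it again. Prints a verdict and returns non-zero on failure.
check_rw() {
    dir="$1"
    probe="$dir/.rw_probe.$$"
    if ! echo ok > "$probe" 2>/dev/null; then
        echo "NOT WRITABLE: $dir"
        return 1
    fi
    if [ "$(cat "$probe" 2>/dev/null)" != "ok" ]; then
        echo "NOT READABLE: $dir"
        rm -f "$probe"
        return 1
    fi
    rm -f "$probe"
    echo "RW OK: $dir"
}

# Run on every node that mounts the NFS export, e.g.:
# check_rw /var/lib/kubelet/migration
```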

Below is my execution process:

# kubectl create -f 1.yaml
pod/simple created
# kubectl migrate simple k8s-node2
response Status: 200 OK
{
 "name": "simple-migration-controller-64",
 "destHost": "k8s-node2",
 "replicas": 0,
 "selector": {
  "matchLabels": {
   "podmig": "dcn"
  }
 },
 "action": "live-migration",
 "snapshotPath": "",
 "sourcePod": "simple",
 "template": {
  "metadata": {
   "creationTimestamp": null
  },
  "spec": {
   "containers": null
  }
 },
 "status": {
  "state": "",
  "currentRevision": "",
  "activePod": ""
 }
}

# kubectl get po
NAME                  READY   STATUS        RESTARTS   AGE
simple                1/1     Terminating   0          23s
simple-migration-94   0/1     StartError    0          4s

# kubectl describe po simple-migration-94
Name:         simple-migration-94
Namespace:    default
Priority:     0
Node:         k8s-node2/11.0.1.138
Start Time:   Thu, 29 Dec 2022 09:42:11 +0800
Labels:       name=simple
Annotations:  snapshotPath: /var/lib/kubelet/migration/kkk/simple
              snapshotPolicy: restore
              sourcePod: simple
Status:       Running
IP:           10.244.2.5
IPs:
  IP:           10.244.2.5
Controlled By:  Podmigration/simple-migration-controller-64
Containers:
  count:
    Container ID:  containerd://961d9bd7dc735ff7c7ee19ea63d9d78e14950e940b7db8c5950fa79f387d11c3
    Image:         alpine
    Image ID:      docker.io/library/alpine@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
    Port:          80/TCP
    Host Port:     0/TCP
    State:         Terminated
      Reason:      StartError
      Message:     failed to start containerd task "961d9bd7dc735ff7c7ee19ea63d9d78e14950e940b7db8c5950fa79f387d11c3": OCI runtime restore failed: criu failed: type RESTORE errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/961d9bd7dc735ff7c7ee19ea63d9d78e14950e940b7db8c5950fa79f387d11c3/restore.log: unknown
      Exit Code:  128
      Started:    Thu, 01 Jan 1970 08:00:00 +0800
      Finished:   Thu, 29 Dec 2022 09:42:25 +0800
    Last State:   Terminated
      Reason:     StartError
      Message:    failed to start containerd task "953423e44ff50197139ca09d08697258d37ca38f1a90c73eb337b29381291e80": OCI runtime restore failed: criu failed: type RESTORE errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/953423e44ff50197139ca09d08697258d37ca38f1a90c73eb337b29381291e80/restore.log: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 08:00:00 +0800
      Finished:     Thu, 29 Dec 2022 09:42:23 +0800
    Ready:          False
    Restart Count:  5
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fn8v9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-fn8v9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fn8v9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/hostname=k8s-node2
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age               From                Message
  ----    ------     ----              ----                -------
  Normal  Scheduled  15s                                   Successfully assigned default/simple-migration-94 to k8s-node2
  Normal  Pulled     13s               kubelet, k8s-node2  Successfully pulled image "alpine" in 1.896999236s
  Normal  Pulled     11s               kubelet, k8s-node2  Successfully pulled image "alpine" in 1.862067847s
  Normal  Pulled     9s                kubelet, k8s-node2  Successfully pulled image "alpine" in 1.88265987s
  Normal  Pulled     7s                kubelet, k8s-node2  Successfully pulled image "alpine" in 1.800737119s
  Normal  Pulled     4s                kubelet, k8s-node2  Successfully pulled image "alpine" in 2.293742079s
  Normal  Created    2s (x6 over 13s)  kubelet, k8s-node2  Created container count
  Normal  Started    2s (x6 over 13s)  kubelet, k8s-node2  Restored container count from checkpoint /var/lib/kubelet/migration/kkk/simple/count
  Normal  Pulled     2s                kubelet, k8s-node2  Successfully pulled image "alpine" in 1.775652059s
  Normal  Pulling    1s (x7 over 15s)  kubelet, k8s-node2  Pulling image "alpine"
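The real CRIU failure reason is in the restore.log file named in the error message above, not in the kubelet events. A small sketch to dump the Error lines of every restore.log under a containerd state directory (the directory argument is taken from the error message above; adjust it for your runtime setup):

```shell
#!/bin/sh
# print_restore_errors ROOT: find every restore.log under ROOT and print
# its Error lines (or a placeholder when a log contains none).
print_restore_errors() {
    root="$1"
    find "$root" -name restore.log 2>/dev/null |
    while read -r f; do
        echo "== $f"
        grep -n 'Error' "$f" || echo "(no Error lines)"
    done
}

# On the destination node, e.g.:
# print_restore_errors /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io
```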

root@k8s-node2:~# journalctl -xe -u kubelet -f

d3561" from runtime service failed: rpc error: code = FailedPrecondition desc = failed to delete containerd container "504d19001e5ce45ed7e38189415165e88e2f86284528d5046bb129ae6dfd3561": cannot delete running task 504d19001e5ce45ed7e38189415165e88e2f86284528d5046bb129ae6dfd3561: failed precondition
Dec 29 09:59:39 k8s-node2 kubelet[902]: E1229 09:59:39.512177     902 kuberuntime_gc.go:146] Failed to remove container "504d19001e5ce45ed7e38189415165e88e2f86284528d5046bb129ae6dfd3561": rpc error: code = FailedPrecondition desc = failed to delete containerd container "504d19001e5ce45ed7e38189415165e88e2f86284528d5046bb129ae6dfd3561": cannot delete running task 504d19001e5ce45ed7e38189415165e88e2f86284528d5046bb129ae6dfd3561: failed precondition
Dec 29 09:59:39 k8s-node2 kubelet[902]: I1229 09:59:39.512186     902 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7
Dec 29 09:59:39 k8s-node2 kubelet[902]: E1229 09:59:39.513015     902 remote_runtime.go:325] RemoveContainer "a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7" from runtime service failed: rpc error: code = FailedPrecondition desc = failed to delete containerd container "a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7": cannot delete running task a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7: failed precondition
Dec 29 09:59:39 k8s-node2 kubelet[902]: E1229 09:59:39.513043     902 kuberuntime_gc.go:146] Failed to remove container "a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7": rpc error: code = FailedPrecondition desc = failed to delete containerd container "a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7": cannot delete running task a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7: failed precondition
Dec 29 09:59:42 k8s-node2 kubelet[902]: I1229 09:59:42.941764     902 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7
Dec 29 09:59:42 k8s-node2 kubelet[902]: I1229 09:59:42.941915     902 kuberuntime_manager.go:841] Should we migrate?Runningtrue
Dec 29 09:59:45 k8s-node2 kubelet[902]: E1229 09:59:45.180272     902 remote_runtime.go:224] CreateContainer in sandbox "5739f83ef83ca7822ae9b96f7e4bebe425d4bbd1c36101643665d1c566ac14a7" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26": name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26" is reserved for "5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990"
Dec 29 09:59:45 k8s-node2 kubelet[902]: E1229 09:59:45.180334     902 kuberuntime_manager.go:867] container creation failed: CreateContainerError: failed to reserve container name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26": name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26" is reserved for "5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990"
Dec 29 09:59:45 k8s-node2 kubelet[902]: E1229 09:59:45.180356     902 pod_workers.go:191] Error syncing pod 86ff7103-b773-41c3-b3f6-3b1f16233189 ("simple-migration-94_default(86ff7103-b773-41c3-b3f6-3b1f16233189)"), skipping: failed to "StartContainer" for "count" with CreateContainerError: "failed to reserve container name \"count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26\": name \"count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26\" is reserved for \"5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990\""
Dec 29 09:59:56 k8s-node2 kubelet[902]: I1229 09:59:56.941887     902 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7
Dec 29 09:59:56 k8s-node2 kubelet[902]: I1229 09:59:56.942025     902 kuberuntime_manager.go:841] Should we migrate?Runningtrue
Dec 29 09:59:58 k8s-node2 kubelet[902]: E1229 09:59:58.978082     902 remote_runtime.go:224] CreateContainer in sandbox "5739f83ef83ca7822ae9b96f7e4bebe425d4bbd1c36101643665d1c566ac14a7" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26": name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26" is reserved for "5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990"
Dec 29 09:59:58 k8s-node2 kubelet[902]: E1229 09:59:58.978149     902 kuberuntime_manager.go:867] container creation failed: CreateContainerError: failed to reserve container name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26": name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26" is reserved for "5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990"
Dec 29 09:59:58 k8s-node2 kubelet[902]: E1229 09:59:58.978170     902 pod_workers.go:191] Error syncing pod 86ff7103-b773-41c3-b3f6-3b1f16233189 ("simple-migration-94_default(86ff7103-b773-41c3-b3f6-3b1f16233189)"), skipping: failed to "StartContainer" for "count" with CreateContainerError: "failed to reserve container name \"count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26\": name \"count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26\" is reserved for \"5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990\""
Dec 29 10:00:09 k8s-node2 kubelet[902]: I1229 10:00:09.941740     902 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: a2d875902f4a441b441ad0552f529288fabfe815139395a3a0a104fdd3d1d6d7
Dec 29 10:00:09 k8s-node2 kubelet[902]: I1229 10:00:09.941862     902 kuberuntime_manager.go:841] Should we migrate?Runningtrue
Dec 29 10:00:11 k8s-node2 kubelet[902]: E1229 10:00:11.740547     902 remote_runtime.go:224] CreateContainer in sandbox "5739f83ef83ca7822ae9b96f7e4bebe425d4bbd1c36101643665d1c566ac14a7" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26": name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26" is reserved for "5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990"
Dec 29 10:00:11 k8s-node2 kubelet[902]: E1229 10:00:11.740605     902 kuberuntime_manager.go:867] container creation failed: CreateContainerError: failed to reserve container name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26": name "count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26" is reserved for "5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990"
Dec 29 10:00:11 k8s-node2 kubelet[902]: E1229 10:00:11.740628     902 pod_workers.go:191] Error syncing pod 86ff7103-b773-41c3-b3f6-3b1f16233189 ("simple-migration-94_default(86ff7103-b773-41c3-b3f6-3b1f16233189)"), skipping: failed to "StartContainer" for "count" with CreateContainerError: "failed to reserve container name \"count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26\": name \"count_simple-migration-94_default_86ff7103-b773-41c3-b3f6-3b1f16233189_26\" is reserved for \"5e9789c5e2b8749eca8ab08662ca9a815896964fcc4d48ab3ab3a10f457c4990\""
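The "cannot delete running task" errors suggest containerd still holds live tasks for containers the kubelet considers dead, which would also keep the container name reserved. A hedged cleanup sketch: it only prints the `ctr` commands for review instead of executing them (the `k8s.io` namespace and the sample task ID come from the logs above; verify the exact `ctr` flags against your containerd build, since the one here is a patched CI build):

```shell
#!/bin/sh
# cleanup_cmds ID...: for each leftover containerd task/container ID,
# print the ctr commands that would force-kill the task and remove the
# task and container records in the k8s.io namespace.
cleanup_cmds() {
    for id in "$@"; do
        echo "ctr -n k8s.io task kill -s SIGKILL $id"
        echo "ctr -n k8s.io task delete $id"
        echo "ctr -n k8s.io container delete $id"
    done
}

# Example with a stuck task ID from the kubelet log above; review the
# output, then pipe it to sh if it looks right:
# cleanup_cmds 504d19001e5ce45ed7e38189415165e88e2f86284528d5046bb129ae6dfd3561
```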

When I run crictl ps -a on the destination node, I see that many containers have been created, but their state is Exited:

root@k8s-node2:~# crictl ps -a
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0000] unable to determine image API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
CONTAINER           IMAGE               CREATED             STATE               NAME                 ATTEMPT             POD ID              POD
5e9789c5e2b87       49176f190c7e9       15 minutes ago      Exited              count                26                  5739f83ef83ca       simple-migration-94
a2d875902f4a4       49176f190c7e9       15 minutes ago      Exited              count                25                  5739f83ef83ca       simple-migration-94
504d19001e5ce       49176f190c7e9       15 minutes ago      Exited              count                24                  5739f83ef83ca       simple-migration-94
ee150ca848d35       49176f190c7e9       15 minutes ago      Exited              count                23                  5739f83ef83ca       simple-migration-94
856ffaa9cf7e8       49176f190c7e9       15 minutes ago      Exited              count                22                  5739f83ef83ca       simple-migration-94
57da389ba2c98       49176f190c7e9       15 minutes ago      Exited              count                21                  5739f83ef83ca       simple-migration-94
e14fdc9ed14e7       49176f190c7e9       15 minutes ago      Exited              count                20                  5739f83ef83ca       simple-migration-94
704683c95a26c       49176f190c7e9       15 minutes ago      Exited              count                19                  5739f83ef83ca       simple-migration-94
a82eb0236cd66       49176f190c7e9       15 minutes ago      Exited              count                18                  5739f83ef83ca       simple-migration-94
dd11c68443edb       49176f190c7e9       15 minutes ago      Exited              count                17                  5739f83ef83ca       simple-migration-94
b5cdeb280d516       49176f190c7e9       15 minutes ago      Exited              count                16                  5739f83ef83ca       simple-migration-94
5abd71ac5fc51       49176f190c7e9       15 minutes ago      Exited              count                15                  5739f83ef83ca       simple-migration-94
799fd314a7e69       49176f190c7e9       15 minutes ago      Exited              count                14                  5739f83ef83ca       simple-migration-94
e2b4c150e1938       49176f190c7e9       15 minutes ago      Exited              count                13                  5739f83ef83ca       simple-migration-94
98cfb270f0b6f       49176f190c7e9       15 minutes ago      Exited              count                12                  5739f83ef83ca       simple-migration-94
c85eb43a8951c       49176f190c7e9       15 minutes ago      Exited              count                11                  5739f83ef83ca       simple-migration-94
f9c908f97071c       49176f190c7e9       15 minutes ago      Exited              count                10                  5739f83ef83ca       simple-migration-94
dfabe3e2afc3a       49176f190c7e9       15 minutes ago      Exited              count                9                   5739f83ef83ca       simple-migration-94
0057906df9734       49176f190c7e9       15 minutes ago      Exited              count                8                   5739f83ef83ca       simple-migration-94
4f91e6727a276       49176f190c7e9       15 minutes ago      Exited              count                7                   5739f83ef83ca       simple-migration-94
d851fea9e8985       49176f190c7e9       15 minutes ago      Exited              count                6                   5739f83ef83ca       simple-migration-94
961d9bd7dc735       49176f190c7e9       15 minutes ago      Exited              count                5                   5739f83ef83ca       simple-migration-94
953423e44ff50       49176f190c7e9       15 minutes ago      Exited              count                4                   5739f83ef83ca       simple-migration-94
890e62d37d3ea       49176f190c7e9       15 minutes ago      Exited              count                3                   5739f83ef83ca       simple-migration-94
3c43cbb3cb816       49176f190c7e9       16 minutes ago      Exited              count                2                   5739f83ef83ca       simple-migration-94
ec73b2bbda076       49176f190c7e9       16 minutes ago      Exited              count                1                   5739f83ef83ca       simple-migration-94
e69365c951050       49176f190c7e9       16 minutes ago      Exited              count                0                   5739f83ef83ca       simple-migration-94
f00f049b73b69       b5c6c9203f83e       20 minutes ago      Running             kube-flannel         2                   b10310a256475       kube-flannel-ds-hx2fk
a5efc9c031d98       8bbb057ceb165       20 minutes ago      Running             kube-proxy           2                   9b65de4bb8e27       kube-proxy-4njcv
fb24e1ee7c5a5       b5c6c9203f83e       20 minutes ago      Exited              install-cni          0                   b10310a256475       kube-flannel-ds-hx2fk
bd8b2d4fa60e3       7a2dcab94698c       20 minutes ago      Exited              install-cni-plugin   2                   b10310a256475       kube-flannel-ds-hx2fk
451c8111606a0       b5c6c9203f83e       23 hours ago        Exited              kube-flannel         1                   61f9d14dd2fb7       kube-flannel-ds-hx2fk
1b9a96ed873aa       8bbb057ceb165       23 hours ago        Exited              kube-proxy           1                   7881de94e9c36       kube-proxy-4njcv
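As an aside, the dockershim endpoint warnings printed by crictl above can be silenced by pointing it at the containerd socket explicitly, e.g. via a standard /etc/crictl.yaml (the socket path below assumes the default containerd install):

```yaml
# /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
```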

make run

/root/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
/root/go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
go run ./main.go
2022-12-29T09:38:48.623+0800    INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": ":8081"}
2022-12-29T09:38:48.623+0800    INFO    setup   starting manager
2022-12-29T09:38:48.724+0800    INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
2022-12-29T09:38:48.724+0800    INFO    controller      Starting EventSource    {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "source": "kind source: /, Kind="}
2022-12-29T09:38:48.824+0800    INFO    controller      Starting Controller     {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration"}
2022-12-29T09:38:48.824+0800    INFO    controller      Starting workers        {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "worker count": 1}
2022-12-29T09:42:11.034+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "print test": {"sourcePod":"simple","destHost":"k8s-node2","selector":{"matchLabels":{"podmig":"dcn"}},"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":[]}},"action":"live-migration"}}
2022-12-29T09:42:11.037+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "annotations ": ""}
2022-12-29T09:42:11.037+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "number of existing pod ": 0}
2022-12-29T09:42:11.037+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "desired pod ": {"namespace": "default", "name": ""}}
2022-12-29T09:42:11.037+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "number of desired pod ": 0}
2022-12-29T09:42:11.037+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "number of actual running pod ": 0}
2022-12-29T09:42:11.056+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 1 - Check source pod is exist or not - completed"}
2022-12-29T09:42:11.056+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "sourcePod ok ": {"apiVersion": "v1", "kind": "Pod", "namespace": "default", "name": "simple"}}
2022-12-29T09:42:11.056+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "sourcePod status ": "Running"}
2022-12-29T09:42:11.061+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 2 - checkpoint source Pod - completed"}
2022-12-29T09:42:11.061+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "live-migration pod": "count"}
2022-12-29T09:42:11.764+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "checkpointPath/var/lib/kubelet/migration/kkk/simple"}
2022-12-29T09:42:11.764+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 3 - Wait until checkpoint info are created - completed"}
2022-12-29T09:42:11.768+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 4 - Restore destPod from sourcePod's checkpointed info - completed"}
2022-12-29T09:42:15.176+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 4.1 - Check whether if newPod is Running or not - completedsimple-migration-94Running"}
2022-12-29T09:42:15.176+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 4.1 - Check whether if newPod is Running or not - completed"}
2022-12-29T09:42:15.181+0800    INFO    controllers.Podmigration                {"podmigration": "default/simple-migration-controller-64", "Live-migration": "Step 6 - Delete the source pod - completed"}
2022-12-29T09:42:15.181+0800    DEBUG   controller      Successfully Reconciled {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "name": "simple-migration-controller-64", "namespace": "default"}

go run ./api-server/cmd/main.go

2022-12-29T09:40:25.556+0800    INFO    podmigration-cp.run     starting api-server manager
2022-12-29T09:40:25.556+0800    INFO    api-server      Starting api-server     {"interface": "0.0.0.0", "port": ":5000"}
&{simple-migration-controller-64 k8s-node2 0 &LabelSelector{MatchLabels:map[string]string{podmig: dcn,},MatchExpressions:[]LabelSelectorRequirement{},} live-migration  simple {{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} {[] [] [] []  <nil> <nil>  map[]   <nil>  false false false <nil> nil []   nil  [] []  <nil> nil [] <nil> <nil> <nil> map[] [] <nil> }} <nil>}
simple

What did I do?

I created a pod with 1.yaml, as follows:

apiVersion: v1
kind: Pod
metadata:
  name: simple
  labels:
    name: simple
  #annotations:
    #snapshotPolicy: "checkpoint"
    #snapshotPath: "/var/lib/kubelet/migration/abc"
spec:
  containers:
  - name: count
    image: alpine
    # imagePullPolicy: IfNotPresent
    command: ["/bin/ash", "-c", "i=1; while true; do echo $i; i=$((i+1)); sleep 1; done"]
    ports:
    - containerPort: 80
    resources:
      limits:
        memory: "128Mi"
        cpu: "600m"
  nodeSelector:
    kubernetes.io/hostname: k8s-node1

and then ran the command:

kubectl migrate simple k8s-node2

Host environment

root@k8s-master:~# kubectl get nodes -o wide --show-labels
NAME         STATUS   ROLES    AGE   VERSION                                    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE           KERNEL-VERSION      CONTAINER-RUNTIME      LABELS
k8s-master   Ready    master   33h   v1.19.0-beta.0.1010+a94a66e8033cf4-dirty   11.0.1.136    <none>        Ubuntu 20.04 LTS   5.4.0-26-generic    containerd://Unknown   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-node1    Ready    <none>   33h   v1.19.0-beta.0.1010+a94a66e8033cf4-dirty   11.0.1.137    <none>        Ubuntu 20.04 LTS   5.4.0-26-generic    containerd://Unknown   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1,kubernetes.io/os=linux
k8s-node2    Ready    <none>   33h   v1.19.0-beta.0.1010+a94a66e8033cf4-dirty   11.0.1.138    <none>        Ubuntu 20.04 LTS   5.15.0-56-generic   containerd://Unknown   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux
root@k8s-node1:~# criu -V
Version: 3.14
root@k8s-node1:~# criu check
Looks good.
root@k8s-node1:~# criu check --all
Warn  (criu/cr-check.c:1230): clone3() with set_tid not supported
Error (criu/cr-check.c:1272): Time namespaces are not supported
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
root@k8s-node1:~#

root@k8s-node2:~# criu -V
Version: 3.14
root@k8s-node2:~# criu check
Looks good.
root@k8s-node2:~# criu check --all
Looks good.

root@k8s-node1:~# containerd --version
WARN[2022-12-29T10:07:56.161416754+08:00] This customized containerd is only for CI test, DO NOT use it for distribution.
containerd github.com/containerd/containerd e5ffc7a4-TEST

root@k8s-node2:~# containerd --version
WARN[2022-12-29T10:07:42.305625776+08:00] This customized containerd is only for CI test, DO NOT use it for distribution.
containerd github.com/containerd/containerd e5ffc7a4-TEST

Problem Summary and Conjectures

I think there is a problem in the interaction between CRIU and containerd, because the restore process is failing.

I really need your help, thank you!

Thanks, PaperDragon

Paper-Dragon commented 1 year ago

I rebooted the worker node and it came up running normally. Thank you, everyone!