chaosblade-io / chaosblade

An easy to use and powerful chaos engineering experiment toolkit. (An easy-to-use, powerful chaos experiment injection tool open-sourced by Alibaba.)
https://chaosblade.io
Apache License 2.0
5.98k stars 948 forks

Deleting a pod-network-delay rule fails when the pod restarts #485

Open bmbbms opened 3 years ago

bmbbms commented 3 years ago

Issue Description

bug report

Describe what happened (or what feature you want)

When I set a network delay rule for a pod, the pod's liveness probe fails and the pod is restarted. If I then try to delete the network delay rule, the deletion fails because the containerId changes when the pod restarts, while the rule keeps using the original containerId to remove the pod's network delay.

Describe what you expected to happen

So the containerId is not a reliable way to identify the target of a rule. When a delete fails, we should check whether the containerId in the Identifier has changed.
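
The check suggested above could look roughly like the following. This is only an illustrative Python sketch, not chaosblade's actual code: the function names are hypothetical, and the reading of the Identifier as namespace/node IP/pod name/container name/short container ID is inferred from the status output in this report. It also assumes the current container ID comes from the pod status in the usual Kubernetes form `<runtime>://<full id>`.

```python
from typing import Dict, NamedTuple, Optional


class Identifier(NamedTuple):
    # Field meanings are an assumption inferred from the status output in this issue.
    namespace: str
    node_ip: str
    pod_name: str
    container_name: str
    container_id: str  # 12-character short ID recorded when the rule was created


def parse_identifier(raw: str) -> Identifier:
    """Split an Identifier string into its five '/'-separated fields."""
    parts = raw.split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {raw!r}")
    return Identifier(*parts)


def short_container_id(status_container_id: str) -> str:
    """Reduce '<runtime>://<full id>' (or a bare ID) to its 12-character prefix."""
    _, _, full_id = status_container_id.rpartition("://")
    return full_id[:12]


def refresh_identifier(raw: str, container_ids: Dict[str, str]) -> Optional[str]:
    """Return an updated Identifier if the container ID changed, else None.

    container_ids maps container name to the containerID currently reported
    in the pod status (hypothetical input, to be fetched by the caller).
    """
    ident = parse_identifier(raw)
    current = container_ids.get(ident.container_name)
    if current is None:
        return None  # container no longer listed; nothing to refresh against
    current_short = short_container_id(current)
    if current_short == ident.container_id:
        return None  # unchanged, so the delete failed for another reason
    return "/".join(ident._replace(container_id=current_short))
```

If `refresh_identifier` returns a new Identifier, the delete could be retried against the restarted container instead of the one that no longer exists.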

How to reproduce it (as minimally and precisely as possible)

  1. First deploy a network delay for a pod
        Status:
          Exp Statuses:
            Action:  delay
            Res Statuses:
              Id:          b42b0ee218262ce9
              Identifier:  test-testing-dc-k2030/172.20.35.51/reliable-msg-route-5fdc8cc757-hwvdt/reliable-msg-route/18f0b9d032ce
              Kind:        pod
              State:       Success
              Success:     true
            Scope:         pod
            State:         Success
            Success:       true
            Target:        network
          Phase:           Running
        Events:            <none>
  2. Make sure the delay causes the pod's liveness probe to fail and the pod to restart
    test-testing-dc-k2030         reliable-msg-route-5fdc8cc757-hwvdt               1/1     Running            4          3d      192.168.137.81    172.20.35.51   <none>           <none>
  3. delete the rule

        Status:
          Exp Statuses:
            Action:  delay
            Error:   see resStatus for the error details
            Res Statuses:
              Error:       Error response from daemon: No such container: 18f0b9d032ce
              Id:          b42b0ee218262ce9
              Identifier:  test-testing-dc-k2030/172.20.35.51/reliable-msg-route-5fdc8cc757-hwvdt/reliable-msg-route/18f0b9d032ce
              Kind:        pod
              State:       Error
              Success:     false
            Scope:         pod
            State:         Success
            Success:       false
            Target:        network
          Phase:           Destroying


  4. If I force-delete the rule, the delay rule actually remains in effect in the pod

Tell us your environment

k8s v1.16.15
chaosblade-operator-v0.9.0

Anything else we need to know?

xcaspar commented 3 years ago

You can set the --daemonset-enable=false flag to turn off sidecar mode when deploying chaosblade-operator to solve the problem.

bmbbms commented 3 years ago

I see the default value of this parameter is false.

xcaspar commented 3 years ago

You can delete the pod to recover it. I will solve this problem later.

bmbbms commented 3 years ago

Actually, it works if I apply the rule again using --force; I can then delete the rule successfully before the pod's next restart. But I don't think that is a good way to handle it, so I reported the bug.

yzhang559 commented 3 years ago

@xcaspar I am using chaosblade-operator-v1.3.0 and k8s v1.21.4 and still face this issue. Will there be a fix in the next release, or is there any workaround to bypass it? Thanks.