aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Add check and stop network packet loss implementation #4354

Closed tshan2001 closed 1 month ago

tshan2001 commented 1 month ago

Summary

Adding implementation for StopNetworkPacketLoss() and CheckNetworkPacketLoss(), as well as corresponding unit tests.

Implementation details

The check logic for packet loss was already introduced as part of the start-network-packet-loss API implementation, so this PR only adds those existing helper methods to the check API call.

For stop packet loss, we first check whether a network packet loss fault already exists. If one does, we run the following commands to stop it:

tc qdisc del dev <interfaceName> parent 1:1 handle 10:
tc filter del dev <interfaceName> prio 1
tc qdisc del dev <interfaceName> root handle 1: prio

Testing

Unit tests for the TMDS package were run.

 % go test -tags unit -v -run TestCheckNetworkPacketLoss /workplace/tianzes/amazon-ecs-agent/ecs-agent/tmds/handlers/fault/v1/handlers
...
--- PASS: TestCheckNetworkPacketLoss (0.00s)

 % go test -tags unit -v -run TestStopNetworkPacketLoss /workplace/tianzes/amazon-ecs-agent/ecs-agent/tmds/handlers/fault/v1/handlers
...
--- PASS: TestStopNetworkPacketLoss (0.00s)

New tests cover the changes: in addition to the existing test cases for the handler, the following cases were added specifically for the start endpoint:

  1. When no fault exists on the instance
  2. When a packet loss fault exists
  3. When a latency fault exists
  4. When the request contains an unknown field but the rest of the payload is valid

Manual Testing

Now that all 3 packet loss APIs are implemented, we can test the complete workflow of start, check, and stop. Launched a Fargate instance with the changes, then launched a task with ecs-exec enabled.

# First, curl the check packet loss endpoint. Since we haven't injected anything yet, the response should be not-running
 % curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1"]}'
{"Status":"not-running"}

# Now curl the start endpoint
 % curl -X PUT ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1", "10.1.1.1", "25.168.10.2"]}'
{"Status":"running"}

# Curl the check endpoint again to confirm that the fault is running. Note that the IP address in the check payload differs from the ones in the start payload. This shouldn't matter because the check doesn't use it. If we agree that it isn't needed, we should take an action item to remove it from the check and stop payloads.
 % curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1"]}'
{"Status":"running"}

# Curl the stop endpoint with an arbitrary IP address
 % curl -X DELETE ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1"]}'
{"Status":"stopped"}

# Finally, curl the check endpoint again. The result should be not-running
 % curl -X GET ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.0.1"]}'
{"Status":"not-running"}
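The status transitions observed in the session above can be summarized with a small in-memory model. This is only an illustrative sketch: the `packetLossFault` type and its methods are hypothetical names, not the agent's handler code, and the behavior of `Stop` when no fault is running is an assumption, since the session above only stops a fault that was running.

```go
package main

import "fmt"

// packetLossFault is a hypothetical stand-in for the fault state on the
// task's network interface. It is not a type from the agent code base.
type packetLossFault struct {
	running bool
}

// Check reports the current status without changing it.
func (f *packetLossFault) Check() string {
	if f.running {
		return "running"
	}
	return "not-running"
}

// Start marks the fault as injected.
func (f *packetLossFault) Start() string {
	f.running = true
	return "running"
}

// Stop clears the fault. Assumption: it returns "stopped" even if no fault
// was running; the real endpoint's response in that case is not shown here.
func (f *packetLossFault) Stop() string {
	f.running = false
	return "stopped"
}

func main() {
	f := &packetLossFault{}
	// Same order as the manual test: check, start, check, stop, check.
	fmt.Println(f.Check())
	fmt.Println(f.Start())
	fmt.Println(f.Check())
	fmt.Println(f.Stop())
	fmt.Println(f.Check())
}
```

None of the methods take a payload, mirroring the observation in the session that the check and stop results do not depend on the `Sources` value sent.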

Additional manual testing:

# Start a task, ecs-exec into the container, and inject 50% packet loss to 8.8.8.8
sh-5.2# curl -X PUT ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":50, "Sources":["8.8.8.8"]}'

# From the task container, ping 8.8.8.8, let it run for 30 seconds, and manually interrupt to see the stats.
sh-5.2# ping 8.8.8.8 -D
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
...
^C
--- 8.8.8.8 ping statistics ---
61 packets transmitted, 22 received, 63.9344% packet loss, time 60826ms
rtt min/avg/max/mdev = 7.932/8.042/8.892/0.189 ms

# We can see that the packet loss has been started as expected.

Description for the changelog

Add check and stop network packet loss implementation

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.