aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Add start network packet loss implementation #4344

Closed tshan2001 closed 2 months ago

tshan2001 commented 2 months ago

Summary

Adds the implementation of StartNetworkPacketLoss() and the corresponding unit tests. Also updates the os/exec wrapper with two more helper methods to facilitate mocking in unit tests.

The unit test syntax is sampled from this pending PR: https://github.com/aws/amazon-ecs-agent/pull/4330/files.

Implementation details

When the start network-packet-loss endpoint is invoked, we first check whether a latency or packet loss fault already exists. If one does, we return status code 409 to indicate the conflict. Otherwise, the following commands are run to start the fault:

<nsenterPrefix> tc qdisc add dev <interfaceName> root handle 1: prio priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
<nsenterPrefix> "tc qdisc add dev <interfaceName> parent 1:1 handle 10: netem loss <lossPercentage>%"
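The flow above can be sketched roughly as follows. This is a minimal illustration, not the handler's actual code: the names `existingFault`, `detectFault`, `conflictFor`, and `startCommands` are hypothetical, and detecting a fault by looking for the `delay` and `loss` keywords in `tc qdisc show` output is an assumption based on the manual test output further down. The error strings and the 409 status are taken from this description.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// existingFault records which netem faults a `tc qdisc show` output contains.
// Hypothetical type, used only for this sketch.
type existingFault struct {
	latency    bool
	packetLoss bool
}

// detectFault scans tc output for netem lines and checks for the
// "delay" (latency) and "loss" (packet loss) keywords. Assumed heuristic.
func detectFault(tcShowOutput string) existingFault {
	var f existingFault
	for _, line := range strings.Split(tcShowOutput, "\n") {
		if !strings.Contains(line, "netem") {
			continue
		}
		for _, tok := range strings.Fields(line) {
			switch tok {
			case "delay":
				f.latency = true
			case "loss":
				f.packetLoss = true
			}
		}
	}
	return f
}

// conflictFor maps an existing fault to 409 Conflict, or 200 if none exists.
func conflictFor(f existingFault) (int, string) {
	switch {
	case f.packetLoss:
		return http.StatusConflict, "There is already one network packet loss fault running"
	case f.latency:
		return http.StatusConflict, "There is already one network latency fault running"
	}
	return http.StatusOK, ""
}

// startCommands builds the two tc invocations listed above as argv slices,
// each prefixed with the caller-supplied nsenter prefix.
func startCommands(nsenterPrefix []string, iface string, lossPercent int) [][]string {
	prio := []string{"tc", "qdisc", "add", "dev", iface, "root", "handle", "1:", "prio", "priomap"}
	for i := 0; i < 16; i++ {
		prio = append(prio, "2")
	}
	netem := []string{"tc", "qdisc", "add", "dev", iface, "parent", "1:1",
		"handle", "10:", "netem", "loss", fmt.Sprintf("%d%%", lossPercent)}
	return [][]string{
		append(append([]string{}, nsenterPrefix...), prio...),
		append(append([]string{}, nsenterPrefix...), netem...),
	}
}

func main() {
	// An existing packet loss fault yields a conflict.
	code, msg := conflictFor(detectFault("qdisc netem 10: parent 1:1 limit 1000 loss 6%"))
	fmt.Println(code, msg)

	// With no fault present, these commands would be executed.
	// The netns path here is a placeholder, not a real namespace.
	for _, c := range startCommands([]string{"nsenter", "--net=<netns path>"}, "eth1", 6) {
		fmt.Println(strings.Join(c, " "))
	}
}
```

Building each command as an argv slice rather than one shell string avoids quoting issues when the interface name or prefix comes from task metadata.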

Testing

Unit tests for the TMDS package were run.

 % go test -tags unit -v -run TestStartNetworkPacketLoss /workplace/tianzes/amazon-ecs-agent/ecs-agent/tmds/handlers/fault/v1/handlers
...
--- PASS: TestStartNetworkPacketLoss (0.00s)
    --- PASS: TestStartNetworkPacketLoss/no-existing-fault (0.00s)
    --- PASS: TestStartNetworkPacketLoss/existing-network-latency-fault (0.00s)
    --- PASS: TestStartNetworkPacketLoss/existing-network-packet-loss-fault (0.00s)
    --- PASS: TestStartNetworkPacketLoss/unknown-request-body-no-existing-fault (0.00s)
    --- PASS: TestStartNetworkPacketLoss/failed-to-unmarshal-json (0.00s)
    --- PASS: TestStartNetworkPacketLoss/os/exec-times-out (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_malformed_request_body_1 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_malformed_request_body_2 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_incomplete_request_body_1 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_incomplete_request_body_2 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_incomplete_request_body_3 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_invalid_LossPercent_in_the_request_body_1 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_invalid_LossPercent_in_the_request_body_2 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_invalid_LossPercent_in_the_request_body_3 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_invalid_IP_value_in_the_request_body_1 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_invalid_IP_CIDR_block_value_in_the_request_body_2 (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_task_lookup_fail (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_task_metadata_fetch_fail (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_task_metadata_unknown_fail (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_fault_injection_disabled (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_invalid_network_mode (0.00s)
    --- PASS: TestStartNetworkPacketLoss/start_network-packet-loss_empty_task_network_config (0.00s)
PASS

New tests cover the changes: in addition to the existing test cases for the handler, the following cases were added specifically for the start endpoint:

  1. When no fault exists on the instance
  2. When a packet loss fault already exists
  3. When a latency fault already exists
  4. When the JSON is malformed
  5. When the 5-second context times out
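The timeout case can be exercised without running real commands if the command execution sits behind an interface. The sketch below illustrates that idea under stated assumptions: `commandRunner`, `realRunner`, `hangingRunner`, `runWithTimeout`, `isTimeout`, and `demoTimeout` are hypothetical names, and the helper methods this PR actually adds to the os/exec wrapper are not shown here. Only the 5-second timeout value comes from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// faultCommandTimeout mirrors the 5-second context mentioned in the test list.
const faultCommandTimeout = 5 * time.Second

// demoTimeout keeps the demonstration fast; it is not a value from the PR.
const demoTimeout = 50 * time.Millisecond

// commandRunner is a hypothetical stand-in for an os/exec wrapper,
// narrowed to the single call this sketch needs.
type commandRunner interface {
	CombinedOutput(ctx context.Context, name string, args ...string) ([]byte, error)
}

// realRunner shells out through os/exec and honors the context deadline.
type realRunner struct{}

func (realRunner) CombinedOutput(ctx context.Context, name string, args ...string) ([]byte, error) {
	return exec.CommandContext(ctx, name, args...).CombinedOutput()
}

// hangingRunner is a test double that blocks until the context ends,
// simulating a command that never returns on its own.
type hangingRunner struct{}

func (hangingRunner) CombinedOutput(ctx context.Context, _ string, _ ...string) ([]byte, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// runWithTimeout runs one command under a deadline and wraps
// context.DeadlineExceeded when the deadline was hit.
func runWithTimeout(r commandRunner, timeout time.Duration, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out, err := r.CombinedOutput(ctx, name, args...)
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return out, fmt.Errorf("%s timed out after %s: %w", name, timeout, context.DeadlineExceeded)
	}
	return out, err
}

// isTimeout reports whether err was caused by the deadline expiring.
func isTimeout(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("production timeout would be", faultCommandTimeout)
	_, err := runWithTimeout(hangingRunner{}, demoTimeout, "tc", "qdisc", "show", "dev", "eth1")
	fmt.Println("timed out:", isTimeout(err))
}
```

In a unit test, a double like `hangingRunner` (or a generated mock) lets the timeout branch run in milliseconds instead of waiting the full five seconds.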

Manual Testing

Launched a Fargate Instance with the changes. Launched a task with ecs-exec enabled.

# Log into the instance as 'su', manually run the tc command to check if there's existing fault:
% nsenter --net=/var/run/netns/f51cb03b1b68477899424be2256218d4-02362b26ecd3 tc q show dev eth1 parent 1:1
# (result was empty)

# Now ecs-exec into the container, curl the start endpoint with the following:
% curl -X PUT \
${ECS_AGENT_URI}/fault/v1/network-packet-loss \
--data '{"lossPercent":6, "Sources":["192.168.0.1"]}'

# And check from the instance again:
% nsenter --net=/var/run/netns/f51cb03b1b68477899424be2256218d4-02362b26ecd3 tc q show dev eth1 parent 1:1
qdisc netem 10: parent 1:1 limit 1000 loss 6%

# Now curl the endpoint again; an error should be returned since a fault has already been injected.
% curl -X PUT ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.1.1"]}'
{"Error":"There is already one network packet loss fault running"}

# Edge case: manually add a latency fault, and call the start packet loss endpoint. An error should be returned as well:
% nsenter --net=/var/run/netns/f51cb03b1b68477899424be2256218d4-02362b26ecd3 tc qdisc add dev eth1 parent 1:1 handle 10: netem delay 100ms 10ms

% curl -X PUT ${ECS_AGENT_URI}/fault/v1/network-packet-loss --data '{"lossPercent":6, "Sources":["192.168.1.1"]}'
{"Error":"There is already one network latency fault running"}
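For reference, the request body used in the curl calls above could be parsed and validated along these lines. This is a sketch under assumptions, not the handler's code: the type `packetLossRequest` and function `parsePacketLossRequest` are hypothetical, the accepted `LossPercent` range of 1 to 100 is assumed rather than taken from this description, and the error messages are illustrative. Accepting either a plain IP or a CIDR block in `Sources` is inferred from the unit test names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net"
)

// packetLossRequest models the body sent to the start endpoint.
// LossPercent is a pointer so a missing field can be told apart from 0.
type packetLossRequest struct {
	LossPercent *int     `json:"LossPercent"`
	Sources     []string `json:"Sources"`
}

// parsePacketLossRequest decodes and validates a request body.
// The 1..100 range check is an assumption made for this sketch.
func parsePacketLossRequest(body []byte) (packetLossRequest, error) {
	var req packetLossRequest
	// encoding/json matches keys case-insensitively, so "lossPercent" is accepted.
	if err := json.Unmarshal(body, &req); err != nil {
		return req, fmt.Errorf("malformed request body: %w", err)
	}
	if req.LossPercent == nil {
		return req, errors.New("missing required field LossPercent")
	}
	if len(req.Sources) == 0 {
		return req, errors.New("missing required field Sources")
	}
	if *req.LossPercent < 1 || *req.LossPercent > 100 {
		return req, fmt.Errorf("invalid LossPercent %d", *req.LossPercent)
	}
	for _, s := range req.Sources {
		if net.ParseIP(s) != nil {
			continue // plain IP address
		}
		if _, _, err := net.ParseCIDR(s); err != nil {
			return req, fmt.Errorf("invalid source %q: not an IP or CIDR block", s)
		}
	}
	return req, nil
}

func main() {
	req, err := parsePacketLossRequest([]byte(`{"lossPercent":6, "Sources":["192.168.0.1"]}`))
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("loss percent:", *req.LossPercent, "sources:", req.Sources)
}
```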

Description for the changelog

Add start network packet loss implementation

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.