This PR introduces start and stop network black hole port fault injection in the ecs-agent directory. It does so by running iptables commands via os/exec.
Implementation details
We will be adding two new functions, startNetworkBlackholePort() and stopNetworkBlackHolePort(), into the ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go file.
startNetworkBlackholePort(): This function is responsible for starting/injecting a new network black hole port fault with the traffic type, protocol, and port number passed in the request body. It is called in StartNetworkBlackholePort(). The general workflow of this function is as follows:
Checks that there is no chain already running with the specified protocol and port number via checkNetworkBlackHolePort()
Creates a new chain via iptables -N <chain> (the chain name is in the form of "<trafficType>-<protocol>-<port>", e.g. "egress-tcp-1234")
Appends a new rule to the newly created chain via iptables -A <chain> -p <protocol> --dport <port> -j DROP
Inserts the newly created chain into the built-in INPUT/OUTPUT chain
stopNetworkBlackHolePort(): This function is responsible for stopping a specific network black hole port fault with the traffic type, protocol, and port number passed in the request body. It is called in StopNetworkBlackHolePort(). The general workflow of this function is as follows:
Checks if there's a running chain with the specified protocol and port number via checkNetworkBlackHolePort()
Clears all rules within the specific chain via iptables -F <chain>
Removes the specific chain from the built-in INPUT/OUTPUT chain via iptables -D <INPUT/OUTPUT> -j <chain>
Deletes the specific chain via iptables -X <chain>
Similar to CheckNetworkBlackHolePort(), the StartNetworkBlackholePort() and StopNetworkBlackHolePort() handler functions will also perform the following checks before responding to the request:
If either startNetworkBlackholePort() or stopNetworkBlackHolePort() takes too long to finish, we respond with a 500 and a "request timed out" error message.
If any of the iptables commands in startNetworkBlackholePort() or stopNetworkBlackHolePort() fails, we respond with a 500 and the standard error output of the failing iptables command.
Testing
New unit test cases were added to generateStartBlackHolePortFaultTestCases and generateStopBlackHolePortFaultTestCases with mock exec expectation calls. Existing test cases now also have the correct mock exec expectation calls.
Renamed generateNetworkBlackHolePortTestCases to generateCommonNetworkBlackHolePortTestCases
This PR will also refactor the existing tests in agent/handlers/task_server_setup_test.go to only test whether we can make successful requests to each of the fault injection TMDS endpoints. The deleted tests are already covered in ecs-agent/tmds/handlers/fault/v1/handlers/handlers_test.go.
Manual Testing:
Hooked up the fault injection handlers to also register on TMDS server startup, then ran an AWSVPC task that calls all three BHP endpoints (start -> check status -> stop BHP fault).
level=debug time=2024-09-20T00:56:51Z msg="Handling http request" method="PUT" from="169.254.172.2:42200"
level=info time=2024-09-20T00:56:51Z msg="Received new request for request type: start network-blackhole-port" request="{\"Protocol\":\"tcp\",\"TrafficType\":\"egress\",\"Port\":1234}" requestType="start network-blackhole-port" tmdsEndpointContainerID="f4645575-7c7f-49b9-b605-38854d1f1775"
level=info time=2024-09-20T00:56:51Z msg="[INFO] Black hole port fault is not running" netns="/host/proc/25803/ns/net" command="nsenter --net=/host/proc/25803/ns/net iptables -C egress-tcp-1234 -p tcp --dport 1234 -j DROP" output="iptables: Bad rule (does a matching rule exist in that chain?).\n" exitCode=1
level=info time=2024-09-20T00:56:51Z msg="[INFO] Attempting to start network black hole port fault" netns="/host/proc/25803/ns/net" chain="egress-tcp-1234"
level=info time=2024-09-20T00:56:51Z msg="Successfully started fault" requestType="start network-blackhole-port" request="{\"Port\":1234,\"Protocol\":\"tcp\",\"TrafficType\":\"egress\"}" response="{\"Status\":\"running\"}"
level=debug time=2024-09-20T00:57:00Z msg="Storage stats not reported for container" module=utils_unix.go
level=debug time=2024-09-20T00:57:01Z msg="Handling http request" method="GET" from="169.254.172.2:59142"
level=info time=2024-09-20T00:57:01Z msg="Received new request for request type: check status network-blackhole-port" requestType="check status network-blackhole-port" tmdsEndpointContainerID="f4645575-7c7f-49b9-b605-38854d1f1775" request="{\"Protocol\":\"tcp\",\"TrafficType\":\"egress\",\"Port\":1234}"
level=debug time=2024-09-20T00:57:01Z msg="Successfully parsed fault request payload" request="{\"Port\":1234,\"Protocol\":\"tcp\",\"TrafficType\":\"egress\"}"
level=info time=2024-09-20T00:57:01Z msg="[INFO] Black hole port fault has been found running" netns="/host/proc/25803/ns/net" command="nsenter --net=/host/proc/25803/ns/net iptables -C egress-tcp-1234 -p tcp --dport 1234 -j DROP" output=""
level=info time=2024-09-20T00:57:01Z msg="[INFO] Successfully checked status for fault" requestType="check status network-blackhole-port" request="{\"Port\":1234,\"Protocol\":\"tcp\",\"TrafficType\":\"egress\"}" response="{\"Status\":\"running\"}"
level=debug time=2024-09-20T00:57:05Z msg="Received message of type: HeartbeatMessage"
level=debug time=2024-09-20T00:57:05Z msg="ACS activity occurred"
level=debug time=2024-09-20T00:57:05Z msg="Sending response to ACS" Name="heartbeat message responder" Response={
MessageId: "fd8a0b80-f7e0-41a9-82bd-8d20450c03fa"
}
level=debug time=2024-09-20T00:58:01Z msg="Handling http request" method="DELETE" from="169.254.172.2:52668"
level=info time=2024-09-20T00:58:01Z msg="Received new request for request type: stop network-blackhole-port" request="{\"Protocol\":\"tcp\",\"TrafficType\":\"egress\",\"Port\":1234}" requestType="stop network-blackhole-port" tmdsEndpointContainerID="f4645575-7c7f-49b9-b605-38854d1f1775"
level=debug time=2024-09-20T00:58:01Z msg="Successfully parsed fault request payload" request="{\"Port\":1234,\"Protocol\":\"tcp\",\"TrafficType\":\"egress\"}"
level=info time=2024-09-20T00:58:01Z msg="[INFO] Black hole port fault has been found running" netns="/host/proc/25803/ns/net" command="nsenter --net=/host/proc/25803/ns/net iptables -C egress-tcp-1234 -p tcp --dport 1234 -j DROP" output=""
level=info time=2024-09-20T00:58:01Z msg="[INFO] Attempting to stop network black hole port fault" netns="/host/proc/25803/ns/net" chain="egress-tcp-1234"
level=info time=2024-09-20T00:58:01Z msg="Successfully stopped fault" request="{\"Port\":1234,\"Protocol\":\"tcp\",\"TrafficType\":\"egress\"}" response="{\"Status\":\"stopped\"}" requestType="stop network-blackhole-port"
Corresponding iptables output in task ENI/network namespace
Same test but using Host mode task
Corresponding iptables on host network namespace
New tests cover the changes: yes
Description for the changelog
Feature: Adding start and stop network black hole port fault implementation
Additional Information
Does this PR include breaking model changes? If so, have you added transformation functions?
Does this PR include the addition of new environment variables in the README?
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.