cloudfoundry / cf-acceptance-tests

CF Acceptance tests
Apache License 2.0

Flaky test: [tcp routing] TCP Routing external ports with a second external port [It] maps both ports to the same application #1173

Open jochenehret opened 3 months ago

jochenehret commented 3 months ago

The TCP Routing test that checks whether one app can be reached on two ports fails frequently. The failing test is here: https://github.com/cloudfoundry/cf-acceptance-tests/blob/6f060209f7a55f0c4f8d0fffabb122c785ce914e/tcp_routing/tcp_routing.go#L131

Example failures:
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/82
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/57
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/113
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/120

I've recreated the test setup manually on fips/snape, and it works as expected: you can send data over two different TCP ports to the test app, and the app responds on both. When run as part of the CATs suite, however, the test often fails.
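For anyone who wants to repeat that kind of manual check, here is a minimal sketch in Go. It is not the CATs helper: `sendAndReceive`, `startStandIn` and `checkTwoPorts` are hypothetical names, and the local stand-in listener (which answers `<id>:<request>`) is only an assumption so that the sketch runs without a CF environment. Against a real deployment you would instead pass the TCP domain and the two external ports to `sendAndReceive`.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// sendAndReceive is a hypothetical helper (not the CATs implementation): it
// dials addr, writes msg, and returns whatever a single Read yields.
func sendAndReceive(addr, msg string, timeout time.Duration) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return "", fmt.Errorf("dial %s: %w", addr, err)
	}
	defer conn.Close()
	// One deadline covers both the write and the read.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return "", err
	}
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", fmt.Errorf("write to %s: %w", addr, err)
	}
	buf := make([]byte, 1024)
	n, err := conn.Read(buf)
	if err != nil {
		return "", fmt.Errorf("read from %s: %w", addr, err)
	}
	return string(buf[:n]), nil
}

// startStandIn listens on an ephemeral local port and answers every
// connection with "<id>:<request>". It only stands in for the test app.
func startStandIn(id string) (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener was closed
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 1024)
				n, err := c.Read(buf)
				if err != nil {
					return
				}
				_, _ = c.Write([]byte(id + ":" + string(buf[:n])))
			}(conn)
		}
	}()
	return ln, nil
}

// checkTwoPorts sends msg to two listeners that share one server ID,
// mimicking two external ports mapped to the same application.
func checkTwoPorts(msg string) (string, string, error) {
	lnA, err := startStandIn("server1")
	if err != nil {
		return "", "", err
	}
	defer lnA.Close()
	lnB, err := startStandIn("server1")
	if err != nil {
		return "", "", err
	}
	defer lnB.Close()

	first, err := sendAndReceive(lnA.Addr().String(), msg, 2*time.Second)
	if err != nil {
		return "", "", err
	}
	second, err := sendAndReceive(lnB.Addr().String(), msg, 2*time.Second)
	if err != nil {
		return "", "", err
	}
	return first, second, nil
}

func main() {
	first, second, err := checkTwoPorts("Time is 42")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first port: ", first)
	fmt.Println("second port:", second)
}
```

Both responses should carry the same server ID if both ports reach the same app. The single `Read` is deliberate: the helper does not wait for the peer to close the connection, so it returns as soon as the first chunk of the response arrives.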

I've added some debug statements with timestamps. Here's the flow from a failed run:

# sending first test message to first port
# https://github.com/cloudfoundry/cf-acceptance-tests/blob/6f060209f7a55f0c4f8d0fffabb122c785ce914e/cats_suite_helpers/cats_suite_helpers.go#L406
starting SendAndReceive(tcp.cf.snape.env.wg-ard.ci.cloudfoundry.org, 1031) at Jul 11 14:49:45.862

# output from test app: https://github.com/cloudfoundry/cf-acceptance-tests/blob/6f060209f7a55f0c4f8d0fffabb122c785ce914e/assets/tcp-listener/main.go#L53
# "10.0.32.11" is one of the two tcp-routers
2024-07-11T12:49:45.97+0000 [APP/PROC/WEB/0] OUT Message to 10.0.32.11:41084: server1:Time is 938260798
2024-07-11T12:49:45.99+0000 [APP/PROC/WEB/0] OUT Jul 11 14:49:45.991 (read) Closing connection to 10.0.32.11:41084: EOF

# sending second test message to other port
starting SendAndReceive(tcp.cf.snape.env.wg-ard.ci.cloudfoundry.org, 1026) at Jul 11 14:49:45.955

# now we are failing here when reading the response:
# https://github.com/cloudfoundry/cf-acceptance-tests/blob/6f060209f7a55f0c4f8d0fffabb122c785ce914e/cats_suite_helpers/cats_suite_helpers.go#L437
Jul 11 14:54:46.575 error3: EOF
buff is:

When the second message is sent, the conn.Write(message) statement returns no error: https://github.com/cloudfoundry/cf-acceptance-tests/blob/6f060209f7a55f0c4f8d0fffabb122c785ce914e/cats_suite_helpers/cats_suite_helpers.go#L417

However, the test app doesn't seem to receive the message: there is no "Message to" log statement for it. What happens next is an error at the conn.Read(buff) statement: https://github.com/cloudfoundry/cf-acceptance-tests/blob/6f060209f7a55f0c4f8d0fffabb122c785ce914e/cats_suite_helpers/cats_suite_helpers.go#L429

The error is "EOF" and the buffer is empty.

This looks like a race condition. Is the Read function perhaps called before the test app starts to write, so that it fails immediately with EOF?
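The symptom itself (Write returns no error, yet the following Read returns EOF with an empty buffer) can be reproduced locally. The Go sketch below is neither the CATs code nor a diagnosis of this issue. It assumes a peer that accepts the connection, consumes the request and then closes without replying, which is only one way such a pattern could arise. The names `probe` and `probeResult` are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// probeResult records what the client side observed.
type probeResult struct {
	writeOK   bool // Write returned a nil error
	readEOF   bool // Read returned io.EOF
	bytesRead int  // number of bytes Read delivered
}

// probe connects to a local listener whose handler swallows the request and
// closes the connection without sending a response (an assumed scenario).
func probe(msg string) (probeResult, error) {
	var res probeResult

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return res, err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		// Consume exactly the request, then hang up without replying.
		_, _ = io.ReadFull(conn, make([]byte, len(msg)))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return res, err
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return res, err
	}

	// A nil error here only means the bytes were handed to the local TCP
	// stack. It says nothing about whether an application processed them.
	_, werr := conn.Write([]byte(msg))
	res.writeOK = werr == nil

	// Read blocks until data arrives, the peer closes, or the deadline hits.
	buf := make([]byte, 1024)
	n, rerr := conn.Read(buf)
	res.bytesRead = n
	res.readEOF = errors.Is(rerr, io.EOF)
	return res, nil
}

func main() {
	res, err := probe("Time is 42")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Printf("write ok: %v, read EOF: %v, bytes read: %d\n",
		res.writeOK, res.readEOF, res.bytesRead)
}
```

The point of the sketch is that a successful `Write` is no evidence that the test app ever saw the message. An `EOF` from `Read` means the connection was closed by the other end of that TCP connection, which in a routed setup need not be the app itself.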

jochenehret commented 3 months ago

The PR was merged 5 days ago, and there have been no failures so far. Let's observe for a few more days before we close this issue.

jochenehret commented 3 months ago

Failing again, multiple times in a row:
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/162
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/163
https://concourse.wg-ard.ci.cloudfoundry.org/teams/main/pipelines/cf-deployment/jobs/fips-cats/builds/164

Looks like our connection handling is not working correctly...