cnti-testcatalog / testsuite

Cloud Native Telecom Initiative (CNTI) Test Catalog is a tool to check for and provide feedback on the use of K8s + cloud native best practices in networking applications and platforms
https://wiki.lfnetworking.org/display/LN/Test+Catalog
Apache License 2.0

[BUG] shared_database2 spec test has inconsistent results in actions #2078

Closed svteb closed 3 days ago

svteb commented 1 week ago

Describe the bug

The new runners seem to have issues with the shared_database2 spec test.

Passing run from GitHub actions: https://github.com/cnti-testcatalog/testsuite/actions/runs/9439225077/job/26011260848

Failing run from GitHub actions: https://github.com/cnti-testcatalog/testsuite/actions/runs/9439225077/job/26009846416

The issue stems from the Netstat::K8s.get_multiple_pods_connected_to_mariadb_violators function, which should return the IPs of the two WordPress CNFs connected to the shared MariaDB. Searching the logs for "violators:" shows that one of the WordPress pod IPs is sometimes missing from the result.

Digging deeper into the code, the function self.get_pod_network_info_from_node_via_container_id in lib/k8s_netstat is responsible for detecting the database connections, through this block of code:

# get multiple call for a larger sample
parsed_netstat = (1..10).map {
    sleep 10
    netstat = ClusterTools.exec_by_node("nsenter -t #{pid} -n netstat -n", node_name)
    Log.info { "Container Netstat: #{netstat}" }
    Netstat.parse(netstat["output"])
}.flatten.compact

This code works in a hit-or-miss manner. The netstat command is executed 10 times, with a 10-second sleep before each run, and a database connection is only detected if it shows up in one of those samples. Nothing guarantees that every WordPress pod's connection will appear in at least one sample, as the failing run in GitHub Actions shows.
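To illustrate how the sampling could depend less on luck, here is a minimal Crystal sketch that keeps sampling until every expected peer IP has been seen and gives up only after a maximum number of attempts. The helper poll_until_seen, its parameters, and the canned samples with made-up IPs are hypothetical and not part of the testsuite. In the real code, the block would presumably wrap the existing nsenter/netstat call and Netstat.parse, and the caller would have to know which IPs to expect.

```crystal
# Hypothetical helper, not part of the testsuite: keeps sampling until every
# expected IP has been observed, or until max_attempts samples were taken.
def poll_until_seen(expected : Set(String), max_attempts : Int32,
                    interval : Time::Span, &sample : -> Array(String)) : Set(String)
  seen = Set(String).new
  max_attempts.times do |attempt|
    seen.concat(sample.call)
    # Stop as soon as all expected peers have shown up at least once.
    break if expected.subset_of?(seen)
    # No need to wait after the final attempt.
    sleep interval unless attempt == max_attempts - 1
  end
  seen
end

# Canned samples standing in for parsed netstat output (made-up IPs).
samples = [["10.0.0.5"], [] of String, ["10.0.0.5", "10.0.0.7"], ["10.0.0.9"]]
expected = Set{"10.0.0.5", "10.0.0.7"}

calls = 0
seen = poll_until_seen(expected, 10, 0.seconds) do
  calls += 1
  samples.shift? || [] of String
end

puts "seen=#{seen.to_a.sort} after #{calls} samples"
```

With these canned samples the loop stops at the third sample, once both expected IPs have appeared, instead of always taking all 10. This still relies on a connection being visible during some sample, so it reduces the flakiness rather than removing it.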

To Reproduce

crystal spec --tag shared_database2

Expected behavior

There should be a more consistent approach to detect that a database connection has been made.

Additional context

Possible solutions:

  1. Increase the netstat attempts from 10 to X.
  2. Do a complete overhaul of the detection code by utilizing MariaDB's connection logging (general query log or something else).

I think that for the time being, a quick hack of increasing the number of netstat attempts could be enough to stabilize the GitHub Actions runs.
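For the second option, a rough Crystal sketch of what log-based detection might look like: collect the client hosts from "Connect" entries in the text of a general query log. Everything here is an assumption for illustration. The helper connected_hosts is hypothetical, the sample lines are made up, and the assumed "Connect user@host ..." layout would need to be checked against real MariaDB log output. The log would also have to be enabled first (MariaDB's general_log system variable), and hosts may appear as names rather than IPs depending on the server's name resolution settings.

```crystal
# Hypothetical helper, not part of the testsuite: extracts client hosts from
# "Connect" entries of general-query-log-style text.
# ASSUMPTION: each connection appears as "... Connect <user>@<host> ...".
def connected_hosts(log_text : String) : Set(String)
  hosts = Set(String).new
  log_text.each_line do |line|
    if md = line.match(/\bConnect\s+\S+@(\S+)/)
      hosts << md[1]
    end
  end
  hosts
end

# Made-up sample lines that only approximate the log layout.
sample_log = <<-LOG
  1 Connect  wordpress@10.0.0.5 on wordpress using TCP/IP
  1 Query    SELECT 1
  2 Connect  wordpress@10.0.0.7 on wordpress using TCP/IP
  3 Connect  wordpress@10.0.0.5 on wordpress using TCP/IP
  LOG

hosts = connected_hosts(sample_log)
puts hosts.to_a.sort
```

The appeal of this approach is that a log records each connection when it is made, so a short-lived connection cannot be missed between samples the way it can with periodic netstat calls.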