canonical / knative-operators

Charmed Knative Operators
Apache License 2.0
1 stars 2 forks source link

Knative integration test fail timeout on request to gateway #146

Closed orfeas-k closed 1 year ago

orfeas-k commented 1 year ago

During PR https://github.com/canonical/knative-operators/pull/143, the CI failed after it was merged (succeeded on PR).

I see this error

OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
WARNING  juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
Full logs ``` -------------------------------- live log setup -------------------------------- INFO test_bundle:test_bundle.py:133 Waiting on ksvc to exist INFO test_bundle:test_bundle.py:136 Checking for ksvc CRD INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.20:16443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/services.serving.knative.dev "HTTP/1.1 200 OK" -------------------------------- live log call --------------------------------- INFO httpx:_client.py:1013 HTTP Request: POST https://10.1.0.20:16443/apis/serving.knative.dev/v1/namespaces/knative-test/services "HTTP/1.1 201 Created" INFO test_bundle:test_bundle.py:163 Checking ksvc status INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.20:16443/apis/serving.knative.dev/v1/namespaces/knative-test/services/cloudevents-player "HTTP/1.1 200 OK" INFO test_bundle:test_bundle.py:170 Waiting on ksvc for status to be available INFO test_bundle:test_bundle.py:163 Checking ksvc status INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.20:16443/apis/serving.knative.dev/v1/namespaces/knative-test/services/cloudevents-player "HTTP/1.1 200 OK" INFO test_bundle:test_bundle.py:174 Waiting on ksvc with conditions [] to be Ready INFO test_bundle:test_bundle.py:163 Checking ksvc status INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.20:16[443](https://github.com/canonical/knative-operators/actions/runs/6041554253/job/16394758254#step:4:444)/apis/serving.knative.dev/v1/namespaces/knative-test/services/cloudevents-player "HTTP/1.1 200 OK" INFO test_bundle:test_bundle.py:174 Waiting on ksvc with conditions [{'lastTransitionTime': '2023-08-31T19:42:51Z', 'status': 'True', 'type': 'ConfigurationsReady'}, {'lastTransitionTime': '2023-08-31T19:42:51Z', 'status': 'True', 'type': 'Ready'}, {'lastTransitionTime': '2023-08-31T19:42:51Z', 'status': 'True', 'type': 'RoutesReady'}] to be Ready INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.20:16443/api/v1/namespaces/knative-test/services/istio-ingressgateway-workload "HTTP/1.1 200 OK" FAILED ------------------------------ live log teardown ------------------------------- INFO httpx:_client.py:1013 HTTP Request: DELETE https://10.1.0.20:16443/apis/serving.knative.dev/v1/namespaces/knative-test/services/cloudevents-player "HTTP/1.1 200 OK" tests/test_bundle.py::test_eventing_custom_image -------------------------------- live log setup -------------------------------- WARNING juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting ERROR asyncio:base_events.py:1707 Task exception was never retrieved future: exception=OSError(9, 'Bad file descriptor')> Traceback (most recent call last): File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect res = await connector( File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login await self._connect(endpoints) File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect result = await task File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one return f.result() # May raise f.exception(). File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint return await self._open(endpoint, cacert) File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open return (await websockets.connect( File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__ transport, protocol = await self._creating_connection File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection transport, protocol = await self._create_connection_transport( File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport sock.setblocking(False) OSError: [Errno 9] Bad file descriptor ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed WARNING juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting ERROR asyncio:base_events.py:1707 Task exception was never retrieved future: exception=OSError(9, 'Bad file descriptor')> Traceback (most recent call last): File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect res = await connector( File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login await self._connect(endpoints) File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect result = await task File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one return f.result() # May raise f.exception(). File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint return await self._open(endpoint, cacert) File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open return (await websockets.connect( File "/home/runner/work/knative-operators/knative-operators/.tox/integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__ transport, protocol = await self._creating_connection File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection transport, protocol = await self._create_connection_transport( File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport sock.setblocking(False) OSError: [Errno 9] Bad file descriptor ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed ```

We 've seen this in other repos as well https://github.com/canonical/seldon-core-operator/issues/203.

orfeas-k commented 1 year ago

In the related issue, @i-chvets mentions that:

That error - Bad file descriptior - is an effect of tear down of test environment. Test environment is being removed when tests are finished. In this case integration tests are failing, most likely due resources. This log is the real reason for this issue.

But I don't see any logs in this case indicating an error

orfeas-k commented 1 year ago

Last live log call is

INFO     test_bundle:test_bundle.py:174 Waiting on ksvc with conditions [{'lastTransitionTime': '2023-08-31T19:42:51Z', 'status': 'True', 'type': 'ConfigurationsReady'}, {'lastTransitionTime': '2023-08-31T19:42:51Z', 'status': 'True', 'type': 'Ready'}, {'lastTransitionTime': '2023-08-31T19:42:51Z', 'status': 'True', 'type': 'RoutesReady'}] to be Ready
INFO     httpx:_client.py:1013 HTTP Request: GET https://10.1.0.20:16443/api/v1/namespaces/knative-test/services/istio-ingressgateway-workload "HTTP/1.1 200 OK"
FAILED

which means that failure happened somewhere around here

NohaIhab commented 1 year ago

Conclusion based on debugging session:

the test_cloud_events_player_example test was hanging at the line: post_req = requests.post(url, json=data, headers=headers, allow_redirects=False, verify=False) the url being f"http://cloudevents-player.{ops_test.model_name}.{gateway_ip}.nip.io"

Checking the pods in the cluster, the istio-ingressgateway workload container was not Ready:

knative-test                          istio-ingressgateway-workload-7488f689f4-mzspz   0/1     Running   0

This was a clue as to why we cannot reach the gateway from the test. Through further debugging, it proved to be a problem with istio, filed in https://github.com/canonical/istio-operators/issues/325, check the issue for details.

PR https://github.com/canonical/istio-operators/pull/326 was sent to fix the issue in istio. Once that is merged, we should be able to re-run and pass the knative integration tests.