canonical / charmed-openstack-upgrader

Automatic upgrade tool for Charmed Openstack
Apache License 2.0
3 stars 12 forks source link

[Bug] COU stuck to generate plan. model.run_action stuck #546

Closed jneo8 closed 1 month ago

jneo8 commented 2 months ago

Issue description

COU generate data plan step stuck forever after we counter error on verify_cloud step.

Everything working fine if I manually remove verify_cloud step

On generate data plane step, the juju.unit.run_action function gets stuck if the juju.model.wait_for_idle function raise an exception. I suspect this issue relates to the asyncio package.

The exception raise on juju.model.wait_for_idle is relate to networking issue, which can't be avoid.

Why this not happen before?

Prior to commit 4960e6e4ba7d79969f95535402d47d5d28624799, wait_for_idle was never called before generating the plan and we make COU failed every time when wait_for_idle failed.

A simple code to reproduce the error:

import asyncio
import logging
from juju.model import Model

logging.basicConfig(
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO  # Global log level
)
logging.getLogger('requests').setLevel(logging.INFO)

async def main():
    model = Model(max_frame_size=2**30)
    async with model as conn:
        try:
            # Open more than 30 tasks to wait for idle
            # This will cause the exception
            # websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received
            tasks = [
                model.wait_for_idle(
                    apps=[app],
                    wait_for_at_least_units=0,
                )
                for app in model.applications
            ]
            await asyncio.gather(*tasks)
        except Exception as err:
            pass
            # logging.error(err)

        app = model.applications.get("nova-compute")
        unit = app.units[0]
        # Here will raise asyncio.exceptions.CancelledError
        count = await unit.run_action("instance-count")
        logging.info(count)

if __name__ == "__main__":
    asyncio.run(main())

And the error message is:

Traceback (most recent call last):
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 953, in transfer_data
    message = await self.read_message()
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1023, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1098, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1155, in read_frame
    frame = await Frame.read(
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/framing.py", line 100, in read
    data = await reader(length)
  File "/home/jneo8/.rye/py/cpython@3.10.13/install/lib/python3.10/asyncio/streams.py", line 708, in readexactly
    await self._wait_for_data('readexactly')
  File "/home/jneo8/.rye/py/cpython@3.10.13/install/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/./main.py", line 38, in <module>
    asyncio.run(main())
  File "/home/jneo8/.rye/py/cpython@3.10.13/install/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/jneo8/.rye/py/cpython@3.10.13/install/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/./main.py", line 33, in main
    count = await unit.run_action("instance-count")
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/juju/unit.py", line 292, in run_action
    res = await op(actions=[client.Action(
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/juju/client/facade.py", line 486, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/juju/client/_client7.py", line 313, in EnqueueOperation
    reply = await self.rpc(msg)
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/juju/client/facade.py", line 659, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/juju/client/connection.py", line 649, in rpc
    await self._ws.send(outgoing)
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 628, in send
    await self.ensure_open()
  File "/home/jneo8/CanonicalProjects/charmed-openstack-upgrader/.venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 929, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received
jneo8 commented 1 month ago

Update description

jneo8 commented 1 month ago

This error happen only you failed on wait for idle then retry pass, so it's hard to reproduce.

But after the wait_for_idle tuning PR #573 merged, we should not encounter websockets.exceptions.ConnectionClosedError anymore. So we can resolve this one after #573 merged.

jneo8 commented 1 month ago

Since #573 is merged. Close this issue as resolved.