frequenz-floss / frequenz-sdk-python

Frequenz Python Software Development Kit (SDK)
https://frequenz-floss.github.io/frequenz-sdk-python/
MIT License
13 stars 17 forks source link

Stream reset by remote party when calling `api.metadata()` using IPv6 #1003

Closed shsms closed 1 month ago

shsms commented 1 month ago

What happened?

When connecting to an ipv6 address like grpc://[::1]:8800, the call to initialize fails with:

grpclib.exceptions.StreamTerminatedError: Stream reset by remote party, error_code: 1

The error is produced when trying to retrieve the microgrid metadata.

Click to expand the complete backtrace ``` Traceback (most recent call last): File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/grpclib/client.py", line 368, in recv_initial_metadata headers = await self._stream.recv_headers() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/grpclib/protocol.py", line 342, in recv_headers await self.headers_received.wait() File "/usr/lib/python3.11/asyncio/locks.py", line 213, in wait await fut asyncio.exceptions.CancelledError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "", line 198, in _run_module_as_main File "", line 88, in _run_code File "/home/sahas/code/forecast-actor/main/src/frequenz/app/power_forecast.py", line 92, in main() # pylint: disable=no-value-for-parameter ^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/main/src/frequenz/app/power_forecast.py", line 88, in main asyncio.run(_run(config_path)) File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main) ^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete return future.result() ^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/main/src/frequenz/app/power_forecast.py", line 49, in _run await _initialize_microgrid() File "/home/sahas/code/forecast-actor/main/src/frequenz/app/power_forecast.py", line 38, in _initialize_microgrid await microgrid.initialize( File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/frequenz/sdk/microgrid/__init__.py", line 252, in initialize await connection_manager.initialize(server_url) File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/frequenz/sdk/microgrid/connection_manager.py", line 193, in initialize await microgrid_api._initialize() # pylint: disable=protected-access ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/frequenz/sdk/microgrid/connection_manager.py", line 162, in _initialize self._metadata = await self._api.metadata() ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/frequenz/client/microgrid/_client.py", line 137, in metadata microgrid_metadata = await self.api.get_microgrid_metadata( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/frequenz/microgrid/betterproto/frequenz/api/microgrid/__init__.py", line 279, in get_microgrid_metadata return await self._unary_unary( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/betterproto/grpc/grpclib_client.py", line 85, in _unary_unary response = await stream.recv_message() ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/grpclib/client.py", line 425, in recv_message await self.recv_initial_metadata() File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/grpclib/client.py", line 367, in recv_initial_metadata with self._wrapper: File "/home/sahas/code/forecast-actor/venv/lib/python3.11/site-packages/grpclib/utils.py", line 70, in __exit__ raise self._error grpclib.exceptions.StreamTerminatedError: Stream reset by remote party, error_code: 1 ```

This is not an issue when connecting with ipv4 grpc://127.0.0.1:8800.

Both ipv4 and ipv6 worked in SDK 1.0.0-rc601. This is an issue only in rc700.

What did you expect instead?

microgrid.initialize returns successfully when using an ipv6 address.

Affected version(s)

v1.0.0-rc700

Affected part(s)

Microgrid (API, component graph, etc.) (part:microgrid)

Extra information

No response

llucax commented 1 month ago

Having a look. gprclib seems to be forwarding the host to https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.create_connection. I'm trying to find out at which level this is failing.

llucax commented 1 month ago

After quite a bit of investigation, I can't reproduce, at least with this synthetic test:

server.py

import asyncio
from typing import AsyncIterator

from frequenz.microgrid.betterproto.frequenz.api.microgrid import (
    MicrogridBase,
    MicrogridMetadata,
)
from grpclib.server import Server

async def start_server():
    HOST = "::1"
    PORT = 62060
    server = Server([MicrogridBase()])
    await server.start(HOST, PORT)
    print(f"Serving on {HOST}:{PORT}")
    await server.wait_closed()

asyncio.run(start_server())

client.py:

import asyncio
import logging
from datetime import timedelta

from frequenz.channels import merge

from frequenz.sdk import microgrid
from frequenz.sdk.actor import ResamplerConfig

async def main() -> None:
    """Create the battery pool, activate all formulas and listen for any update."""
    logging.basicConfig(
        level=logging.DEBUG, format="%(asctime)s %(name)s %(levelname)s:%(message)s"
    )
    # Reduce undesired noise from hpack
    logging.getLogger("hpack.hpack").setLevel(logging.INFO)

    await microgrid.initialize(
        f"grpc://[::1]:62060",
        resampler_config=ResamplerConfig(resampling_period=timedelta(seconds=1.0)),
    )

    battery_pool = microgrid.new_battery_pool(priority=5)
    receivers = [
        battery_pool.soc.new_receiver(limit=1),
        battery_pool.capacity.new_receiver(limit=1),
        # pylint: disable=protected-access
        battery_pool._system_power_bounds.new_receiver(limit=1),
        # pylint: enable=protected-access
    ]

    async for metric in merge(*receivers):
        print(f"Received new metric: {metric}")

asyncio.run(main())

I get the no implemented errors, but the connection is made without issues.

$ python server.py 
Serving on ::1:62060
$ python client.py
2024-07-16 14:28:30,851 frequenz.sdk.microgrid.connection_manager INFO:Connecting to microgrid at grpc://[::1]:62060
2024-07-16 14:28:30,853 hpack.table DEBUG:Resizing header table to 4096 from 4096
2024-07-16 14:28:30,854 frequenz.client.microgrid._client ERROR:The microgrid metadata is not available.
[...]
    raise ClientError.from_grpc_error(
frequenz.client.microgrid._exception.OperationNotImplemented: Failed calling 'list_components' on 'grpc://[::1]:62060': The operation is not implemented or not supported/enabled in this service <status=UNIMPLEMENTED>
llucax commented 1 month ago

I can now reproduce this using socat to tunnel using IPv6 to the sandbox with this code:

import asyncio
import sys
from datetime import timedelta
from frequenz.sdk import microgrid
from frequenz.sdk.actor import ResamplerConfig

async def main(host: str) -> None:
    try:
        await microgrid.initialize(
            f"grpc://{host}:62060",
            resampler_config=ResamplerConfig(resampling_period=timedelta(seconds=1.0)),
        )
    except Exception as e:
        print(f"Error while initializing microgrid: {e}")

asyncio.run(main(sys.argv[1]))

Works with:

Fails with:

I hope socat is not interfering with the results.

shsms commented 1 month ago

You could confirm by using sdk rc601 to connect to socat's ipv6 endpoint.

llucax commented 1 month ago

I will do that next, but for now, it seems like the issue is related to the server implementation (or the data provided by the server). This test with a server built with grpclib works:

server.py

import asyncio

from betterproto.lib.google.protobuf import Empty
from frequenz.microgrid.betterproto.frequenz.api.common.components import (
    ComponentCategory,
)
from frequenz.microgrid.betterproto.frequenz.api.microgrid import (
    Component,
    ComponentFilter,
    ComponentList,
    Connection,
    ConnectionFilter,
    ConnectionList,
    MicrogridBase,
    MicrogridMetadata,
)
from grpclib.server import Server

class MicrogridService(MicrogridBase):

    async def get_microgrid_metadata(
        self,
        betterproto_lib_google_protobuf_empty: Empty,
    ) -> MicrogridMetadata:
        response = MicrogridMetadata(microgrid_id=1)
        print(f"Returning metadata: {response}")
        return response

    async def list_components(self, component_filter: ComponentFilter) -> ComponentList:
        response = ComponentList(
            components=[
                Component(id=1, category=ComponentCategory.COMPONENT_CATEGORY_GRID),
                Component(id=2, category=ComponentCategory.COMPONENT_CATEGORY_BATTERY),
            ]
        )
        print(f"Returning components: {response}")
        return response

    async def list_connections(
        self, connection_filter: ConnectionFilter
    ) -> ConnectionList:
        response = ConnectionList(connections=[Connection(1, 2)])
        print(f"Returning connections: {response}")
        return response

async def start_server():
    HOST = "::1"
    PORT = 62060
    server = Server([MicrogridService()])
    await server.start(HOST, PORT)
    print(f"Serving on {HOST}:{PORT}")
    await server.wait_closed()

asyncio.run(start_server())

Running python test.py ::1 with the same client works.

llucax commented 1 month ago

You could confirm by using sdk rc601 to connect to socat's ipv6 endpoint.

Yeah, I can confirm rc601 works well with IPv6 and socat.

shsms commented 1 month ago

I will do that next, but for now, it seems like the issue is related to the server implementation (or the data provided by the server). This test with a server built with grpclib works:

Do you have any reason for thinking it is related to the server implementation? If not, I'd think it is more likely to be grpclib. The server implementations use tonic, which is a 3rd party grpc library for rust. And it works with the grpcio library from google on python. Both of these libraries have much more contributors and users than grpclib.

I think a fix has to come from our side, not the server side.

llucax commented 1 month ago

Do you have any reason for thinking it is related to the server implementation? If not, I'd think it is more likely to be grpclib. The server implementations use tonic, which is a 3rd party grpc library for rust. And it works with the grpcio library from google on python. Both of these libraries have much more contributors and users than grpclib.

I think both, I mean, I think is grpclib having issues with a particular server, because as I show in the test above, it does work when implementing a dummy server using grpclib, so grpclib-grpclib seems to work.

I'm not trying to blame the server, I'm just pointing out that the grpclib-tonic seems to be problematic (maybe, maybe it is something else and it only manifest there for some reason). I'm just providing more information to help debugging this and find the root cause. Knowing this, we can compare the data interchanged between the working and not working cases to find out what might be causing the connection reset.

llucax commented 1 month ago

Doing some captures, one difference I can see is tonic seems to be replying with some sort of GRPCHTTP2 protocol, while grpclib uses plain GRPC. Not sure what this means, this is what wireshark reports when dissecting the packages. And using IPv4 is the same (tonic replies with GRPCHTTP2) but it works.

llucax commented 1 month ago

OK, I confirmed this is an issue with grpc when communicating to tonic, by doing some hello world example. I will report upstream.

llucax commented 1 month ago

Let's see:

llucax commented 1 month ago

Marking as blocked as we can't do much until this is fixed upstream...