praxis

as above, so below
https://src.eco
MIT License

Experts fail to initialize > 50% of the time #2

Closed Vectorrent closed 1 month ago

Vectorrent commented 2 months ago


I have no idea why this happens. Even when bootstrapping from a local DHT node, initialization may fail with all kinds of errors:

Sep 04 06:37:01.905 [INFO] Server started with 3 modules:
Sep 04 06:37:01.905 [INFO] expert.0: PraxisMLP, 525568 parameters
Sep 04 06:37:01.905 [INFO] expert.1: PraxisMLP, 525568 parameters
Sep 04 06:37:01.905 [INFO] expert.2: PraxisMLP, 525568 parameters
Sep 04 06:37:01.936 [ERROR] [hivemind.moe.server.connection_handler._run:63] ConnectionHandler failed to start:
Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 86, in bytes_iter
    proto = protocol_with_code(code)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/protocols.py", line 290, in protocol_with_code
    return REGISTRY.find_by_code(code)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/protocols.py", line 260, in find_by_code
    raise exceptions.ProtocolNotFoundError(code, "code")
multiaddr.exceptions.ProtocolNotFoundError: No protocol with code 465 found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/server/connection_handler.py", line 59, in _run
    self._p2p = await self.dht.replicate_p2p()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/dht/dht.py", line 327, in replicate_p2p
    self._p2p_replica = await P2P.replicate(daemon_listen_maddr)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/p2p/p2p_daemon.py", line 312, in replicate
    await self._ping_daemon()
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/p2p/p2p_daemon.py", line 317, in _ping_daemon
    logger.debug(f"Launched p2pd with peer id = {self.peer_id}, host multiaddrs = {self._visible_maddrs}")
                                                                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/multiaddr.py", line 147, in __repr__
    return "<Multiaddr %s>" % str(self)
                              ^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/multiaddr.py", line 135, in __str__
    return bytes_to_string(self._bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 30, in bytes_to_string
    for _, proto, codec, part in bytes_iter(buf):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 89, in bytes_iter
    raise exceptions.BinaryParseError(
multiaddr.exceptions.BinaryParseError: Invalid binary MultiAddr protocol 465: Unknown Protocol
Sep 04 06:37:01.940 [ERROR] [hivemind.utils.mpfuture._process_updates_in_background:198] Could not retrieve update: caught TypeError("BinaryParseError.__init__() missing 2 required positional arguments: 'binary' and 'protocol'") (pid=242958)
Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/utils/mpfuture.py", line 177, in _process_updates_in_background
    uid, update_type, payload = receiver_pipe.recv()
                                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BinaryParseError.__init__() missing 2 required positional arguments: 'binary' and 'protocol'
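The first traceback bottoms out in multiaddr's binary parser. A binary multiaddr is a sequence of components, each prefixed with an unsigned-varint protocol code that must exist in the parser's registry; code 465 appears to be a newer protocol code (likely `webtransport`) that the installed py-multiaddr does not recognize. A toy sketch of why a single unknown code aborts the whole parse (the two-entry `REGISTRY` stands in for the real protocol table, and the real parser also consumes each component's address payload, omitted here):

```python
REGISTRY = {4: "ip4", 6: "tcp"}  # tiny stand-in for the real protocol table

def read_varint(buf: bytes, offset: int):
    """Decode one unsigned LEB128 varint, as multiaddr does."""
    result, shift = 0, 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, offset
        shift += 7

def protocol_names(buf: bytes):
    """Map each varint code to a name; one unknown code kills the parse."""
    names, offset = [], 0
    while offset < len(buf):
        code, offset = read_varint(buf, offset)
        if code not in REGISTRY:
            raise ValueError(f"No protocol with code {code} found")
        names.append(REGISTRY[code])
    return names
```

465 encodes as the varint bytes `0xD1 0x03`, so once the daemon advertises such an address, every later `str()`/`repr()` of the multiaddr raises, which is exactly what the ConnectionHandler hits while logging.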

I was able to make this problem less frequent by adding a delay to startup, but that workaround is unreliable at best.

Could use some help with this one. I've been running into issues like this in Hivemind for years.
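The secondary mpfuture error in the log is a separate artifact: the original BinaryParseError cannot cross the process boundary. Pickle reconstructs an exception by calling its class with `exc.args`, so an exception whose `__init__` requires positional arguments that never land in `args` fails to unpickle with exactly this kind of TypeError. A hypothetical minimal reproduction (`BadError` is an invented stand-in, not the real BinaryParseError):

```python
import pickle

class BadError(Exception):
    """Mimics an exception whose __init__ signature breaks unpickling."""
    def __init__(self, message, binary, protocol):
        # Only `message` lands in self.args, but unpickling calls
        # BadError(*args) and therefore misses `binary` and `protocol`.
        super().__init__(message)
        self.binary = binary
        self.protocol = protocol

err = BadError("Invalid binary MultiAddr", b"\xd1\x03", 465)
payload = pickle.dumps(err)  # serializing succeeds...
try:
    pickle.loads(payload)    # ...but reconstruction raises TypeError
except TypeError as exc:
    print(exc)  # complains about the missing positional arguments
```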

Vectorrent commented 2 months ago

https://github.com/user-attachments/assets/7c97f2ca-3a2e-43f3-8cdb-defa4b9190e0

Example

Vectorrent commented 1 month ago

I found a solution to this problem. Long story short: if you call dht.get_visible_maddrs() before attempting to start the server, initialization never hangs. Clearly this is not intended behavior; the method should have no bearing on server bootstrapping, but it does. So we fixed it with a hack until upstream addresses the issue.
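For anyone hitting the same race, the workaround amounts to touching the DHT before bootstrapping experts. A minimal sketch, assuming a hivemind-style `dht` object; `start_server` is a hypothetical callable standing in for however the moe server is actually constructed:

```python
def warmed_up_start(dht, start_server):
    """Work around the init race: force the DHT to materialize its
    p2p/multiaddr state before the server's ConnectionHandler needs it."""
    # Calling get_visible_maddrs() round-trips to the p2p daemon, which
    # (empirically) prevents the ProtocolNotFoundError failures on startup.
    visible_maddrs = dht.get_visible_maddrs()
    return start_server(dht, visible_maddrs)
```

In practice this is one extra line before launching the server; the return value of get_visible_maddrs() doesn't even need to be used.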