bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

[might be bug?] Failed to connect to bootstrap peers when using docker image on truenas scale #511

Open TomLBZ opened 1 year ago

TomLBZ commented 1 year ago
2023-09-16 10:15:31.031074+00:00Sep 16 10:15:31.030 [INFO] Running Petals 2.2.0
2023-09-16 10:15:31.349212+00:00/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
2023-09-16 10:15:31.349257+00:00warnings.warn(
2023-09-16 10:15:33.018599+00:00Downloading (…)lve/main/config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 609/609 [00:00<00:00, 4.34MB/s]
2023-09-16 10:15:33.021225+00:00Sep 16 10:15:33.021 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
2023-09-16 10:15:33.021283+00:00Sep 16 10:15:33.021 [INFO] Using DHT prefix: Llama-2-70b-hf
2023-09-16 10:15:33.021870+00:00/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py:485: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
2023-09-16 10:15:33.021890+00:00warnings.warn(
2023-09-16 10:15:43.745860+00:00Traceback (most recent call last):
2023-09-16 10:15:43.745925+00:00File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2023-09-16 10:15:43.746035+00:00return _run_code(code, main_globals, None,
2023-09-16 10:15:43.746086+00:00File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
2023-09-16 10:15:43.746102+00:00exec(code, run_globals)
2023-09-16 10:15:43.746113+00:00File "/home/petals/src/petals/cli/run_server.py", line 235, in <module>
2023-09-16 10:15:43.746186+00:00main()
2023-09-16 10:15:43.746204+00:00File "/home/petals/src/petals/cli/run_server.py", line 219, in main
2023-09-16 10:15:43.746299+00:00server = Server(
2023-09-16 10:15:43.746313+00:00File "/home/petals/src/petals/server/server.py", line 138, in __init__
2023-09-16 10:15:43.746400+00:00is_reachable = check_direct_reachability(initial_peers=initial_peers, use_relay=False, **kwargs)
2023-09-16 10:15:43.746416+00:00File "/home/petals/src/petals/server/reachability.py", line 78, in check_direct_reachability
2023-09-16 10:15:43.746454+00:00return RemoteExpertWorker.run_coroutine(_check_direct_reachability())
2023-09-16 10:15:43.746473+00:00File "/opt/conda/lib/python3.10/site-packages/hivemind/moe/client/remote_expert_worker.py", line 36, in run_coroutine
2023-09-16 10:15:43.751352+00:00return future if return_future else future.result()
2023-09-16 10:15:43.751381+00:00File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
2023-09-16 10:15:43.751782+00:00return self.__get_result()
2023-09-16 10:15:43.751811+00:00File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2023-09-16 10:15:43.751840+00:00raise self._exception
2023-09-16 10:15:43.751873+00:00File "/home/petals/src/petals/server/reachability.py", line 59, in _check_direct_reachability
2023-09-16 10:15:43.751897+00:00target_dht = await DHTNode.create(client_mode=True, **kwargs)
2023-09-16 10:15:43.751907+00:00File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/node.py", line 192, in create
2023-09-16 10:15:43.752325+00:00p2p = await P2P.create(**kwargs)
2023-09-16 10:15:43.752358+00:00File "/opt/conda/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon.py", line 234, in create
2023-09-16 10:15:43.752725+00:00await asyncio.wait_for(ready, startup_timeout)
2023-09-16 10:15:43.752748+00:00File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2023-09-16 10:15:43.753069+00:00return fut.result()
2023-09-16 10:15:43.753083+00:00hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2023/09/16 10:15:43 failed to connect to bootstrap peers

I tried to host the Docker container on TrueNAS SCALE, but it failed with the error above. Might this be a bug?

redcap3000 commented 1 year ago

Having the same problem on Linux when attempting to connect to a private swarm.

File "/home/rcap3/anaconda3/lib/python3.11/site-packages/hivemind/dht/node.py", line 192, in create
Sep 17 17:55:02 i7ubuntu python[322820]: p2p = await P2P.create(**kwargs)
Sep 17 17:55:02 i7ubuntu python[322820]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 17 17:55:02 i7ubuntu python[322820]: File "/home/rcap3/anaconda3/lib/python3.11/site-packages/hivemind/p2p/p2p_daemon.py", line 234, in create
Sep 17 17:55:02 i7ubuntu python[322820]: await asyncio.wait_for(ready, startup_timeout)
Sep 17 17:55:02 i7ubuntu python[322820]: File "/home/rcap3/anaconda3/lib/python3.11/asyncio/tasks.py", line 479, in wait_for
Sep 17 17:55:02 i7ubuntu python[322820]: return fut.result()
Sep 17 17:55:02 i7ubuntu python[322820]: ^^^^^^^^^^^^
Sep 17 17:55:02 i7ubuntu python[322820]: hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2023/09/17 17:55:02 failed to connect to bootstrap peers

edugamerplay1228 commented 1 year ago

2023-09-19 16:43:06.190015: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Sep 19 16:43:07.354 [INFO] Running Petals 2.2.0
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Downloading (…)lve/main/config.json: 100% 610/610 [00:00<00:00, 2.74MB/s]
Sep 19 16:43:07.883 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Sep 19 16:43:07.884 [INFO] Using DHT prefix: Llama-2-13b-hf
/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py:485: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/petals/cli/run_server.py", line 235, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/petals/cli/run_server.py", line 219, in main
server = Server(
File "/usr/local/lib/python3.10/dist-packages/petals/server/server.py", line 138, in __init__
is_reachable = check_direct_reachability(initial_peers=initial_peers, use_relay=False, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/petals/server/reachability.py", line 78, in check_direct_reachability
return RemoteExpertWorker.run_coroutine(_check_direct_reachability())
File "/usr/local/lib/python3.10/dist-packages/hivemind/moe/client/remote_expert_worker.py", line 36, in run_coroutine
return future if return_future else future.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/dist-packages/petals/server/reachability.py", line 59, in _check_direct_reachability
target_dht = await DHTNode.create(client_mode=True, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/hivemind/dht/node.py", line 192, in create
p2p = await P2P.create(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/hivemind/p2p/p2p_daemon.py", line 234, in create
await asyncio.wait_for(ready, startup_timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2023/09/19 16:43:13 failed to connect to bootstrap peers

borzunov commented 1 year ago

Hi @TomLBZ @redcap3000 @edugamerplay1228,

This may be an issue with DNS/IPv6 addresses present among the default bootstrap peers. Can you please try again with this option (this uses IPv4 addresses only)?

--initial_peers /ip4/159.89.214.152/tcp/31337/p2p/QmedTaZXmULqwspJXz44SsPZyTNKxhnnFvYRajfH7MGhCY /ip4/159.203.156.48/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5
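Before retrying, it may help to confirm that the machine (or container) can open a plain TCP connection to these peers at all. The following is a minimal diagnostic sketch, not part of Petals or hivemind: the helper names (`parse_ip4_tcp_multiaddr`, `ip4_only`, `can_connect`, `check_peers`) are made up for illustration, and the parser assumes the exact `/ip4/<host>/tcp/<port>/p2p/<peer_id>` shape used in the option above. A successful TCP connection only shows that the port is reachable; it does not prove that the libp2p handshake will succeed.

```python
import socket
from typing import Dict, List, Tuple


def parse_ip4_tcp_multiaddr(maddr: str) -> Tuple[str, int, str]:
    """Split '/ip4/<host>/tcp/<port>/p2p/<peer_id>' into (host, port, peer_id)."""
    parts = maddr.strip().split("/")
    # The leading '/' yields an empty first element:
    # ['', 'ip4', host, 'tcp', port, 'p2p', peer_id]
    ok = (
        len(parts) == 7
        and parts[0] == ""
        and parts[1] == "ip4"
        and parts[3] == "tcp"
        and parts[5] == "p2p"
    )
    if not ok:
        raise ValueError(f"not an /ip4/.../tcp/.../p2p/... multiaddr: {maddr!r}")
    return parts[2], int(parts[4]), parts[6]


def ip4_only(maddrs: List[str]) -> List[str]:
    """Keep only multiaddrs that start with a literal IPv4 address."""
    return [m for m in maddrs if m.startswith("/ip4/")]


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable networks
        return False


def check_peers(maddrs: List[str], timeout: float = 5.0) -> Dict[str, bool]:
    """Map each IPv4 multiaddr to whether its TCP port is reachable from this host."""
    results = {}
    for maddr in ip4_only(maddrs):
        host, port, _peer_id = parse_ip4_tcp_multiaddr(maddr)
        results[maddr] = can_connect(host, port, timeout)
    return results
```

Calling `check_peers([...])` with the two addresses from the option above, from inside the same container or WSL2 environment that runs the server, would show whether outbound connections to those ports are being blocked before the daemon even starts.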

hrQAQ commented 1 year ago

Hello @borzunov ,

I encountered a similar issue on Windows with WSL2 while attempting to connect to my own private swarm backbone. I have two hosts on the same local area network, and the error log is identical to the ones above. Following your advice, I used this argument:

--initial_peers /ip4/159.89.214.152/tcp/31337/p2p/QmedTaZXmULqwspJXz44SsPZyTNKxhnnFvYRajfH7MGhCY /ip4/159.203.156.48/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5

With this option, the connection succeeded. However, I see severe network throughput degradation when going through the peers you provided, so I would like to know how to solve the underlying problem directly instead of using the public initial_peers.
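One quick way to tell whether a given `--initial_peers` value points at hosts on the local network or at public ones is to look at the IP address inside each multiaddr. The sketch below uses only Python's standard `ipaddress` module; the function name `is_lan_peer`, the LAN address `192.168.1.10`, and the peer ID `QmHypotheticalPeerId` are placeholders invented for illustration and do not come from this thread.

```python
import ipaddress


def is_lan_peer(maddr: str) -> bool:
    """Return True if the IP address in an /ip4/... or /ip6/... multiaddr is a private one."""
    parts = maddr.strip().split("/")
    # Expected shape: ['', 'ip4' or 'ip6', address, ...]
    if len(parts) < 3 or parts[1] not in ("ip4", "ip6"):
        raise ValueError(f"multiaddr does not start with /ip4/ or /ip6/: {maddr!r}")
    return ipaddress.ip_address(parts[2]).is_private


peers = [
    # Address taken from the option quoted above
    "/ip4/159.89.214.152/tcp/31337/p2p/QmedTaZXmULqwspJXz44SsPZyTNKxhnnFvYRajfH7MGhCY",
    # Hypothetical LAN bootstrap peer, for comparison only
    "/ip4/192.168.1.10/tcp/31337/p2p/QmHypotheticalPeerId",
]
for peer in peers:
    label = "LAN" if is_lan_peer(peer) else "public"
    print(f"{label:6s} {peer}")
```

If every entry passed to `--initial_peers` is reported as public, the server is bootstrapping through hosts outside the local network, which is consistent with the observation above that the option relies on public peers rather than the two LAN hosts.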