bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

Swarm balancing logic issues #389

Open fadenb opened 11 months ago

fadenb commented 11 months ago

Hey 👋,

I am opening this issue to discuss the current swarm balancing approach.

Recently I have seen that the public swarm hosting enoch/llama-65b-hf is unbalanced. This by itself is neither a surprise nor a problem: the imbalance is normally remediated by servers loading other blocks. All good so far.

Today I noticed that, when rebalancing, my server reloads the very same blocks it was already serving. Since loading blocks is quite slow (often around 10 minutes), this effectively removes that server's compute capacity from the swarm for 10 minutes without providing any benefit.

A log excerpt might explain the situation better. Notice that [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] is loaded initially, and then the exact same range [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] is loaded again to rebalance:

Jul 20 12:08:47.880 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Jul 20 12:08:47.880 [INFO] Using DHT prefix: llama-65b-hf
Jul 20 12:08:57.909 [INFO] This server is accessible directly
Jul 20 12:09:02.623 [INFO] Connecting to the public swarm
Jul 20 12:09:02.624 [INFO] Running a server on ['/ip4/172.17.0.2/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf', '/ip4/127.0.0.1/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf', '/ip4/147.189.193.61/tcp/31330/p2p/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf']
Jul 20 12:09:02.646 [INFO] Model weights are loaded in float16, quantized to nf4 format
Jul 20 12:09:02.647 [INFO] Attention cache for all blocks will consume up to 1.25 GiB
Jul 20 12:09:02.648 [INFO] Loading throughput info
Jul 20 12:09:02.684 [INFO] Reporting throughput: 2203.3 RPS for 20 blocks
Jul 20 12:09:04.430 [INFO] Reachability service started
Jul 20 12:09:08.345 [INFO] Announced that blocks [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are joining
Jul 20 12:09:15.051 [INFO] Loaded enoch/llama-65b-hf block 60, <All keys matched successfully>
Downloading (…)/adapter_config.json: 100%|██████████| 425/425 [00:00<00:00, 2.09MB/s]
Downloading (…)er_model.safetensors: 100%|██████████| 3.20G/3.20G [00:54<00:00, 58.3MB/s]
Jul 20 12:10:36.878 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:10:37.081 [INFO] Loaded adapter timdettmers/guanaco-65b for block 60
Jul 20 12:10:44.745 [INFO] Loaded enoch/llama-65b-hf block 61, <All keys matched successfully>
Jul 20 12:11:08.242 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:11:08.441 [INFO] Loaded adapter timdettmers/guanaco-65b for block 61
Jul 20 12:11:16.205 [INFO] Loaded enoch/llama-65b-hf block 62, <All keys matched successfully>
Jul 20 12:11:38.475 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:11:38.669 [INFO] Loaded adapter timdettmers/guanaco-65b for block 62
Jul 20 12:11:45.308 [INFO] Loaded enoch/llama-65b-hf block 63, <All keys matched successfully>
Jul 20 12:12:08.372 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:12:08.595 [INFO] Loaded adapter timdettmers/guanaco-65b for block 63
Jul 20 12:12:17.520 [INFO] Loaded enoch/llama-65b-hf block 64, <All keys matched successfully>
Jul 20 12:12:40.703 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:12:41.066 [INFO] Loaded adapter timdettmers/guanaco-65b for block 64
Jul 20 12:12:48.411 [INFO] Loaded enoch/llama-65b-hf block 65, <All keys matched successfully>
Jul 20 12:12:59.529 [INFO] reachability.rpc_check(remote_peer=...ZFKwzs, check_peer=...ZFKwzs) -> False
Jul 20 12:13:11.434 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:13:11.927 [INFO] Loaded adapter timdettmers/guanaco-65b for block 65
Jul 20 12:13:19.812 [INFO] Loaded enoch/llama-65b-hf block 66, <All keys matched successfully>
Jul 20 12:13:43.257 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:13:43.823 [INFO] Loaded adapter timdettmers/guanaco-65b for block 66
Jul 20 12:13:51.392 [INFO] Loaded enoch/llama-65b-hf block 67, <All keys matched successfully>
Jul 20 12:14:16.225 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:14:16.776 [INFO] Loaded adapter timdettmers/guanaco-65b for block 67
Jul 20 12:14:25.466 [INFO] Loaded enoch/llama-65b-hf block 68, <All keys matched successfully>
Jul 20 12:14:49.068 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:14:49.586 [INFO] Loaded adapter timdettmers/guanaco-65b for block 68
Jul 20 12:14:57.751 [INFO] Loaded enoch/llama-65b-hf block 69, <All keys matched successfully>
Jul 20 12:15:20.843 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:15:21.370 [INFO] Loaded adapter timdettmers/guanaco-65b for block 69
Jul 20 12:15:34.991 [INFO] Loaded enoch/llama-65b-hf block 70, <All keys matched successfully>
Jul 20 12:15:57.221 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:15:57.713 [INFO] Loaded adapter timdettmers/guanaco-65b for block 70
Jul 20 12:16:08.368 [INFO] Loaded enoch/llama-65b-hf block 71, <All keys matched successfully>
Jul 20 12:16:29.393 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:16:29.884 [INFO] Loaded adapter timdettmers/guanaco-65b for block 71
Jul 20 12:16:36.503 [INFO] Loaded enoch/llama-65b-hf block 72, <All keys matched successfully>
Jul 20 12:16:57.748 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:16:58.263 [INFO] Loaded adapter timdettmers/guanaco-65b for block 72
Jul 20 12:17:05.251 [INFO] Loaded enoch/llama-65b-hf block 73, <All keys matched successfully>
Jul 20 12:17:26.114 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:17:26.605 [INFO] Loaded adapter timdettmers/guanaco-65b for block 73
Jul 20 12:17:33.660 [INFO] Loaded enoch/llama-65b-hf block 74, <All keys matched successfully>
Jul 20 12:17:54.764 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:17:55.280 [INFO] Loaded adapter timdettmers/guanaco-65b for block 74
Jul 20 12:18:02.302 [INFO] Loaded enoch/llama-65b-hf block 75, <All keys matched successfully>
Jul 20 12:18:23.076 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:18:23.551 [INFO] Loaded adapter timdettmers/guanaco-65b for block 75
Jul 20 12:18:30.137 [INFO] Loaded enoch/llama-65b-hf block 76, <All keys matched successfully>
Jul 20 12:18:50.908 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:18:51.420 [INFO] Loaded adapter timdettmers/guanaco-65b for block 76
Jul 20 12:18:57.203 [INFO] Loaded enoch/llama-65b-hf block 77, <All keys matched successfully>
Jul 20 12:19:17.972 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:19:18.472 [INFO] Loaded adapter timdettmers/guanaco-65b for block 77
Jul 20 12:19:23.977 [INFO] Loaded enoch/llama-65b-hf block 78, <All keys matched successfully>
Jul 20 12:19:44.690 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:19:45.199 [INFO] Loaded adapter timdettmers/guanaco-65b for block 78
Jul 20 12:19:50.305 [INFO] Loaded enoch/llama-65b-hf block 79, <All keys matched successfully>
Jul 20 12:20:11.381 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:20:11.894 [INFO] Loaded adapter timdettmers/guanaco-65b for block 79
Jul 20 12:20:11.962 [WARN] [petals.server.reachability.validate_reachability:40] Skipping reachability check because health.petals.ml is down: ConnectionError(MaxRetryError("HTTPConnectionPool(host='health.petals.ml', port=80): Max retries exceeded with url: /api/v1/is_reachable/12D3KooWFS61Xw7XJksfwDg6tYdBAXYuChTCkZTwqxqqWQdHFAQf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbf09f084f0>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))
Jul 20 12:20:14.168 [INFO] Started
Jul 20 12:26:02.132 [INFO] Swarm balance quality: 65.3%
Jul 20 12:26:02.133 [INFO] Swarm is imbalanced, server will load other blocks
Jul 20 12:26:03.947 [INFO] Announced that blocks ['llama-65b-hf.60', 'llama-65b-hf.61', 'llama-65b-hf.62', 'llama-65b-hf.63', 'llama-65b-hf.64', 'llama-65b-hf.65', 'llama-65b-hf.66', 'llama-65b-hf.67', 'llama-65b-hf.68', 'llama-65b-hf.69', 'llama-65b-hf.70', 'llama-65b-hf.71', 'llama-65b-hf.72', 'llama-65b-hf.73', 'llama-65b-hf.74', 'llama-65b-hf.75', 'llama-65b-hf.76', 'llama-65b-hf.77', 'llama-65b-hf.78', 'llama-65b-hf.79'] are offline
Jul 20 12:26:06.251 [INFO] Shutting down
Jul 20 12:26:06.266 [INFO] Module container shut down successfully
Jul 20 12:26:06.492 [INFO] Cleaning up, left 0.3 GiB allocated memory, 6.3 GiB reserved memory
Jul 20 12:26:12.177 [INFO] Announced that blocks [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are joining
Jul 20 12:26:19.559 [INFO] Loaded enoch/llama-65b-hf block 60, <All keys matched successfully>
Jul 20 12:26:41.387 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:26:41.927 [INFO] Loaded adapter timdettmers/guanaco-65b for block 60
Jul 20 12:26:49.273 [INFO] Loaded enoch/llama-65b-hf block 61, <All keys matched successfully>
Jul 20 12:27:13.392 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:27:13.971 [INFO] Loaded adapter timdettmers/guanaco-65b for block 61
Jul 20 12:27:21.899 [INFO] Loaded enoch/llama-65b-hf block 62, <All keys matched successfully>
Jul 20 12:27:43.149 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:27:43.671 [INFO] Loaded adapter timdettmers/guanaco-65b for block 62
Jul 20 12:27:50.241 [INFO] Loaded enoch/llama-65b-hf block 63, <All keys matched successfully>
Jul 20 12:28:11.106 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:28:11.609 [INFO] Loaded adapter timdettmers/guanaco-65b for block 63
Jul 20 12:28:18.728 [INFO] Loaded enoch/llama-65b-hf block 64, <All keys matched successfully>
Jul 20 12:28:40.008 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout
Jul 20 12:28:40.396 [INFO] Loaded adapter timdettmers/guanaco-65b for block 64
Jul 20 12:28:48.484 [INFO] Loaded enoch/llama-65b-hf block 65, <All keys matched successfully>
Jul 20 12:29:09.470 [INFO] Adapter timdettmers/guanaco-65b has dropout enabled, this server will disable dropout

While this is an extreme example of the problem, I have more often seen cases where the old and new block lists partially overlap. Even then, the overlapping blocks are loaded from scratch instead of being reused.

Are there any obvious fixes for this behavior besides adjusting the --balance_quality setting or pinning blocks? Should we reorder the steps so that the new blocks are selected before deciding to unload the current ones?
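To make the reordering idea concrete, here is a minimal sketch (plain Python with placeholder names, not the actual petals server code): choose the new range first, then only load what is missing.

    # pick the new range before unloading anything, and reuse the overlap
    old_blocks = set(range(60, 80))   # blocks currently held in GPU memory
    new_blocks = set(range(55, 75))   # blocks chosen by the balancer

    keep = sorted(old_blocks & new_blocks)   # already loaded, no work needed
    load = sorted(new_blocks - old_blocks)   # only these go through the slow load path
    drop = sorted(old_blocks - new_blocks)   # only these get announced as offline

    if not load and not drop:
        print("new selection equals the current one - skip the restart entirely")
    else:
        print(f"keep {keep}, load {load}, drop {drop}")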

borzunov commented 11 months ago

Hi @fadenb,

What you're saying is 100% reasonable; we just haven't had time to do it, since it would require additional complexity on the server side. If you can help with this feature, let us know - we'd be happy to have such a pull request.

iateadonut commented 11 months ago

mine is doing the same thing:

Jul 24 18:26:43 danserver petals[1297]: Jul 24 18:26:43.749 [INFO] Swarm balance quality: 62.8%
Jul 24 18:26:43 danserver petals[1297]: Jul 24 18:26:43.749 [INFO] Swarm is imbalanced, server will load other blocks
Jul 24 18:26:46 danserver petals[1297]: Jul 24 18:26:46.507 [INFO] Announced that blocks ['llama-65b-hf.0', 'llama-65b-hf.1', 'llama-65b-hf.2', 'llama-65b-hf.3', 'llama-65b-hf.4', 'llama-65b-hf.5', 'llama-65b-hf.6', 'llama-65b-hf.7', 'llama-65b-hf.8', 'llama-65b-hf.9', 'llama-65b-hf.10', 'llama-65b-hf.11', 'llama-65b-hf.12', 'llama-65b-hf.13', 'llama-65b-hf.14', 'llama-65b-hf.15', 'llama-65b-hf.16', 'llama-65b-hf.17', 'llama-65b-hf.18', 'llama-65b-hf.19', 'llama-65b-hf.20', 'llama-65b-hf.21', 'llama-65b-hf.22', 'llama-65b-hf.23', 'llama-65b-hf.24', 'llama-65b-hf.25', 'llama-65b-hf.26', 'llama-65b-hf.27', 'llama-65b-hf.28', 'llama-65b-hf.29', 'llama-65b-hf.30', 'llama-65b-hf.31', 'llama-65b-hf.32'] are offline
Jul 24 18:26:51 danserver petals[1297]: Jul 24 18:26:51.787 [INFO] Shutting down
Jul 24 18:26:51 danserver petals[1297]: Jul 24 18:26:51.820 [INFO] Module container shut down successfully
Jul 24 18:26:51 danserver petals[1297]: Jul 24 18:26:51.959 [INFO] Cleaning up, left 0.5 GiB allocated memory, 11.8 GiB reserved memory
Jul 24 18:27:01 danserver petals[1297]: Jul 24 18:27:01.164 [INFO] Announced that blocks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32] are joining

I'm working around this with these arguments: --block_indices 28:60 --balance_quality 0.0

@borzunov In creating tests to build a better algorithm for choosing other blocks, are there any examples in tests/ of setting up several mock CPU servers (and mock blocks) that can talk to each other in a test swarm? And should the method that chooses blocks always return sequential blocks?

borzunov commented 11 months ago

Hi @iateadonut,

Yes, a server should host a set of sequential blocks. Re mock CPU servers, you can create a private swarm with a really small model like bigscience/bloom-560m and CPU-only servers, like we do in CI tests.

iateadonut commented 11 months ago

Is dht_utils.get_remote_module_infos() supposed to return information only about remote servers? When running several CPU servers on my localhost, it also returns my own server's information.

I ask because block_selection._choose_best_start and block_selection.should_choose_other_blocks use throughputs derived from get_remote_module_infos(); since get_remote_module_infos() returns throughput that includes the server's own blocks, there are bound to be some problems.

Second, I'm writing unit tests for some of the block selection functions, including _choose_best_start and should_choose_other_blocks. I did not see either of them covered in the test suite and will add more as necessary while I work to figure this out.
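For instance, a first unit test might look roughly like this (it assumes _choose_best_start(throughputs, num_blocks) returns the start index of the weakest contiguous window, which is how I read the code; if that's wrong I'll adjust):

    import numpy as np

    from petals.server.block_selection import _choose_best_start


    def test_choose_best_start_prefers_uncovered_blocks():
        # blocks 0-9 are well covered, blocks 10-19 have no throughput at all,
        # so the weakest window of length 10 should start at block 10
        throughputs = np.array([100.0] * 10 + [0.0] * 10)
        assert _choose_best_start(throughputs, 10) == 10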

borzunov commented 11 months ago

Hi @iateadonut,

dht_utils.get_remote_module_infos() returns information about all servers (remote ones and your own).

A good example of using this function is the source code of https://health.petals.dev - see the place where get_remote_module_infos() is called.

Re tests for swarm balancing, they are indeed missing at the moment - I'd appreciate it if you added them in some form.

Please note that our CI doesn't connect to the public swarm and launches a tiny isolated swarm with BLOOM-560m instead - you'd have to write your tests with this constraint in mind.

iateadonut commented 11 months ago

Thanks. Is there a method that gets only 'remote' module infos?

borzunov commented 11 months ago

@iateadonut No, but you can filter out your local peer_id to keep only remote infos, like we do in should_choose_other_blocks().
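For example, the filtering could look roughly like this (a sketch assuming the module_infos entries expose a .servers dict keyed by peer_id, as RemoteModuleInfo does):

    def drop_local_server(module_infos, local_peer_id):
        # keep only entries reported by other peers; None entries
        # (blocks with no known servers) are passed through unchanged
        for info in module_infos:
            if info is not None:
                info.servers = {
                    peer_id: server_info
                    for peer_id, server_info in info.servers.items()
                    if peer_id != local_peer_id
                }
        return module_infos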

borzunov commented 11 months ago

@fadenb @iateadonut For the record, another reason why downloading blocks is slow is that StableBeluga2 weights are distributed in float32 and Llama weights are distributed in float16, while we host them in 4-bit (nf4). This means that we download 8x/4x more data than necessary (the same overhead applies to disk space and disk reading time).

So an alternative is to implement functionality that allows downloading (or loading from disk) the model in nf4 right away. @mryab was working on this for int8 in #273; we may need to revive this PR and prioritize this feature.
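For context, this is what "quantize on load" looks like with plain transformers + bitsandbytes. Note that it still downloads the full float16/float32 checkpoint and only converts to nf4 while loading, which is exactly the overhead a pre-quantized nf4 download would avoid (illustration only, not petals' actual loading path):

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # weights are downloaded in full precision and quantized to nf4 on the fly
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
    model = AutoModelForCausalLM.from_pretrained(
        "enoch/llama-65b-hf", quantization_config=quant_config, device_map="auto"
    )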

iateadonut commented 11 months ago

i'm working now on creating a test for block_selection: https://github.com/iateadonut/petals/blob/danO/tests/test_block_selection.py

The test above works: for a simple mock of 2 servers both running blocks 1-16 of a 24-block model, it passes. I'm going to work on getting the current module_infos from the live server so I can mock its setup and see if I can find the problem.

Do you think we should move this block in https://github.com/bigscience-workshop/petals/blob/main/src/petals/server/block_selection.py to its own function for easier testing? If so, should it be called _new_throughput()?

    moved = True
    while moved:
        servers = list(spans.keys())
        np.random.shuffle(servers)

        moved = False
        for peer_id in servers:
            span = spans[peer_id]
            throughputs[span.start : span.end] -= span.throughput * (1 + eps)

            new_start = _choose_best_start(throughputs, span.length)

            throughputs[span.start : span.end] += span.throughput * eps
            if span.start != new_start:
                span.move_to(new_start)
                moved = True
            throughputs[span.start : span.end] += span.throughput

    new_throughput = throughputs.min()

borzunov commented 11 months ago

@iateadonut Yes, you can extract it into a separate function if it's useful.
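For instance, the extraction could look roughly like this (just a suggestion that mirrors the loop quoted above; it would live in block_selection.py, so np and _choose_best_start are already in scope, and the behavior is unchanged):

    def _simulate_rebalancing(spans, throughputs, eps):
        # greedily move every span (including other servers' spans) to its best
        # position until nothing moves, then report the resulting bottleneck
        moved = True
        while moved:
            servers = list(spans.keys())
            np.random.shuffle(servers)

            moved = False
            for peer_id in servers:
                span = spans[peer_id]
                throughputs[span.start : span.end] -= span.throughput * (1 + eps)

                new_start = _choose_best_start(throughputs, span.length)

                throughputs[span.start : span.end] += span.throughput * eps
                if span.start != new_start:
                    span.move_to(new_start)
                    moved = True
                throughputs[span.start : span.end] += span.throughput

        return throughputs.min()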

iateadonut commented 11 months ago

I have a module_infos dump covering all 80 blocks and their servers; it is used to mock this test: https://github.com/iateadonut/petals/blob/danO/tests/test_block_selection.py#L18

the throughput of the server looks like this: [3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 3183.04597759 1459.4677943 4613.53125511 1459.4677943 4613.53125511 4613.53125511 1459.4677943 1459.4677943 4613.53125511 4613.53125511 4613.53125511 1459.4677943 1459.4677943 1459.4677943 3850.59180199 3850.59180199 696.52834117 696.52834117 3850.59180199 696.52834117 696.52834117 2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 2899.81165181 2899.81165181 2899.81165181 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014]

the throughput of the server minus the local server looks like this: [2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 2419.34358501 695.76540172 3849.82886253 695.76540172 3849.82886253 3849.82886253 695.76540172 695.76540172 3849.82886253 3849.82886253 3849.82886253 695.76540172 695.76540172 695.76540172 3850.59180199 3850.59180199 696.52834117 696.52834117 3850.59180199 696.52834117 696.52834117 2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181 2899.81165181 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 4743.68463907 2899.81165181 2899.81165181 2899.81165181 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014 2932.25098014]

It yields "Swarm balance quality: 47.7%", then the server (which holds 33 blocks) restarts and starts again at the same place it started last time, block 1.

I will do some more work on this this week. I wanted to share the throughput and the modified (local-server-excluded) throughput in case anything in them points to a solution I might not see so easily.

iateadonut commented 10 months ago

@borzunov Can you explain this:

https://github.com/bigscience-workshop/petals/blame/063e94b4c8027e1e8d47061681007e9db292734f/src/petals/server/block_selection.py#L94

It looks like you're trying to check what the new swarm throughput would be if the local server changed the blocks it serves AND all other servers changed theirs as well. Is that correct?

If that's the case, I wonder if this can work well in a live environment, where you have at least a few minutes between each time each server runs should_choose_other_blocks.

What do you think? Should we figure out a different way to find swarm balance quality? Any ideas?

borzunov commented 10 months ago

@iateadonut, in this code, a server simulates what others would do if it moves. This is necessary so that we can know the final throughput it is possible to reach after moving.

For example, imagine that we have 30 blocks and 3 servers hosting blocks 0:10. The total throughput is zero since nobody hosts blocks 20:30.

If we only consider the throughput after the current server moves, then no server will ever move (since if anyone moves to 10:20, the total throughput will be still zero).

So the servers simulate that if they move to 10:20, some other server is likely to move to 20:30, and we'll have non-zero throughput in the end. Then they can decide that moving is actually worth it.
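Here is a self-contained toy run of that scenario (not the actual petals code; the start-selection heuristic is a simplified stand-in for _choose_best_start):

    import numpy as np

    NUM_BLOCKS, SPAN_LEN, THROUGHPUT = 30, 10, 100.0
    spans = {f"server{i}": 0 for i in range(3)}  # all three servers currently host blocks 0:10

    def weakest_start(throughputs, length):
        # simplified stand-in for _choose_best_start: pick the window with the
        # lowest total throughput (ties broken by the smallest start index)
        sums = [(throughputs[i : i + length].sum(), i) for i in range(len(throughputs) - length + 1)]
        return min(sums)[1]

    throughputs = np.zeros(NUM_BLOCKS)
    for start in spans.values():
        throughputs[start : start + SPAN_LEN] += THROUGHPUT

    moved = True
    while moved:
        moved = False
        for name, start in spans.items():
            throughputs[start : start + SPAN_LEN] -= THROUGHPUT
            new_start = weakest_start(throughputs, SPAN_LEN)
            if new_start != start:
                spans[name], moved = new_start, True
            throughputs[spans[name] : spans[name] + SPAN_LEN] += THROUGHPUT

    print(spans)              # {'server0': 10, 'server1': 20, 'server2': 0}
    print(throughputs.min())  # 100.0 after the simulated moves, instead of 0.0 before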

Please refer to a draft of our new paper to find details of how it works: https://openreview.net/pdf?id=HLQyRgRnoXo (pages 19-20, Appendices D-E)

iateadonut commented 10 months ago

I'm running some tests and here's one thing I found - these are only a few minutes apart:

These logs are from should_choose_other_blocks, where it compares local_span.start == new_start at https://github.com/bigscience-workshop/petals/blame/063e94b4c8027e1e8d47061681007e9db292734f/src/petals/server/block_selection.py#L87 :

'-- new_start and current start' 1692537985.0252135 '22:26:25' '65 2'
'-- new_start and current start' 1692538243.7270544 '22:30:43' '2 2'

These logs are just a few minutes apart. I'm running more tests now so I can get timestamped module_infos logs to investigate further.

My suspicion is that when a single server decides to choose new blocks, by the time it actually does so, the best start block has already changed.

I'll be working to get real time module_infos data to mock and test.

iateadonut commented 10 months ago

I think an easy way to solve this might be to recalculate 'throughputs' two more times after new_start = _choose_best_start(), in a loop waiting about a minute between calculations, and return False if new_start isn't the same after each calculation.

I have a feeling there may be some problems with this, though. If the problem is two servers colliding, wouldn't they both go through this retry process at the same time and end up colliding anyway?

I'm testing this now on the live swarm to see if the bug crops up while running the server this way:

    def _should_choose_other_blocks(self) -> bool:
        # note: this patch assumes `import random`, `import time`, and
        # `from pprint import pprint` are available in server.py
        if self.strict_block_indices is not None:
            return False

        module_infos = get_remote_module_infos(self.dht, self.module_uids, latest=True)
        should_choose = block_selection.should_choose_other_blocks(self.dht.peer_id, module_infos, self.balance_quality)

        if not should_choose:
            return False

        # re-check a couple of times before committing to a reload, in case the
        # swarm rebalances itself in the meantime
        for i in range(2):
            wait_time = 90 + random.randint(-30, 10)
            time.sleep(wait_time)

            module_infos = get_remote_module_infos(self.dht, self.module_uids, latest=True)
            pprint('--retrying should_choose_other_blocks')
            should_choose = block_selection.should_choose_other_blocks(self.dht.peer_id, module_infos, self.balance_quality)

            if not should_choose:
                return False

        return should_choose

iateadonut commented 10 months ago

These are some logs I've taken from running the above within server.py:

'-- start new_start' '0 0'
'-- start new_start' '0 1'
'--retrying should_choose_other_blocks'
'-- start new_start' '0 0'
'-- start new_start' '0 0'
...
'0 0'
'-- start new_start' '0 40'
'--retrying should_choose_other_blocks'
'-- start new_start' '0 0'
'-- start new_start' '0 0'
...
'-- start new_start' '18 30'
'-- start new_start' '18 33'
'--retrying should_choose_other_blocks'
'-- start new_start' '18 30'
'-- start new_start' '18 30'
...

You can see here that this has been working well at preventing unnecessary restarts.

The 'start new_start' lines in the logs come from should_choose_other_blocks and show the current start and the suggested new start.

It did fail once here, where it ended up rebalancing twice:

'-- start new_start' '40 30'
'-- start new_start' '40 30'
'--retrying should_choose_other_blocks'
'-- start new_start' '40 30'
'--retrying should_choose_other_blocks'
'-- start new_start' '40 30'
'-- choose_best_blocks; used when restarting'
'-- start new_start' '30 18'
'--retrying should_choose_other_blocks'
'-- start new_start' '30 18'
'--retrying should_choose_other_blocks'
'-- start new_start' '30 18'
'-- choose_best_blocks; used when restarting'
'-- start new_start' '18 18'
'-- start new_start' '18 18'

I don't know why that happened, but otherwise this small change has prevented unnecessary rebalancing at least 15 times over a few days.

I'll continue to use this in the newest versions on my server and keep logs with time stamps moving forward.

I've created a pull request: https://github.com/bigscience-workshop/petals/pull/493

Let me know if there should be any changes or other ways to move forward.

iateadonut commented 10 months ago

just updating with some more logs:

$ grep -E '--retry|choose_best' -B5 -A10 ./log-1693523246

'-- choose_best_blocks; used when restarting'
'2023-09-01 08:43:59'
'-- start new_start'
'36 36'
'2023-09-01 08:45:35'
'-- start new_start'
'36 36'
'2023-09-01 08:46:45'
'-- start new_start'
'36 36'
'2023-09-01 08:47:54'
--
'-- start new_start'
'36 0'
'2023-09-03 12:05:22'
'-- start new_start'
'36 15'
'--retrying should_choose_other_blocks'
'2023-09-03 12:07:07'
'-- start new_start'
'36 0'
'2023-09-03 12:07:46'
'-- start new_start'
'36 0'
'2023-09-03 12:08:28'
'-- start new_start'
'36 0'
'2023-09-03 12:09:31'
--
'-- start new_start'
'36 36'
'2023-09-04 22:06:40'
'-- start new_start'
'36 13'
'--retrying should_choose_other_blocks'
'2023-09-04 22:08:20'
'-- start new_start'
'36 36'
'2023-09-04 22:09:19'
'-- start new_start'
'36 36'
'2023-09-04 22:10:32'
'-- start new_start'
'36 36'
'2023-09-04 22:10:36'
--
'-- start new_start'
'36 36'
'2023-09-05 10:32:37'
'-- start new_start'
'36 4'
'--retrying should_choose_other_blocks'
'2023-09-05 10:34:17'
'-- start new_start'
'36 14'
'2023-09-05 10:34:58'
'-- start new_start'
'36 14'
'2023-09-05 10:36:12'
'-- start new_start'
'36 14'
'2023-09-05 10:36:25'
--
'-- start new_start'
'36 36'
'2023-09-05 16:18:01'
'-- start new_start'
'36 0'
'--retrying should_choose_other_blocks'
'2023-09-05 16:19:37'
'-- start new_start'
'36 0'
'--retrying should_choose_other_blocks'
'2023-09-05 16:20:56'
'-- start new_start'
'36 30'
'2023-09-05 16:21:34'
'-- start new_start'
'36 30'
'2023-09-05 16:22:55'
'-- start new_start'
'36 30'
'2023-09-05 16:24:30'

'-- choose_best_blocks' appears in the log whenever choose_best_blocks is run, i.e. when the blocks are reloaded.

As you can see, over 5 days continuously online, this edit has stopped the server from reloading blocks unnecessarily. The last time this happened, it was probably because the swarm balance had already improved on its own by the time of the recheck.