The more I experiment with LayerShuffle, the less I feel like it could ever work here. No matter what I've tried, LayerShuffle leads to complete model degeneration.

And really, how could it ever work? A fundamental principle of the transformer architecture is that, through each sequential layer, the model learns how to transform and "compose" intermediate representations of the data, with each layer building upon the previous ones. When you naively shuffle those layers, you create an extreme form of regularization: every layer would need to know how to transform the hidden states of every other layer, in any possible order. Even if this could be made to work at a small scale, adding more layers will almost certainly exacerbate the problem. Sequential models simply do not suffer from this kind of degeneration.
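For context, this is roughly what I mean by naive shuffling (an illustrative sketch, not the exact code in this repo):

```python
import random
import torch.nn as nn

# Illustrative sketch of naive layer shuffling: every forward pass runs the
# decoder layers in a fresh random permutation, so each layer must learn to
# consume hidden states produced by any other layer, at any depth.
class ShuffledDecoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, hidden_states, shuffle: bool = True):
        order = list(range(len(self.layers)))
        if shuffle:
            random.shuffle(order)  # new permutation on every step
        for i in order:
            hidden_states = self.layers[i](hidden_states)
        return hidden_states
```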
We are going to need a different kind of decentralization strategy.
I think a graph-based approach is one potential option, though it's not yet clear to me how it would work.
Another potential option would be a swarm-based/ensemble approach, where many tiny, independent models are asked to work in tandem with one another, towards a common goal. Certainly, this is the approach that most AI organizations are using today, with multi-agent orchestration tooling and Chain of Thought prompting. One model generates an output, which is passed to another in "plain text," which is passed to another... many, many times - until a final output is created. Of course, the main challenge here is speed and compute; routing through a single transformer on desktop hardware is already hard, and routing through many of them is even harder. It splits the computation graph across many independent models, and it would require training many independent models simultaneously. Not to mention that, with models this small, actually making any of them "behave" correctly would be a very real, potentially impossible task.
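To make that concrete, the whole "relay" amounts to something like this (a toy sketch with purely hypothetical names; `generate` stands in for whatever sampling loop each peer runs locally):

```python
# Toy sketch of a text-relay swarm: each tiny, independent model refines the
# previous model's plain-text output until a final answer emerges.
def relay(models, prompt: str, rounds: int = 3) -> str:
    text = prompt
    for _ in range(rounds):
        for model in models:
            # Plain-text handoff: the only "protocol" between peers is text.
            text = model.generate(text)
    return text
```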
I don't particularly like either of these options, but that's where we stand.
I do not yet have a clear idea of how Hivemind should be integrated. Let this issue serve to document a discussion around potential solutions.
Currently, each layer in a decoder is called an "expert", and we attach each expert to the Hivemind as an independent layer/expert, available for computation. When connected to Hivemind, a peer will automatically crawl the DHT in search of active experts. If one is found (and that peer's cache of usable experts is not full), the new expert is automatically added to the peer's machine, where it becomes available for computation.
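Roughly, the discovery flow looks like this (a loose sketch: `hivemind.DHT` is the real hivemind class, but `find_remote_experts` and `wrap_as_layer` are placeholder helpers, not actual APIs from hivemind or this repo):

```python
import hivemind

# Loose sketch of the discovery loop: join the DHT, crawl for experts that
# other peers have announced, and add any found experts to a bounded cache.
def discover_experts(initial_peers, find_remote_experts, wrap_as_layer, max_experts=3):
    dht = hivemind.DHT(initial_peers=initial_peers, start=True)  # real hivemind class
    remote_cache = []
    for expert_info in find_remote_experts(dht, prefix="expert"):  # placeholder helper
        if len(remote_cache) >= max_experts:
            break  # the peer's cache of usable experts is full
        # Wrap the remote expert so it can be called like a local decoder layer.
        remote_cache.append(wrap_as_layer(expert_info))  # placeholder wrapper
    return remote_cache
```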
Currently, remote experts will never be used unless the `--shuffle` argument is also passed. A model does not know how to route through a remote peer without the random permutations introduced by shuffling during training. The exact method of permutation may need to change (I am not having much luck with random, naive shuffling right now), and that is just one of the clear problems with this design.
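To be explicit about the coupling, here's roughly how remote routing currently relates to shuffling (an illustrative sketch, not the repo's actual forward pass; `local_experts`, `remote_experts` and `shuffle` are just placeholder names):

```python
import random

# Sketch of the current constraint: remote experts only enter the stack when
# the layer order is being permuted anyway, because that is the only routing
# pattern the model has seen during training.
def forward_pass(hidden_states, local_experts, remote_experts, shuffle=False):
    experts = list(local_experts)
    if shuffle:
        experts += list(remote_experts)  # remote peers join only under --shuffle
        random.shuffle(experts)          # naive, random permutation (the current method)
    for expert in experts:
        hidden_states = expert(hidden_states)
    return hidden_states
```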
Anyway, those are my thoughts. More to come.