The more I experiment with LayerShuffle, the less I feel like it could ever work here. No matter what I've tried, LayerShuffle leads to complete model degeneration.

And really, how could it ever work? A fundamental principle of the transformer architecture is that, through each sequential layer, the model learns how to transform and "compose" intermediate representations of the data, with each layer building upon the previous ones. When you naively shuffle those layers, you create an extreme form of regularization: every layer would need to know how to transform the hidden states of every other layer, in any possible order. Even if this could be made to work at a small scale, adding more layers will almost certainly exacerbate the problem. Sequential models simply do not suffer from this kind of degeneration.
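For context, this is roughly what I mean by naive shuffling (an illustrative sketch, not the exact code in this repo):

```python
import random
import torch.nn as nn

# Illustrative sketch of naive layer shuffling: every forward pass runs the
# decoder layers in a fresh random permutation, so each layer must learn to
# consume hidden states produced by any other layer, at any depth.
class ShuffledDecoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, hidden_states, shuffle: bool = True):
        order = list(range(len(self.layers)))
        if shuffle:
            random.shuffle(order)  # new permutation on every step
        for i in order:
            hidden_states = self.layers[i](hidden_states)
        return hidden_states
```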
We are going to need a different kind of decentralization strategy.
I think a graph-based approach is one potential option, though it's not yet clear to me how it would work.
Another potential option would be a swarm-based/ensemble approach, where many tiny, independent models are asked to work in tandem with one another, towards a common goal. Certainly, this is the approach that most AI organizations are using today, with multi-agent orchestration tooling and Chain of Thought prompting. One model generates an output, which is passed to another in "plain text," which is passed to another... many, many times - until a final output is created. Of course, the main challenge here is speed and compute; routing through a single transformer on desktop hardware is already hard, and routing through many of them is even harder. It splits the computation graph across many independent models, and it would require training many independent models simultaneously. Not to mention that, with models this small, actually making any of them "behave" correctly would be a very real, potentially impossible task.
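To make that concrete, the whole "relay" amounts to something like this (a toy sketch with purely hypothetical names; `generate` stands in for whatever sampling loop each peer runs locally):

```python
# Toy sketch of a text-relay swarm: each tiny, independent model refines the
# previous model's plain-text output until a final answer emerges.
def relay(models, prompt: str, rounds: int = 3) -> str:
    text = prompt
    for _ in range(rounds):
        for model in models:
            # Plain-text handoff: the only "protocol" between peers is text.
            text = model.generate(text)
    return text
```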
I don't particularly like either of these options, but that's where we stand.
I do not yet have a clear idea of how Hivemind should be integrated. Let this issue serve to document a discussion around potential solutions.
Currently, each layer in a decoder is called an "expert", and we attach each expert to the Hivemind as an independent layer/expert, available for computation. When connected to Hivemind, a peer will automatically crawl the DHT in search of active experts. If one is found (and that peer's cache of usable experts is not full), the new expert is automatically added to the peer's machine, where it becomes available for computation.
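Roughly, the discovery flow looks like this (a loose sketch: `hivemind.DHT` is the real hivemind class, but `find_remote_experts` and `wrap_as_layer` are placeholder helpers, not actual APIs from hivemind or this repo):

```python
import hivemind

# Loose sketch of the discovery loop: join the DHT, crawl for experts that
# other peers have announced, and add any found experts to a bounded cache.
def discover_experts(initial_peers, find_remote_experts, wrap_as_layer, max_experts=3):
    dht = hivemind.DHT(initial_peers=initial_peers, start=True)  # real hivemind class
    remote_cache = []
    for expert_info in find_remote_experts(dht, prefix="expert"):  # placeholder helper
        if len(remote_cache) >= max_experts:
            break  # the peer's cache of usable experts is full
        # Wrap the remote expert so it can be called like a local decoder layer.
        remote_cache.append(wrap_as_layer(expert_info))  # placeholder wrapper
    return remote_cache
```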
Currently, remote experts will never be used unless the `--shuffle` argument is also passed. A model does not know how to route through a remote peer without the random permutations introduced by shuffling during training. The exact method of permutation may need to change (I am not having much luck with random, naive shuffling right now), and that is just one of the clear problems with this design.
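To be explicit about the coupling, here's roughly how remote routing currently relates to shuffling (an illustrative sketch, not the repo's actual forward pass; `local_experts`, `remote_experts` and `shuffle` are just placeholder names):

```python
import random

# Sketch of the current constraint: remote experts only enter the stack when
# the layer order is being permuted anyway, because that is the only routing
# pattern the model has seen during training.
def forward_pass(hidden_states, local_experts, remote_experts, shuffle=False):
    experts = list(local_experts)
    if shuffle:
        experts += list(remote_experts)  # remote peers join only under --shuffle
        random.shuffle(experts)          # naive, random permutation (the current method)
    for expert in experts:
        hidden_states = expert(hidden_states)
    return hidden_states
```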
Anyway, those are my thoughts. More to come.