jaggzh opened this issue 6 months ago
@jaggzh Great question! We're actively exploring this and are open to collaborations. So far, there aren't any concise papers on merging models with different architectures. It's worth noting that this problem would likely require reconciling hidden sizes and layer counts.
However, there is one notable work from Google DeepMind, "LLM Augmented LLMs: Expanding Capabilities through Composition" (CALM), which uses cross-attention to compose models, though it requires training: https://arxiv.org/abs/2401.02412
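To make the cross-attention idea concrete, here is a minimal numpy sketch of composing two models with different hidden sizes in the spirit of that paper. All shapes and names here are illustrative assumptions, not the paper's actual implementation: the anchor model's tokens query the augmenting model's hidden states after a learned projection bridges the dimension mismatch, and the result is added back residually.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: anchor model hidden size 8, augmenting model hidden
# size 12; the two models may also see different sequence lengths.
T_ANC, D_ANC = 4, 8
T_AUG, D_AUG = 6, 12

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(anchor_h, aug_h, W_proj, W_q, W_k, W_v):
    """Anchor tokens attend over the (projected) augmenting model's
    hidden states; attended values are added back residually."""
    aug_p = aug_h @ W_proj                    # (T_AUG, D_ANC): bridge dims
    q = anchor_h @ W_q                        # (T_ANC, D_ANC)
    k, v = aug_p @ W_k, aug_p @ W_v           # (T_AUG, D_ANC) each
    attn = softmax(q @ k.T / np.sqrt(D_ANC))  # (T_ANC, T_AUG)
    return anchor_h + attn @ v                # residual, (T_ANC, D_ANC)

anchor_h = rng.normal(size=(T_ANC, D_ANC))
aug_h = rng.normal(size=(T_AUG, D_AUG))
# In the real method these matrices are the (only) trained parameters;
# here they are just random placeholders.
W_proj = rng.normal(size=(D_AUG, D_ANC)) * 0.1
W_q, W_k, W_v = (rng.normal(size=(D_ANC, D_ANC)) * 0.1 for _ in range(3))

out = cross_attend(anchor_h, aug_h, W_proj, W_q, W_k, W_v)
```

The key point is that only the projection and attention matrices need training; both base models stay frozen, which is why this sidesteps the hidden-size mismatch that pure merging cannot.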
If focusing solely on merging, one current approach could be to first prune an LLM, e.g. reducing the parameter count of Mistral, then train the pruned model and apply Data Flow Space (DFS) evolutionary merging to combine it with a larger model. This could be simpler since we wouldn't have to deal with differing hidden sizes.
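The DFS merging step above can be sketched with a toy hill-climbing search. This is a deliberately simplified stand-in, assuming scalar "layers" and a made-up fitness target; real DFS merging evolves a path through the frozen transformer blocks of both models and scores it on benchmark tasks.

```python
import random

# Toy stand-ins: each "layer" is a scalar function; a model is a layer list.
# model_a plays the pruned model, model_b the larger one.
model_a = [lambda x, k=k: x + k for k in (1, 2, 3)]
model_b = [lambda x, k=k: x * k for k in (2, 3)]

# DFS merging searches over paths that interleave layers from both models.
layer_pool = [("A", i) for i in range(len(model_a))] + \
             [("B", i) for i in range(len(model_b))]

def run_path(path, x=1.0):
    for model, i in path:
        layer = model_a[i] if model == "A" else model_b[i]
        x = layer(x)
    return x

def fitness(path, target=20.0):
    # Hypothetical objective: closer to the target output = fitter.
    return -abs(run_path(path) - target)

def evolve(generations=200, path_len=4, seed=0):
    rng = random.Random(seed)
    best = [rng.choice(layer_pool) for _ in range(path_len)]
    for _ in range(generations):
        cand = list(best)
        cand[rng.randrange(path_len)] = rng.choice(layer_pool)  # mutate
        if fitness(cand) > fitness(best):                       # hill climb
            best = cand
    return best

best_path = evolve()
```

Actual evolutionary merging uses a population-based optimizer (e.g. CMA-ES) rather than single-candidate mutation, but the structure is the same: candidates are layer orderings, not weights, so no gradient training is needed.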
I'm wondering about the possibility of fine-tuning a small model (I have limited resources for a large one), upscaling it in the "best" way possible, and merging it with a large model. The motivation is that inference has less overhead and can run on systems where training such a model is not possible, but I'd still like to incorporate new knowledge into such a model... maybe. Has this been done? What approaches would it take?