They use the same "MistralForCausalLM" architecture and seem to share some configuration parameters, such as intermediate_size, and I was wondering whether it would be possible to merge them.
The tokenizers and vocabulary sizes are radically different between the models (assuming Mistral v0.3 7B), as is the hidden size. I would be surprised if the result were coherent.
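As a quick sanity check before attempting any merge, you can compare the relevant config fields without downloading weights. This is a minimal sketch; the second model ID is a placeholder for whichever checkpoint you want to merge with:

```python
from transformers import AutoConfig

# Load only the configs (no weights) to compare architecture hyperparameters.
cfg_a = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.3")
cfg_b = AutoConfig.from_pretrained("other-org/other-mistral-model")  # placeholder ID

# Fields that must match for a naive weight merge to even be shape-compatible.
fields = (
    "vocab_size",
    "hidden_size",
    "intermediate_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
)
for field in fields:
    a, b = getattr(cfg_a, field, None), getattr(cfg_b, field, None)
    print(f"{field}: {a} vs {b} {'(match)' if a == b else '(MISMATCH)'}")
```

If vocab_size or hidden_size mismatch, the embedding and projection matrices have different shapes, so standard weight-averaging merges won't apply directly.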