How to integrate to transformers-based mixtral

databricks / megablocks

Apache License 2.0

1.11k stars 154 forks source link

Hi, this is awesome work. I'm wondering if there is a minimal way to integrate megablocks into transformers codebase for the mixtral architecture?

Would simply replacing the MixtralSparseMoeBlock with dmoe.dMoE with proper configuration works?

# from transformers 

class MixtralDecoderLayer(nn.Module):
    def __init__(self, config: MixtralConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = MIXTRAL_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        self.block_sparse_moe = MixtralSparseMoeBlock(config)
        ....

Thanks!

databricks / megablocks

How to integrate to transformers-based mixtral #84