Open flaviusburca opened 3 weeks ago
Hi! If the model you mean is CohereForAI/c4ai-command-r-v01, we believe it's possible. It uses standard RoPE. We took a quick look at its implementation in Hugging Face's Transformers library, and it looks very similar to Llama's. You can use our Llama implementation as a reference when modifying Cohere's code.
One thing that could matter is that CohereForAI/c4ai-command-r-v01 uses a very large RoPE theta (8,000,000.0), much larger than most other models use. This may cause the empirical rule for selecting good hyperparameters (group size, neighbor window) to fail, so you may need to try several combinations to find one that works well.
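To give a rough feel for why the large theta might matter, here is a minimal sketch that compares standard RoPE inverse frequencies for the common theta of 10,000 with those for 8,000,000, and then lists a small grid of (group size, neighbor window) pairs to try. The head dimension of 128 and the candidate values are assumptions for illustration only; they are not taken from the Cohere config or from our code, and the helper names are hypothetical.

```python
import math
from itertools import product


def rope_inv_freq(theta, head_dim):
    """Standard RoPE inverse frequencies: theta^(-2i/d) for i in [0, d/2)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(theta, head_dim):
    """Wavelength (in token positions) of the slowest-rotating dimension."""
    return 2 * math.pi / min(rope_inv_freq(theta, head_dim))


HEAD_DIM = 128  # assumption: check the model's config for the real value

for theta in (10_000.0, 8_000_000.0):
    wavelength = longest_wavelength(theta, HEAD_DIM)
    print(f"theta={theta:>11,.0f}  longest wavelength ~ {wavelength:,.0f} positions")

# Hypothetical candidate values; the usual rule of thumb may not carry over,
# so each pair would need to be evaluated on your own long-context task.
group_sizes = [4, 8, 16]
neighbor_windows = [512, 1024, 2048]
candidates = list(product(group_sizes, neighbor_windows))
for group_size, neighbor_window in candidates:
    print(f"try group_size={group_size}, neighbor_window={neighbor_window}")
```

With a larger theta the low-frequency dimensions rotate much more slowly, so relative positions are encoded on a different scale than in Llama-style models. That is one plausible reason the usual hyperparameter choices may not transfer directly.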
Is it possible to adapt this to Cohere Command-R models?