Closed 5joon2 closed 3 months ago
Yes, this repo is compatible with other LLMs; I have tested Llama & GPT-J, and I expect Qwen to work too.
I can't find a Llama-architecture-based GRIT model class (one that uses bidirectional attention) in the `grit/scripts` directory. Could you share yours?
I think you can just use the Mistral scripts but change `MistralDecoderLayer` to `LlamaDecoderLayer` here
https://github.com/ContextualAI/gritlm/blob/f0c3820e9dde0ea2beb0c4ede775eeaac3398eda/scripts/configs/config_8gpusfsdp_m7.yml#L14
& `mistralai/Mistral-7B-v0.1` to `meta-llama/Llama-2-7b-hf` here
https://github.com/ContextualAI/gritlm/blob/f0c3820e9dde0ea2beb0c4ede775eeaac3398eda/scripts/training/train_gritlm_7b.sh#L54
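Concretely, the config edit could look something like this (a hedged sketch: I'm assuming the linked config line is the accelerate FSDP wrapping option; check the linked line for the exact key name):

```yaml
# In scripts/configs/config_8gpusfsdp_m7.yml (linked above), swap the
# decoder-layer class used for FSDP wrapping:
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer  # was: MistralDecoderLayer
```

In `scripts/training/train_gritlm_7b.sh`, the model id `mistralai/Mistral-7B-v0.1` likewise becomes `meta-llama/Llama-2-7b-hf`.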
Got it, thank you for the advice. After I run some experiments, I'll share the results 👍
@Muennighoff Is there any reason we couldn't do this for Llama 3 as well?
Is it reasonable to think that the LLM that has the best performance at generating answers would also be the best at retrieving documents after it's been finetuned with the GRIT approach? Or are there other factors I'm not taking into account that play a more important role in embedding generation?
> Is there any reason we couldn't do this for Llama 3 as well?

You can do it with Llama 3.
> Is it reasonable to think that the LLM that has the best performance at generating answers would also be the best at retrieving documents after it's been finetuned with the GRIT approach?

Generally, yes. I'd expect finetuning Llama 3 with GRIT to be better than the current GritLM-7B.
Hi, I am confused about the implementation of bidirectional attention for other LLMs, such as Llama.
@5joon2 Have you implemented the `transformers/models/llama/modeling_llama.py` code? If so, please share it with me. I also found that LLM2Vec has implemented bidirectional attention, but I think `is_causal` is not an available arg for `forward()` in their implementation. I want `is_causal` to be an arg of every `forward()` to control whether bidirectional attention is used, so the model can handle both generation and embedding tasks at the same time.
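To make that intent concrete, here is a minimal, framework-free sketch (not from gritlm or LLM2Vec; `build_attention_mask` is an illustrative name) of what a single `is_causal` switch decides:

```python
def build_attention_mask(seq_len: int, is_causal: bool) -> list[list[int]]:
    """Return a 0/1 mask where mask[i][j] == 1 means position i may attend to j."""
    if is_causal:
        # Generation: token i sees only positions <= i (lower-triangular mask).
        return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    # Embedding: every token sees every other token (bidirectional).
    return [[1] * seq_len for _ in range(seq_len)]

causal = build_attention_mask(4, is_causal=True)
bidir = build_attention_mask(4, is_causal=False)
```

Threading one such flag from the top-level `forward()` down to wherever this mask is built is the whole idea; in real modeling code the mask is a float tensor of `0`/`-inf` rather than 0/1, but the shape of the decision is the same.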
Hi @Muennighoff, based on your response about adapting bidirectional attention to Llama, I am confused. Don't we need to change the code in `transformers/models/llama/modeling_llama.py`?
Yes, in order to use bidirectional attention you also need to change the model's modeling code, as mentioned here: https://github.com/ContextualAI/gritlm/issues/34#issuecomment-2123719509
But if you don't care about bidirectional attention, you can use all the other models as-is.
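For a sense of what that modeling change boils down to: with PyTorch 2.x, the causal-vs-bidirectional decision can often be expressed through the `is_causal` flag of `torch.nn.functional.scaled_dot_product_attention`. A hedged sketch (this is not the repo's actual code; the real attention layers in `modeling_llama.py` additionally handle padding masks, KV caching, and RoPE):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, is_causal: bool = True):
    # q, k, v: (batch, heads, seq, head_dim). With is_causal=True each
    # position attends only to itself and earlier positions (generation);
    # with is_causal=False attention is fully bidirectional (embedding).
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

q = k = v = torch.randn(1, 2, 4, 8)
causal_out = attention(q, k, v, is_causal=True)
bidir_out = attention(q, k, v, is_causal=False)
```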
Thanks @Muennighoff. I compared your `modeling_mistral_gritlm.py` against the vanilla `modeling_mistral.py`, but found more than 20 modifications related to `is_causal` and was totally lost. I am not an expert in transformers; would you mind providing a `modeling_llama_gritlm.py` script, like `modeling_mistral_gritlm.py`, with the `is_causal` arg?
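Most of those 20+ diffs appear to follow one mechanical pattern: an `is_causal` parameter (defaulting to `True`) is added to every `forward()` in the call chain and passed straight down until it reaches the attention mask construction. A framework-free sketch of that pattern (class and method names are illustrative, not the transformers API):

```python
class Attention:
    def forward(self, x, is_causal=True):
        # In real modeling code this is where the causal mask is (or isn't) built.
        return ("causal" if is_causal else "bidirectional", x)

class DecoderLayer:
    def __init__(self):
        self.attn = Attention()

    def forward(self, x, is_causal=True):
        # The layer adds nothing: it just passes the flag down.
        return self.attn.forward(x, is_causal=is_causal)

class Model:
    def __init__(self, n_layers=2):
        self.layers = [DecoderLayer() for _ in range(n_layers)]

    def forward(self, x, is_causal=True):
        mode = None
        for layer in self.layers:
            mode, x = layer.forward(x, is_causal=is_causal)
        return mode, x

mode_gen, _ = Model().forward("tokens", is_causal=True)   # generation
mode_emb, _ = Model().forward("tokens", is_causal=False)  # embedding
```

The many diffs are mostly this plumbing repeated once per `forward()` signature and call site, which is why the modified file looks so different from the vanilla one.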
Nice work! Your team's work is very impressive :)
I have a question: architectures like Llama or Qwen are very similar to Mistral, but differ slightly in some layers. Is this repo's codebase compatible with other pre-trained LLM architectures?