deep-diver opened this issue 6 months ago
choosing a set of sLLMs and the fine-tuning methods (Gemma, LLaMA3?), (LoRA, QLoRA with their specific hyper-params such as alpha, rank, ..)
I think the size of the model shouldn't be bigger than 7~8B. Hence, Gemma 2B, Gemma 7B, and LLaMA3 8B would be ideal. For the hyper-parameter values related to LoRA / QLoRA, we could start small from `rank=4, alpha=8`, then `rank=8, alpha=16`, and so on (see the sketch after this list).
choosing the same physical environment (such as A100 80GB GPUs?)
I know @juyongjiang and @fanwangkath can access an 8 x (A100 80GB) DGX system, and I could get access to a similar one (maybe fewer GPUs, but A100 80GB is possible).
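For concreteness, here is a minimal sketch of the starting pairs above expressed as PEFT `LoraConfig` objects; the `lora_dropout` value and `task_type` are placeholder assumptions, not something we have agreed on yet:

```python
from peft import LoraConfig

# Starting points mentioned above; lora_dropout=0.05 is an assumed placeholder.
candidate_lora_configs = [
    LoraConfig(r=4, lora_alpha=8, lora_dropout=0.05, task_type="CAUSAL_LM"),
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"),
    # larger rank/alpha pairs can be appended as the search expands
]
```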
Good job!! I agree with your roadmap. For the sLLM, how about adding Phi-3 (3.8B), a powerful small-scale LM?
I agree with you; Gemma, Phi, and Llama are okay. You mentioned LoRA, so we are preparing to use PEFT, right?
@deep-diver Yeah, we can access 8 x (A800 80GB), not A100, due to the US export restrictions on China. However, the A800 is similar to the A100, so I think it is okay. Do you think so?
`8 x (A800 80GB)` sounds OK to me :) Also, let's keep Phi-3 (3.8B) in the list too, and let's see if time allows.
> I agree with you; Gemma, Phi, and Llama are okay. You mentioned LoRA, so we are preparing to use PEFT, right?
Yes! I think so
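For reference, a rough sketch of what the PEFT path could look like for the QLoRA case; the base model (Gemma 2B) and the 4-bit NF4 settings below are placeholder assumptions, not the agreed setup:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization (QLoRA-style); the values here are assumed placeholders.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Gemma 2B is only used as an example base model here.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters with one of the candidate rank/alpha pairs discussed above.
peft_model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"),
)
peft_model.print_trainable_parameters()
```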
designing prompts for synthetic data generation (current prompt) and sLLM evaluation (current prompt)
This is another example prompt, from LMSYS's Arena Hard:
system_prompt: "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\"."
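For reference, a rough sketch of how a judge prompt like the one above could be sent to one of the candidate judge models through the OpenAI API; the user-message template and the `judge` helper are illustrative assumptions, not a final evaluation harness:

```python
from openai import OpenAI

# The Arena-Hard style judge system prompt shown above.
JUDGE_SYSTEM_PROMPT = "Please act as an impartial judge ..."

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(user_prompt: str, answer_a: str, answer_b: str) -> str:
    # Pack the user prompt and both answers into a single user message
    # (the exact delimiters below are an assumption).
    user_message = (
        f"<|User Prompt|>\n{user_prompt}\n\n"
        f"<|The Start of Assistant A's Answer|>\n{answer_a}\n"
        f"<|The End of Assistant A's Answer|>\n\n"
        f"<|The Start of Assistant B's Answer|>\n{answer_b}\n"
        f"<|The End of Assistant B's Answer|>"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # one of the judge candidates under discussion
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.0,
    )
    # The response should end with a [[A>B]]-style verdict label.
    return response.choices[0].message.content
```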
@juyongjiang @fanwangkath, also please follow this sample config file that is used with alignment-handbook, and get familiarized with it. The full options can be found here.
We have discussed the following so far. We need to reach a consensus on the following points for work splitting before starting off (I will add ✅ when there is no more discussion):
- host seed datasets on the Hugging Face Dataset Hub with the structure supported by alignment-handbook (need seed prompts from the no_robots dataset by Hugging Face; split each category into separate splits => resulting dataset's `messages` field); a rough sketch of this split is at the end of this comment
- LoRA / QLoRA hyper-parameter search ranges for fine-tuning (`4 <= alpha <= 32`, `8 <= rank <= 64`, `0.05 <= dropout <= 0.5`)
- judge LLMs for evaluation (`gpt-4-turbo-2024-04-09`, `gpt-4-turbo`, `gemini-1.0-pro`, `gemini-1.5-pro`, `claude-3-sonnet-20240229-v1`, `claude-3-haiku-20240307-v1`, ...)

Please leave any comments if I have missed anything. Also, please leave any comments to improve the idea. I will update this main thread as we discuss. Hopefully, we will get into the code implementation and experiments at the end of this week.
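And here is a rough sketch of the seed-dataset split from the first point above; the category-based split names, the target repo id, and the exact no_robots split name are assumptions:

```python
from datasets import load_dataset, DatasetDict

# Split name may differ (e.g. "train_sft") depending on the dataset version.
ds = load_dataset("HuggingFaceH4/no_robots", split="train")

# One split per category so each category can be sampled independently;
# keep only the `messages` field to match the alignment-handbook format.
by_category = DatasetDict({
    cat.lower().replace(" ", "_"): ds.filter(lambda x, c=cat: x["category"] == c)
                                     .select_columns(["messages"])
    for cat in sorted(set(ds["category"]))
})

# by_category.push_to_hub("your-org/no-robots-seed-prompts")  # hypothetical repo id
```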