Roadmap - Githubissues

deep-diver commented 6 months ago

We have discussed the following so far.

Decide which domain
- Math(GSM8k), Code(Stack 2), General Purpose(???)
Start the experiments
- select small fraction of the original dataset (~ x-%)
- fine-tune a model on a growing number of synthetic dataset based on the x % of original dataset
- evaluating the performance of the model to find out the saturation point
- return on investment(ROI) is also an important factor (API cost of service LLM + FT compute resources + FT time)

Need to make a consensus on the following points for work splitting before starting off (I will add ✅ when there is no more discussion):

[X] ~~host seed datasets on Hugging Face Dataset Hub with supported structure by (need messages field)~~seed prompts from no_robots dataset by Hugging Face. Split each categories into separate splits => resulting dataset.
[X] leverage alignment-handbook for fine-tuning
[X] choosing a similar physical environment (A800s, A100s, H100s)
[X] choosing a set of sLLM and the fine-tuning methods (Gemma 2B/7B, LLaMA3 8B, Phi-3 3B), (LoRA, QLoRA with their specific hyper-params - 4 <= alpha <= 32, 8 <= rank <= 64, 0.05 <= dropout <= 0.5)
[X] choosing the same training hyper-param setup (batch size?, if we use QLoRA, FSDP+QLoRA?, ...)
[X] choosing service LLM for synthetic data generation and sLLM evaluation (Gemini, GPT4, Claude3?)
- For the first iteration, let's use Gemini 1.0 Pro.
- I have some credits for GCP(Gemini), OpenAI(GPT4), and AWS(Claude3)
- which service LLM specifically (gpt-4-turbo-2024-04-09, gpt-4-turbo, gemini-1.0-pro, gemini-1.5-pro, claude-3-sonnet-20240229-v1, claude-3-haiku-20240307-v1, ...)
[ ] designing prompts for synthetic data generation(current prompt) and sLLM evaluation(current prompt)

Please leave any comments if I have missed anything. Also, please leave any comments to improve the idea. I will update this main thread as we discuss. Hopefully, we will get into the code implementation and experiments at the end of this week.

deep-diver commented 6 months ago

choosing a set of sLLM and the fine-tuning methods (Gemma, LLaMA3?), (LoRA, QLoRA with their specific hyper-params such as alpha, rank, ..)

I think the size of model shouldn't be bigger than 7~8B. Hence, Gemma 2B, Gemma 7B, LLaMA3 8B would be ideal. For the hyper-parameter values related to LoRA / QLoRA, we could start small from rank=4 and alpha=8, rank=8 and alpha=16, ....

choosing the same physical environment (such as A100 80GB GPUs?)

I know @juyongjiang and @fanwangkath can access to 8 x (A100 80GB) DGX system, and I could get access to the similar (maybe less GPU but A100 80GB is possible)

juyongjiang commented 6 months ago

Good job!! I agree with your roadmap. For the sLLM, how about adding Phi-3 (3.8B), a powerful small-scale LM?

juyongjiang commented 6 months ago

choosing a set of sLLM and the fine-tuning methods (Gemma, LLaMA3?), (LoRA, QLoRA with their specific hyper-params such as alpha, rank, ..)

I think the size of model shouldn't be bigger than 7~8B. Hence, Gemma 2B, Gemma 7B, LLaMA3 8B would be ideal. For the hyper-parameter values related to LoRA / QLoRA, we could start small from rank=4 and alpha=8, rank=8 and alpha=16, ....

choosing the same physical environment (such as A100 80GB GPUs?)

I know @juyongjiang and @fanwangkath can access to 8 x (A100 80GB) DGX system, and I could get access to the similar (maybe less GPU but A100 80GB is possible)

I agree with you, Gemma, Phi, Llama is okay. You mentioned LoRA, so we are preparing to use PEFT, right?

@deep-diver Yeah, we can access 8 x (A800 80GB), not A100 due to the restrictions of the USA for China. However, A800 is similar to A100, I think it is okay. Do you think so?

deep-diver commented 6 months ago

8 x (A800 80GB) sounds OK to me :)

Also, let's keep Phi-3 (3.8B) in the list too, and let's see if time allows.

I agree with you, Gemma, Phi, Llama is okay. You mentioned LoRA, so we are preparing to use PEFT, right?

Yes! I think so

deep-diver commented 6 months ago

designing prompts for synthetic data generation(current prompt) and sLLM evaluation(current prompt)

this is another example prompt from LMSYS's Arena Hard

system_prompt: "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\"."

deep-diver commented 6 months ago

@juyongjiang, @fanwangkath also, please follow this sample config file that is used with alignment-handbook, and and get familiarized with it.

Full options can be found here

deep-diver / llamaduo-spinoff

Roadmap #1