Thanks for your interest in LMFlow! @2003pro Can you please take a look?
Dear authors, I have some questions about the Task Tuning in your paper LMFlow:
- Do you use the training sets of PubMedQA and MedMCQA to get the LLaMA-7/13/30B-LoRA models, and then test on the validation/test sets of PubMedQA and MedMCQA (in-domain) and MedQA-USMLE (out-of-domain)? I'm also curious whether there is a base LLaMA-30B model.
- The task tuning is continuous pretraining, right? And the continuous pretraining is applied to the LoRA parameters rather than via full-parameter fine-tuning; is my understanding correct? If so, will the difference between LoRA tuning and full tuning be large in continuous pretraining?
- In your tuned medical models, do you observe any forgetting problem, i.e., loss of ability on general tasks? During continuous pretraining, do you think adding general data helps alleviate this problem? Can you give me some suggestions on how to add the general data? Thank you very much.
Your understanding of the data usage is correct. We use the training sets from PubMedQA and MedMCQA to do LoRA training of the LLaMA-series models.
For your question about the LLaMA-30B model: there is a LLaMA-30B model for LLaMA-1 (https://arxiv.org/abs/2302.13971), and the model weights can be accessed by submitting an application to Meta: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform
However, LLaMA-2 has not released a 30B checkpoint yet; the closest is Code Llama at 34B.
Yep. Since we do not add diverse instructions to the data from PubMedQA and MedMCQA, we think it is fine to call our training continued pretraining.
LoRA is limited for pretraining from scratch, but if your data scale is not too large (e.g., < 1B tokens), LoRA is also fine.
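For reference, here is a minimal sketch of LoRA-based continued pretraining with Hugging Face transformers and peft. The checkpoint name, rank, and target modules are illustrative assumptions, not the exact LMFlow configuration:

```python
# Illustrative sketch of continued pretraining with LoRA adapters.
# Assumptions: a LLaMA-family checkpoint on the Hugging Face Hub and
# hypothetical hyperparameters (r=8, q/v projections); tune for your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the base model so that only the low-rank adapter weights are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # a common choice for LLaMA attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, train with the plain causal-LM objective on raw domain text
# (e.g., PubMedQA/MedMCQA passages), for instance via transformers.Trainer.
```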
I guess you are talking about the forgetting problem.
Your idea of adding general data is called data replay, which is effective but needs a large amount of data to work, on the order of our 1B tokens.
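As a hedged sketch of what data replay can look like in practice, the snippet below interleaves a general corpus into the domain training stream; the 20% replay ratio is an assumed, illustrative value, not one from the paper:

```python
# Sketch of data replay: mix general-domain documents into the domain
# training stream so the model keeps seeing general data during
# continued pretraining. The replay ratio is an assumed hyperparameter.
import random

def replay_mixture(domain_docs, general_docs, replay_ratio=0.2, seed=0):
    """Yield documents, drawing from `general_docs` with probability
    `replay_ratio` and from `domain_docs` otherwise."""
    rng = random.Random(seed)
    domain_it, general_it = iter(domain_docs), iter(general_docs)
    while True:
        source = general_it if rng.random() < replay_ratio else domain_it
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either stream is exhausted

# Usage with hypothetical corpora:
# for doc in replay_mixture(medical_corpus, general_corpus):
#     train_step(doc)
```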
I suggest an easier solution called model averaging, which means you merge your tuned model with the model that was not trained on your data. This does not require complex data engineering.
For model averaging, you may check these works:
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
- Mitigating the Alignment Tax of RLHF
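To make the idea concrete, here is a minimal sketch of weight-space averaging between a base and a fine-tuned checkpoint. The interpolation weight `alpha` and the checkpoint names are assumptions to tune on a validation set; see the papers above for the full recipes:

```python
# Sketch of model averaging: linearly interpolate fine-tuned weights with
# the original base weights. `alpha` and the checkpoint names are
# illustrative assumptions, not values from the cited papers.
from transformers import AutoModelForCausalLM

def average_models(base_name, tuned_name, alpha=0.5):
    """Return a model whose weights are alpha * tuned + (1 - alpha) * base."""
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name)
    merged = tuned.state_dict()
    for name, base_param in base.state_dict().items():
        if merged[name].dtype.is_floating_point:  # skip integer buffers
            merged[name] = alpha * merged[name] + (1.0 - alpha) * base_param
    tuned.load_state_dict(merged)
    return tuned

# Usage with hypothetical checkpoints (merge LoRA adapters into the tuned
# model first if you trained with LoRA):
# merged = average_models("huggyllama/llama-7b", "your-org/llama-7b-medical")
```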
Thanks very much for your clear explanation; your suggestions are helpful to me. BTW, I want to confirm: does "like our 1B tokens to get it to work" refer to your LMFlow Dataset (http://lmflow.org:5000/lmflow_data.tar.gz)?
I just asked @2003pro; we think it refers more to a general medical dataset, like PubMed Central in the Pile (https://github.com/EleutherAI/pile-pubmedcentral). Hope this information is helpful 😄
Thank you very much. I will look into it.