FiveTechSoft / tinyMedical

TinyLLama trained with medical dataset and saved as GGUF file
MIT License

How much medical domain knowledge is retained in the model? #1

Closed: jaykchen closed this issue 8 months ago

jaykchen commented 8 months ago

@FiveTechSoft based on my understanding of the training script https://github.com/FiveTechSoft/tinyMedical/blob/770b5412f8704020bcbf93d1c5e449cb645e9ab9/train.py#L47

This is a LoRA fine-tuning of the TinyLlama model on medical material. I did quite a bit of online searching over the last few days, and my impression is that fine-tuning like this helps the base model with logical reasoning and response style on new tasks, but it doesn't really infuse new domain knowledge into the base model. Do you have a similar belief? Are you planning further work to infuse domain knowledge into the base model?

I'm searching for an efficient way to infuse new domain knowledge into a base model without full pre-training, but one that changes the base model more deeply than LoRA does.
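For context, a LoRA fine-tune of this kind usually boils down to something like the sketch below (a hypothetical illustration using Hugging Face peft, not the repo's actual train.py; the model id and hyperparameters are placeholders):

```python
# Minimal LoRA fine-tuning sketch (hypothetical, not the repo's train.py).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA only adds small low-rank adapter matrices to selected projections;
# the base weights stay frozen during training.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Since only the small adapter matrices are trained and the base weights stay frozen, I suspect this mostly shapes style and task behavior rather than adding new facts.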

FiveTechSoft commented 8 months ago

Dear Ji Chen,

How would you propose to enhance it? We do appreciate your suggestions.

Many thanks,

-- Antonio Linares, www.fivetechsoft.com

jaykchen commented 8 months ago

This is the best knowledge I could find for now: https://www.reddit.com/r/LocalLLaMA/comments/14vnfh2/my_experience_on_starting_with_fine_tuning_llms/

The key learnings:

Directly training a pre-trained model with fewer than 50,000 data rows is more or less useless. I would only consider directly training a model with more than 100k data rows for a 13B model, and at least 1 million for a 65B model.

With smaller datasets, it is more efficient to train a LoRA or QLoRA.

I prefer to train a 4-bit QLoRA on a 30B model rather than an fp16 LoRA on a 13B model (roughly the same hardware requirements, but the results with the 4-bit 30B model are superior to the 13B fp16 model).
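For what it's worth, "4-bit QLoRA" here typically means loading the base model in 4-bit with bitsandbytes and training LoRA adapters on top of the quantized weights, roughly like this (a hypothetical sketch, not from the Reddit post or this repo; the model id is a placeholder):

```python
# Hypothetical QLoRA sketch: base weights quantized to 4-bit, LoRA adapters trained on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",  # placeholder; any causal LM works here
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```

The memory savings come from the 4-bit base weights; only the fp16/bf16 adapter weights receive gradients, which is why a 30B QLoRA fits in roughly the same hardware budget as a 13B fp16 LoRA.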

My best guess: 1) LoRA/QLoRA doesn't infuse domain knowledge; 2) you need training material formatted like pre-training data; 3) you need a lot of material: if you have, say, 500 samples, use some technique to create variations and expand them 100x or more, then do pre-training-style training (see the sketch below).
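By "pre-training-style training" I mean full-parameter causal-LM training on raw domain text rather than a LoRA adapter. A minimal sketch of what that could look like (file name, model id and hyperparameters are all assumptions):

```python
# Hypothetical continued pre-training sketch: update ALL weights on raw domain text,
# unlike LoRA/QLoRA, which only trains small adapter matrices.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # placeholder model id
tok = AutoTokenizer.from_pretrained(base)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# "medical_corpus.txt" is a placeholder: raw domain text, one passage per line.
ds = load_dataset("text", data_files={"train": "medical_corpus.txt"})
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1),
    train_dataset=ds["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
)
trainer.train()
```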

jaykchen commented 8 months ago

I think this is the best answer I can find before a new breakthrough in this field: https://arxiv.org/abs/2310.08975

Here is the core of the paper: ChatKBQA is a generate-then-retrieve framework for knowledge base question answering (KBQA) built on fine-tuned open-source LLMs. First, an open-source LLM is efficiently fine-tuned by instruction tuning on the (natural language question, logical form) pairs in the KBQA dataset. The fine-tuned LLM is then used to convert new natural language questions into corresponding candidate logical forms via semantic parsing. Next, ChatKBQA retrieves the entities and relations in these logical forms at the phrase level and searches for the logical forms that can be executed against the knowledge base once converted to SPARQL. Finally, the resulting SPARQL is executed to obtain the final answer set, yielding interpretable, knowledge-grounded answers to natural language questions.
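In rough pseudocode, the generate-then-retrieve flow looks something like this (my own paraphrase of the paper with stubbed placeholder functions; none of it is the authors' code):

```python
# Rough sketch of ChatKBQA's generate-then-retrieve flow (my paraphrase; all
# function bodies are placeholders, not the authors' implementation).

def generate_logical_form(question: str) -> str:
    """Step 1: a fine-tuned open-source LLM parses the question into a draft logical form."""
    return "(JOIN treated_by [aspirin])"  # stub output for illustration

def ground_entities_and_relations(logical_form: str) -> str:
    """Step 2: phrase-level retrieval swaps the draft's entity/relation mentions
    for actual items from the knowledge base."""
    return logical_form  # stub: would look up [aspirin] and treated_by in the KB

def to_sparql(logical_form: str) -> str:
    """Step 3: convert the grounded logical form into an executable SPARQL query."""
    return "SELECT ?x WHERE { ?x <treated_by> <aspirin> }"  # stub query

def answer(question: str) -> list:
    """Step 4: execute the SPARQL against the KB and return the answer set."""
    sparql = to_sparql(ground_entities_and_relations(generate_logical_form(question)))
    return [f"(would execute) {sparql}"]  # stub: a real system queries the KB here

print(answer("Which diseases are treated by aspirin?"))
```

Note that the domain knowledge lives in the knowledge base, not in the model weights; the fine-tune only teaches the LLM to produce logical forms, which is exactly the kind of behavior-shaping that LoRA handles well.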