
Llama-2 13b does not learn knowledge from training data #927

Open Eichhof opened 9 months ago

Eichhof commented 9 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

The model should learn the information/knowledge from the training data. From the example training data below it should learn information about Leon Klein (age etc.).

Current behaviour

When conversing with the fine-tuned chatbot, it does not know anything about Leon Klein.

Steps to reproduce

I'm fine-tuning Llama-2 13b with Axolotl. My dataset for fine-tuning looks as follows:

# dataset.jsonl 
{"text": "### Human: What can you tell me about Leon Klein### Chatbot: He is working in the US for a small start-up company.### Human: How old is he?### Chatbot: He is 40 years old."} 
{"text": "### Human: Who's coming tonight?### Chatbot: No one, it's literally Monday."} 
... 

This is just an example of two conversations. My training data consists of around 4200 conversations, each with 20-40 turns. The conversations contain facts about people, and the facts repeat over multiple conversations (with different wordings, of course). The fine-tuned model picks up the data format (i.e., it also outputs turns separated by ###), but it does not remember the specific information.
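As a quick sanity check on my side (this is not part of axolotl, just a rough sketch assuming the file is named `dataset.jsonl` as in the snippet above), I count how many conversations actually mention a given entity:

```python
# Rough sanity check: how many of the ~4200 conversations mention a given entity?
# Assumes the training file is named dataset.jsonl as in the snippet above.
import json

entity = "Leon Klein"
hits, total = 0, 0
with open("dataset.jsonl") as f:
    for line in f:
        total += 1
        if entity.lower() in json.loads(line)["text"].lower():
            hits += 1

print(f"{hits}/{total} conversations mention {entity!r}")
```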

Config yaml

base_model: meta-llama/Llama-2-13b-hf
base_model_config: meta-llama/Llama-2-13b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: Eichhof/einstein-llama
    type: completion
    field: text
dataset_prepared_path: last_run_prepared
hub_model_id: Eichhof/Llama-2-13b-hf-Einstein
val_set_size: 0.01
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: "llama-einstein"
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model: "checkpoint"

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 37
eval_steps: 0.05
eval_table_size:
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
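To verify what the model actually sees during training, one option is to decode a sample from the prepared dataset. This is only a sketch: it assumes a previous run wrote the tokenized data into a hashed subfolder of `last_run_prepared` and that the saved object is a single dataset split; the exact layout may differ per axolotl version.

```python
# Sketch: decode one prepared sample to check the text and the loss mask.
# Assumes a prior axolotl run wrote the tokenized dataset to last_run_prepared/<hash>.
import glob
from datasets import load_from_disk
from transformers import AutoTokenizer

prepared_dir = glob.glob("last_run_prepared/*")[0]
ds = load_from_disk(prepared_dir)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

sample = ds[0]
print(tokenizer.decode(sample["input_ids"]))  # should show the "### Human ... ### Chatbot ..." text
print(sum(l != -100 for l in sample["labels"]), "of", len(sample["labels"]), "tokens contribute to the loss")
```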

Possible solution

I have also tried a learning rate of 0.00018 for 5 epochs with a constant learning rate scheduler, as well as a learning rate of 0.001. Neither solved the issue.

Could the dataset format be the issue? Would the alpaca format solve it?
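If the raw completion format is the problem, I could convert the data into a conversation format that axolotl templates and masks itself (e.g. `type: sharegpt`). A rough, untested sketch of such a conversion (file names are placeholders, and the split assumes turns strictly alternate):

```python
# Sketch: convert the raw "### Human:/### Chatbot:" strings into sharegpt-style
# conversations: {"conversations": [{"from": "human"/"gpt", "value": ...}, ...]}.
import json
import re

turn_re = re.compile(r"### (Human|Chatbot): ?")

with open("dataset.jsonl") as src, open("dataset_sharegpt.jsonl", "w") as dst:
    for line in src:
        text = json.loads(line)["text"]
        parts = turn_re.split(text)  # ["", "Human", "...", "Chatbot", "...", ...]
        turns = [
            {"from": "human" if role == "Human" else "gpt", "value": content.strip()}
            for role, content in zip(parts[1::2], parts[2::2])
        ]
        dst.write(json.dumps({"conversations": turns}) + "\n")
```

The `datasets` entry in the config would then use `type: sharegpt` instead of `type: completion`, though I don't know whether that alone would change anything about fact retention.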

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main/575a082

Acknowledgements

veezbo commented 9 months ago

Please see this article for an explanation of why you should not expect to see knowledge gain from fine-tuning: https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts

Specifically, see the "Juliet was in love with someone" example in the article.

Eichhof commented 8 months ago

Thank you for the link. I read it and it makes sense. Before switching to Llama-2, I was fine-tuning GPT-J, and GPT-J did learn the knowledge from the training data (the same training data I used for Llama-2). Why is there such a huge difference between Llama-2 and GPT-J in terms of learning new knowledge?

tmm1 commented 8 months ago

Can you share what benchmark you are using to measure whether knowledge is learned? What were your results between the two models?

Maybe there is some difference between the two models' architectures or their implementations?

Eichhof commented 8 months ago

I'm not using a benchmark. My training dataset contains knowledge about certain persons and companies (knowledge that is not present in the vanilla Llama-2 and GPT-J models). I then tested this knowledge by generating responses (in a conversation) from the models. GPT-J can answer questions about the persons and companies (its answers are similar to the training data), but Llama-2 does not know them (it gives answers such as that it does not know this person or company).
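For reference, this is roughly how I probe the fine-tuned model. It's a minimal sketch, assuming the QLoRA adapter was saved to `./qlora-out`, the same "### Human:/### Chatbot:" template as in training, and fact keywords taken from the example conversation above; the same probe can be run against the GPT-J and Llama-2 checkpoints for comparison.

```python
# Minimal fact-recall probe (not a benchmark). Assumes the LoRA adapter from the
# config above was saved to ./qlora-out and uses the training prompt template.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "./qlora-out")
model.eval()

prompt = "### Human: What can you tell me about Leon Klein### Chatbot:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=80, do_sample=False)

answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
print("mentions known facts:", any(k in answer for k in ["start-up", "40 years"]))
```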