deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

Trying to finetune DeepSeek-Coder on custom Dataset #137

Closed A-Janj closed 6 months ago

A-Janj commented 6 months ago

[screenshot: error output from the run]

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format: [{"instruction": "", "output": ""}, {"instruction": "", "output": ""}]

The screenshot above shows the error I get when running the finetune_deepseekcoder.py code.

DejianYang commented 6 months ago

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Please check whether you have enough CPU memory.
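(Editorial aside, not part of the original reply: an exit status of -9 usually means the process was killed with SIGKILL, most often by the Linux out-of-memory killer when host RAM runs out. A quick, hypothetical way to check available host RAM from Python, assuming psutil is installed, is:)

import psutil  # third-party; pip install psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")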

A-Janj commented 6 months ago

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Please check whether you have enough CPU memory.

I have 64 GB of RAM (CPU memory). How much does DeepSeek need for finetuning?

DejianYang commented 6 months ago

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Please check whether you have enough CPU memory.

I have 64 GB of RAM (CPU memory). How much does DeepSeek need for finetuning?

I don't have an exact number for the RAM that finetuning requires. The finetune script uses DeepSpeed, which needs a lot of RAM for CPU offload. If you have enough GPU memory, you can try a different DeepSpeed config to reduce the CPU memory used. Maybe try our 1.3B model first.

A-Janj commented 6 months ago

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Please check whether you have enough CPU memory.

I have 64 GB of RAM (CPU memory). How much does DeepSeek need for finetuning?

I don't have an exact number for the RAM that finetuning requires. The finetune script uses DeepSpeed, which needs a lot of RAM for CPU offload. If you have enough GPU memory, you can try a different DeepSpeed config to reduce the CPU memory used. Maybe try our 1.3B model first.

Can you give me an idea of how much GPU VRAM I would need if I have 64 GB of system RAM? Also, should I set the CPU offload parameters to false instead of true in ds_config_zero3.json, like this:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": false
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": false
    },
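(Editorial sketch, not an answer from the maintainers: pin_memory only controls whether the offloaded tensors use page-locked host memory; to turn CPU offload off entirely, DeepSpeed expects the offload device to be set to "none". One hypothetical way to pass such a config to the Hugging Face Trainer, with placeholder paths and batch sizes, and assuming the GPU has enough memory to hold the optimizer states:)

from transformers import TrainingArguments

# Sketch of a ZeRO-3 config with CPU offload disabled; only works if everything fits in GPU memory.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "none"},  # keep optimizer states on the GPU
        "offload_param": {"device": "none"},      # keep parameters on the GPU
    },
    "bf16": {"enabled": True},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",        # placeholder
    deepspeed=ds_config,     # TrainingArguments accepts a dict as well as a path to a json file
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)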

A-Janj commented 6 months ago

[Screenshot from 2024-03-13: inference code and model output]

@DejianYang, can you help me?

I was able to finetune the 6.7b parameter model using 1 x H100 80GB SXM5 (80 GB VRAM, 251 GB RAM, 24 vCPU). The finetune script created files in the given output folder, but the model.safetensors file is only 539.6 kB.

Running inference from the finetuned directory first gave the following error:

"RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM: size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32256, 2048]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method."

After adding the ignore_mismatched_sizes=True argument to the from_pretrained call, the model produces gibberish. You can see the inference code and the output in the screenshot.

Am I missing something?

DejianYang commented 6 months ago

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32256, 2048]).

https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading
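(Editorial sketch following the linked docs, not the maintainers' exact instructions: under ZeRO-3 each rank's checkpoint only holds a partition of the parameters, which is why model.embed_tokens.weight comes back with shape torch.Size([0]). The sharded checkpoint can be consolidated with DeepSpeed's zero_to_fp32 helpers; paths and the base model name below are placeholders.)

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
from transformers import AutoModelForCausalLM

# "checkpoint_dir" is the folder the Trainer/DeepSpeed wrote (it contains a "latest" file
# and a global_step* subfolder); substitute the base model you finetuned from.
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = load_state_dict_from_zero_checkpoint(model, "checkpoint_dir")  # gathers the full fp32 state dict on CPU
model.save_pretrained("consolidated_model")  # now writes full-size weight files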

seancarmod-y commented 6 months ago

@DejianYang I'm looking to fine-tune deepseek-coder-1.3b-base. Ideally, I'd like to do it using the Hugging Face libraries, as I have done for TinyLlama in the attached file. Is this possible, or do I need to use finetune_deepseekcoder.py (and can that even be used for the 1.3b model)? fine_tune_tiny_llama.txt

DejianYang commented 6 months ago

fine_tune_tiny_llama.txt

Yes, you can use that script to finetune our model, just as you would other Llama-family models.
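(Editorial sketch, not an official recipe: a minimal Hugging Face Trainer loop for deepseek-coder-1.3b-base, mirroring the TinyLlama-style setup with a csv that has a 'text' column. File names and hyperparameters are placeholders.)

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)

dataset = load_dataset("csv", data_files="train.csv", split="train")  # placeholder file with a "text" column

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective, no masking

args = TrainingArguments(
    output_dir="deepseek-coder-1.3b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
trainer.save_model(args.output_dir)
tokenizer.save_pretrained(args.output_dir)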

A-Janj commented 6 months ago

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32256, 2048]).

https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading

Thank you so much for your time and help. This helped me understand the different DeepSpeed config files and ZeRO stages. What resolved my issue was adding trainer.save_model() and the tokenizer save to the finetune script, as below:

trainer.train()
trainer.save_model("SaveOutputFolder")                 # write the finetuned model weights to the output folder
trainer.tokenizer.save_pretrained("SaveOutputFolder")  # save the tokenizer alongside the model
trainer.save_state()                                   # save the Trainer state (global step, logs, etc.)
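(Editorial aside, not part of the original comment: with ZeRO-3, trainer.save_model() can only write consolidated weights when weight gathering on save is enabled in the DeepSpeed config; otherwise the checkpoint stays sharded and needs the zero_to_fp32 conversion mentioned above. On a hypothetical ds_config dict, the relevant key is:)

ds_config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = True  # let save_model() emit full 16-bit weights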

Thanks once again so much for all your help.

seancarmod-y commented 6 months ago

fine_tune_tiny_llama.txt

Yes, you can use that script to finetune our model, just as you would other Llama-family models.

Hi, that's great, thanks. I can't seem to find documentation on how to format the custom dataset. For both Llama 2 and TinyLlama I have formatted it as a csv with a 'text' column. Is there a similar format that I can follow for deepseek-coder-1.3b? The format of each row is:

Llama 2: [INST] prompt [/INST] Llama2 answer </s>
TinyLlama: <|user|> prompt <|assistant|> Tinyllama answer

I then load the dataset like this:

from datasets import load_dataset
dataset = load_dataset(dataset_folder, split="train")

LarkLeeOnePiece commented 6 months ago

Sorry, were you able to solve the mismatched_sizes problem after adding

trainer.train()
trainer.save_model("SaveOutputFolder")
trainer.tokenizer.save_pretrained("SaveOutputFolder")
trainer.save_state()

I met the same problem, could you help me out? Do you use AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained to load the model and tokenizer?

A-Janj commented 5 months ago

Sorry, were you able to solve the mismatched_sizes problem after adding

trainer.train()
trainer.save_model("SaveOutputFolder")
trainer.tokenizer.save_pretrained("SaveOutputFolder")
trainer.save_state()

I met the same problem, could you help me out? Do you use AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained to load the model and tokenizer?

Yes, that resolved it for me. I then used the following code for inference (it loads the model and tokenizer with AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("SaveOutputFolder", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "SaveOutputFolder",
    ignore_mismatched_sizes=True,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=128,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    temperature=0.9,          # adjust as needed
    repetition_penalty=1.2,   # penalize repeated tokens
    no_repeat_ngram_size=2,   # prevent repeating n-grams
    num_return_sequences=1,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

A-Janj commented 5 months ago

fine_tune_tiny_llama.txt

Yes, you can use that script to finetune our model, just as you would other Llama-family models.

Hi, that's great, thanks. I can't seem to find documentation on how to format the custom dataset. For both Llama 2 and TinyLlama I have formatted it as a csv with a 'text' column. Is there a similar format that I can follow for deepseek-coder-1.3b? The format of each row is:

Llama 2: [INST] prompt [/INST] Llama2 answer </s>
TinyLlama: <|user|> prompt <|assistant|> Tinyllama answer

I then load the dataset like this:

from datasets import load_dataset
dataset = load_dataset(dataset_folder, split="train")

This is the sample dataset format for DeepSeek-Coder: https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1

The .json file should contain data like:

[
  {
    "instruction": "give python syntax in a Nutshell",
    "output": "Row1"
  },
  {
    "instruction": "Print the content in between the curly brackets to the template output",
    "output": "Row2"
  },
  {
    "instruction": "Statements of the Jinja language that do not have an output.",
    "output": "Row3"
  }
]
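(Editorial sketch, not part of the original reply: one way to load that instruction/output JSON with the datasets library; the file name is a placeholder.)

from datasets import load_dataset

dataset = load_dataset("json", data_files="train.json", split="train")
print(dataset[0]["instruction"], dataset[0]["output"])  # each record has "instruction" and "output" fields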