bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124

Training Stuck with Unexpected Number of Epochs (fine-tuning StarCoder) #13

Closed tclxmeng-jia closed 1 year ago

tclxmeng-jia commented 1 year ago

Description

I'm encountering an issue while fine-tuning StarCoder using the provided script. Training seems to be stuck, and the reported number of epochs is unexpected. Here's the relevant log output:

Running training
  Num examples = 4,000
  Num Epochs = 9,223,372,036,854,775,807
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 1,000
  Number of trainable parameters = 35,553,280

UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
{'loss': 1.1266, 'learning_rate': 0.0001, 'epoch': 0.0}

Running Evaluation
  Num examples: Unknown
  Batch size = 1

It appears that the training process is stuck and the number of epochs is unexpectedly high. I'm running the script with the following parameters:

python finetune.py \
    --model_path="bigcode/starcoder" \
    --dataset_name="ArmelR/guanaco-commits" \
    --seq_length 2048 \
    --max_steps 1000 \
    --batch_size 1 \
    --input_column_name="content" \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type="cosine" \
    --log_freq 1 \
    --eval_freq 1 \
    --num_warmup_steps 5 \
    --save_freq 5 \
    --weight_decay 0.05 \
    --output_dir="/root/autodl-fs/checkpoints" \
    --no_fp16 \
    --streaming

I'm not sure why the number of epochs is unexpectedly high and why the training process is not progressing as expected. Could you please help me understand what might be causing this issue?
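
One quick thing worth verifying before anything else is whether PyTorch can actually see a CUDA device in this environment; on CPU, a model of StarCoder's size trains so slowly that a run can easily look stuck. Below is a minimal sketch of that check, assuming torch is installed in the same environment used to launch finetune.py:

import torch

# Minimal environment check: does PyTorch see a CUDA-capable GPU?
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))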

ArmelRandy commented 1 year ago

Hi @tclxmeng-jia. About the unexpectedly high number of epochs: don't worry about it. It happens when transformers cannot compute the number of epochs precisely, which is the case here because you loaded the dataset in streaming mode. The Trainer does not know in advance how many samples the dataset contains, so it cannot compute a precise number of epochs.

Also, the streaming argument was intended for large datasets; in that case the validation set is built by taking the first size_valid_set samples of the stream. You should not use it here, because your dataset is fairly small and already split into train and test. Besides that, do you have a GPU? Launching the script with plain python is not advisable here. And finally, I am not sure that this dataset has a content column.
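
For reference, 9,223,372,036,854,775,807 is simply sys.maxsize (the largest signed 64-bit integer), which appears to be reported as a placeholder epoch count when max_steps is combined with a streaming dataset of unknown length; the short sketch below only illustrates the number itself, and the Trainer-internals reading is an assumption based on the log above:

import sys

# The "unexpected" epoch count in the log is just the largest signed
# 64-bit integer, used as a stand-in when the real epoch count is unknown.
print(sys.maxsize)                                  # 9223372036854775807
print(sys.maxsize == 9_223_372_036_854_775_807)     # True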

Can you check all these things?
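
A hedged sketch of how the dataset-related points could be checked with the datasets library; the dataset name and split are taken from the command above, the default configuration is assumed, and access to the Hugging Face Hub is required:

from datasets import load_dataset

# Stream a single example to see which columns actually exist
# (e.g. whether there is a "content" column).
stream = load_dataset("ArmelR/guanaco-commits", split="train", streaming=True)
first = next(iter(stream))
print(sorted(first.keys()))

# A regular (non-streaming) load shows the split structure and sizes,
# which also tells you whether streaming is needed at all.
ds = load_dataset("ArmelR/guanaco-commits")
print(ds)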

tclxmeng-jia commented 1 year ago

@ArmelRandy Thank you for your guidance. I will make the necessary adjustments and conduct a thorough review based on your suggestions. It appears I overlooked utilizing the GPU, and I appreciate your reminder. Once again, thank you for your assistance.