Closed tclxmeng-jia closed 1 year ago
Hi @tclxmeng-jia. About the unexpectedly high number of epochs: don't worry, this happens when `transformers` cannot precisely compute the number of epochs. That is the case here because you loaded the dataset in `streaming` mode. The `Trainer` does not know in advance how many samples the dataset contains, so it cannot compute the exact number of epochs. Moreover, the `streaming` argument is intended for large datasets, and in that case the validation set is generated by taking the first `size_valid_set` samples. You should not use it here, because I think you are using a fairly small dataset that is already split into `train` and `test`. Also, do you have a GPU? Launching your code with plain `python` is not advisable here. Finally, I am not sure that this dataset has a `content` column.

Can you check all these things?
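The distinction above can be sketched without any Hugging Face libraries at all: a downloaded (map-style) dataset exposes its length, while a streamed dataset is only iterable, so a trainer has nothing to divide by when computing epochs. A minimal plain-Python sketch:

```python
# Minimal sketch of why streaming hides the dataset length.
map_style = ["sample1", "sample2", "sample3"]  # like a fully downloaded Dataset
streamed = (s for s in map_style)              # like streaming=True: just an iterator

print(len(map_style))  # 3 -- steps per epoch can be computed
try:
    len(streamed)      # generators have no length
except TypeError:
    print("length unknown -> epochs cannot be computed")
```

With `streaming=True` you would instead bound training with an explicit step budget (e.g. `max_steps`), which is why the script's 1,000 optimization steps still run to completion.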
@ArmelRandy Thank you for your guidance. I will make the necessary adjustments and conduct a thorough review based on your suggestions. It appears I overlooked utilizing the GPU, and I appreciate your reminder. Once again, thank you for your assistance.
Description
I'm encountering an issue while fine-tuning `starcoder` using the provided script. The training seems to be stuck, and I'm getting an unexpected number of epochs. Here's the log output:

```
Running training
  Num examples = 4,000
  Num Epochs = 9,223,372,036,854,775,807
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 1,000
  Number of trainable parameters = 35,553,280
UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
{'loss': 1.1266, 'learning_rate': 0.0001, 'epoch': 0.0}
Running Evaluation
  Num examples: Unknown
  Batch size = 1
```
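A side note on the suspicious epoch count: it is exactly the largest signed 64-bit integer, which suggests a sentinel value rather than a real schedule, used when the trainer cannot determine the dataset length (as with a streamed dataset). This is easy to verify in plain Python:

```python
import sys

# The "Num Epochs" from the log above.
logged_epochs = 9_223_372_036_854_775_807

# It equals 2**63 - 1, the maximum signed 64-bit integer
# (and sys.maxsize on 64-bit CPython builds).
print(logged_epochs == 2**63 - 1)   # True
print(logged_epochs == sys.maxsize)  # True on 64-bit platforms
```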
I'm running the script with the following parameters:

I'm not sure why the number of epochs is so high or why the training process is not progressing as expected. Could you please help me understand what might be causing this issue?