James-QiuHaoran / LLM-serving-with-proxy-models

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Apache License 2.0

It seems the training is stuck #4

Closed: Aston-zeal closed this issue 5 months ago

Aston-zeal commented 5 months ago

[screenshot of training output attached]

It seems the training is stuck.

James-QiuHaoran commented 5 months ago

Could you post the command you ran that caused the issue?

saeid93 commented 5 months ago

@Aston-zeal I thought the same at first, but after looking at the code and putting a counter on the training-loop minibatches, I saw that it is actually progressing. Since the tqdm progress bar is on the epochs, it takes a long time before you see it move. You are probably hitting the same thing; adding a counter (or a tqdm bar) inside the inner training loop lets you observe the progress.
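
For anyone hitting the same confusion, here is a minimal sketch of what I mean, assuming a standard PyTorch training loop; the model, dataset, and loader below are toy stand-ins, not the repo's actual code:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Toy stand-ins so the loop runs end to end; substitute the repo's
# actual model and dataset here.
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = DataLoader(TensorDataset(torch.randn(512, 16),
                                        torch.randn(512, 1)),
                          batch_size=32)

for epoch in range(3):
    # Put tqdm (or a plain counter) on the inner minibatch loop rather than
    # the epoch loop, so progress is visible every step, not once per epoch.
    for step, (inputs, labels) in enumerate(tqdm(train_loader)):
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            print(f"epoch {epoch}, step {step}, loss {loss.item():.4f}")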

Aston-zeal commented 5 months ago

Yes, it's just that the training is too slow; I'm running it on the CPU.

James-QiuHaoran commented 5 months ago

Yeah, CPU would be too slow. You can think of it as a fine-tuning process for a BERT model.

You can also try to limit the data size by adding the --data_size flag.

For example:

# data generation -> 1K data (--data_size is in units of 1K)
python preprocess_dataset.py --task_type 0 --data_size 1

# predictor training (regression with MSE loss)
python latency_prediction.py --task_type 0 --data_size 1

# predictor training (regression with L1 loss)
python latency_prediction.py --task_type 0 --l1_loss --data_size 1
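
If you do have a GPU, the usual PyTorch pattern for moving the training there looks like this (a generic sketch, not necessarily how the repo's scripts select a device; the model and tensors are toy stand-ins):

import torch
from torch import nn

# Generic device selection: use the GPU when available, else fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(16, 1).to(device)       # move the model once
inputs = torch.randn(32, 16).to(device)   # move each batch alongside it
outputs = model(inputs)
print(outputs.shape, outputs.device)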