PiLab-CAU / ComputerVision-2401

Computer Vision Course 2024-01
Apache License 2.0

[Lecture10][0509] Any personal tips for finding where the high loss came from? #26

Closed choigiheon closed 1 day ago

choigiheon commented 1 month ago

Hello, @yjyoo3312. I have a question about training the model.

When training an AI model, many factors besides the model architecture determine its performance. It can be challenging to identify whether a high loss is due to issues with the architecture itself or with the training setup (including the optimizer, hyperparameters, learning rate scheduler, or training duration).

Very often I can't decide whether the architecture or the training setup is the problem.

Are there any personal tips for finding where the high loss came from?

Thank you!!

yjyoo3312 commented 1 month ago

@choigiheon Thank you for the question, which is very important.

In my experience, a well-configured dataset and standard training settings, such as a batch size between 16 and 64 and a learning rate between 1e-3 and 1e-5 with the Adam optimizer, usually (at least) allow models to fit the training data, leading to converging training losses. If the training loss converges, the test loss typically decreases too, assuming the dataset is well-configured. (If it doesn't, that tells us our approach is far from the problem we should be solving.)
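A minimal sketch of those "standard settings" in PyTorch, assuming a toy synthetic task (the model, data, and epoch count below are my own illustration, not from the course). The point is the baseline: with a sane batch size and Adam learning rate, even a small model should drive the training loss down on clean data.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy binary task: the label is whether the feature sum is positive.
X = torch.randn(256, 10)
y = (X.sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # batch size in the 16-64 range

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr in the 1e-3 to 1e-5 range
criterion = nn.CrossEntropyLoss()

epoch_means = []  # mean training loss per epoch, to see whether it converges
for epoch in range(5):
    batch_losses = []
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        batch_losses.append(loss.item())
    epoch_means.append(sum(batch_losses) / len(batch_losses))
```

If a setup like this cannot fit even a simple dataset, the problem is almost certainly not the architecture.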

If the training loss doesn't converge, it's important to check the training dataset for label noise or potential dataloader bugs. If the training loss improves but the test loss doesn't, examine the test dataloader and dataset configurations to ensure they're not substantially different from the training set or incorrectly labeled.
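A cheap first pass on the dataset checks above is to validate the labels themselves before touching the model. This helper (its name and the exact checks are purely illustrative) flags labels outside the valid range and classes that never appear, for both the training and test splits:

```python
def check_labels(train_labels, test_labels, num_classes):
    """Return a list of human-readable dataset issues (empty if none found)."""
    issues = []
    for split, labels in (("train", train_labels), ("test", test_labels)):
        # Labels outside [0, num_classes) usually mean a mapping or dataloader bug.
        bad = [l for l in labels if not 0 <= l < num_classes]
        if bad:
            issues.append(f"{split}: {len(bad)} labels outside [0, {num_classes})")
        # A class that never appears suggests the splits are configured differently.
        missing = set(range(num_classes)) - set(labels)
        if missing:
            issues.append(f"{split}: classes never seen: {sorted(missing)}")
    return issues


# Example: test split has an out-of-range label (3) and never sees class 2.
problems = check_labels([0, 1, 2, 1], [0, 3, 1], num_classes=3)
```

Running the same kind of check on both splits also catches the train/test mismatch mentioned above.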

Finally, review the model architecture and any modifications, such as changes in the loss function. Starting with a baseline model, gradually integrate new ideas, checking at each step to identify where problems might arise. For academic publications, I recommend researchers adhere to conventional training settings specific to their field, reserving fine-tuning of these settings for the final step.
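The "baseline first, one change at a time" workflow can be sketched as a small config generator (the dict-based config format and the modification names are hypothetical): each yielded config adds exactly one modification on top of the previous ones, so a regression in loss can be pinned to a single change.

```python
def incremental_configs(baseline, modifications):
    """Yield (name, config) pairs: the baseline, then one added change at a time."""
    config = dict(baseline)
    yield "baseline", dict(config)
    for name, change in modifications:
        config.update(change)      # layer this change on top of all previous ones
        yield name, dict(config)   # copy, so later updates don't mutate earlier configs


# Example: test a new loss function first, then a scheduler change on top of it.
steps = list(incremental_configs(
    {"loss": "cross_entropy", "scheduler": "none"},
    [("focal_loss", {"loss": "focal"}),
     ("cosine_lr", {"scheduler": "cosine"})],
))
# Train with each config in order and compare losses to find where things break.
```

This keeps every run comparable to the one before it, which is what makes the bisection-style debugging described above work.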

I've found this incremental approach often proves faster and more efficient for debugging.

choigiheon commented 1 month ago

Wow, thank you!! It is very helpful for me. I've always been confused about what's wrong, but I think I should approach it systematically.