Coordi777 / Conditional-Diffusion-for-SAR-to-Optical-Image-Translation

The official implementation of Conditional Diffusion for SAR to Optical Image Translation

Time taken and Conditions for finishing training #10

Open Dionysus061726 opened 1 month ago

Dionysus061726 commented 1 month ago

Hi! Thanks for your work! I encountered some problems during training that I would like to ask for some help:

I have been training on my dataset for about 10 days, and the training steps have reached 1.91e+06. How can I know if the model has finished training? Or does the training process continue indefinitely if I don’t stop it manually?

In other words, what are the conditions for the training to end? Where is this reflected in the code?


| metric | value |
| --- | --- |
| grad_norm | 0.0099 |
| loss | 0.00519 |
| loss_q0 | 0.0155 |
| loss_q1 | 4.72e-05 |
| loss_q2 | 1.71e-05 |
| loss_q3 | 8.83e-06 |
| mse | 0.00519 |
| mse_q0 | 0.0155 |
| mse_q1 | 4.72e-05 |
| mse_q2 | 1.71e-05 |
| mse_q3 | 8.83e-06 |
| param_norm | 939 |
| samples | 1.15e+07 |
| step | 1.91e+06 |
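As a sanity check on logs like these, the `samples` counter should be roughly `step` times the global batch size (an assumption about how the logger counts; the numbers above would imply a batch of about 6):

```python
# Rough consistency check between the logged counters (assumes the
# logger increments "samples" by the global batch size each step).
step = 1.91e6
samples = 1.15e7
batch_size = samples / step
print(round(batch_size))  # → 6
```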

Dionysus061726 commented 1 month ago

Here is the loss figure:

[Figure_1: training loss curve]

szh404 commented 4 weeks ago

> I have been training on my dataset for about 10 days, and the training steps have reached 1.91e+06. How can I know if the model has finished training? Or does the training process continue indefinitely if I don't stop it manually? In other words, what are the conditions for the training to end? Where is this reflected in the code?

Hi, the code will not stop automatically unless you end it manually.

The corresponding code is in `guided_diffusion/train_utils.py`:

```python
while (
    not self.lr_anneal_steps
    or self.step + self.resume_step < self.lr_anneal_steps
):
```
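To make the stopping condition concrete, here is a minimal, self-contained sketch of how that loop terminates. The attribute names (`lr_anneal_steps`, `step`, `resume_step`) follow the snippet above; the surrounding trainer class and the stubbed-out training step are assumptions for illustration:

```python
class TrainLoopSketch:
    """Sketch of the stopping logic from the snippet above (hypothetical
    wrapper class; only the loop condition mirrors the real code)."""

    def __init__(self, lr_anneal_steps=0, resume_step=0):
        self.lr_anneal_steps = lr_anneal_steps  # 0 means "train forever"
        self.resume_step = resume_step          # steps already done before resuming
        self.step = 0

    def run_loop(self):
        # If lr_anneal_steps is 0 (the default in many configs), the first
        # clause is True and the loop never exits on its own; otherwise it
        # stops once the total step count reaches lr_anneal_steps.
        while (
            not self.lr_anneal_steps
            or self.step + self.resume_step < self.lr_anneal_steps
        ):
            self.step += 1  # one training iteration (real work omitted)
        return self.step
```

So unless `lr_anneal_steps` is set to a positive value, training runs indefinitely, which matches the behavior described in this issue.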
Dionysus061726 commented 4 weeks ago

@szh404 Thank you! I didn't read the paper carefully enough; the author specified that the number of iterations was 50,000. That was my mistake. I also found out later that it was a matter of the iteration count.

I have no deep learning experience, so I was confused about this. Anyway, I'll keep learning.