DiffusionLight / DiffusionLight-LoRA-Trainer

Training code for the LoRA used in DiffusionLight
MIT License

Issue with validation: good validation_prompt, or other issues? #2

Open Jerrypiglet opened 6 months ago

Jerrypiglet commented 6 months ago

Hi there, I have a question about validation in the training script. I tried to enable validation every few epochs by providing --validation_prompt="a perfect mirrored reflective chrome ball sphere". However, the training loss does not meaningfully decrease during training, and the images generated during validation seem to collapse at first but then converge to chrome balls with fall foliage in the background. I wonder whether this is an issue with a wrong validation prompt, or something else? Thanks!

Initial generated image in validation, without training:

image

Generated image after 1 epoch:

image

Generated images after 19 epochs with random seeds (other than 0):

image

Training loss of each iteration:

image
pureexe commented 6 months ago

No, you didn't do anything wrong.

Actually, we didn't use TensorBoard during training. The TensorBoard logging is part of the Hugging Face code that we built on.

The way we validate and pick the best hyperparameters is to train the LoRA to multiple checkpoints, run each through the entire light-estimation process, and render the results with the StyleLight evaluation code.

But let's get back to what's shown on the TensorBoard.

1. Collapsed images

Our training loss only updates the model using the loss computed inside the chrome ball. So it's possible for the image outside the chrome ball to collapse, because no loss is applied to those pixels.
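A minimal sketch of such a masked loss (this is an illustration of the idea, not the repo's actual implementation; the `masked_diffusion_loss` name and normalization are my own):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model_pred, target, mask):
    """MSE computed only on pixels inside the chrome-ball mask.

    model_pred, target: (B, C, H, W) tensors (e.g. predicted vs. true noise).
    mask: (B, 1, H, W) tensor, 1 inside the chrome ball, 0 outside.
    """
    # Per-pixel squared error, zeroed outside the ball.
    per_pixel = F.mse_loss(model_pred, target, reduction="none") * mask
    # Normalize by the number of masked elements (mask pixels x channels)
    # so the scale is comparable to a full-image MSE.
    denom = (mask.sum() * model_pred.shape[1]).clamp(min=1)
    return per_pixel.sum() / denom
```

Because the gradient is exactly zero outside the mask, the model is free to produce anything there, which is what shows up as "collapsed" background pixels in the validation images.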

2. Increasing training loss

The loss in the training step is meaningless (for us). It doesn't tell how good or bad the training is; as long as it doesn't fall to NaN, that's good enough for me.

There are two factors that make the training loss look so random:

  1. Timestep (0 < t < 1000): at each training step the timestep is picked randomly. At high timesteps (t close to 1000) the image is almost pure noise, while at low timesteps (t close to 0) the image is almost clean. So the loss swings depending on the sampled timestep.

  2. Training target: the training loss is the L2 distance to the Gaussian noise. So the training loss only tells you how close the output is to the Gaussian noise, which I think says nothing about how well the chrome ball is reconstructed.
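Both factors are visible in the standard epsilon-prediction objective that diffusers-style training scripts use. A hedged sketch (the `unet` and `scheduler` interfaces below follow the usual diffusers shape, but this is a simplified stand-in, not the repo's exact training step):

```python
import torch
import torch.nn.functional as F

def training_loss(unet, scheduler, latents):
    """One simplified epsilon-prediction training step.

    The regression target is the Gaussian noise itself, and the timestep
    is drawn uniformly at random, which is why the logged per-step loss
    swings so much and says little about chrome-ball quality.
    """
    noise = torch.randn_like(latents)
    # Random timestep per sample: high t -> input is almost pure noise,
    # low t -> input is almost clean, so difficulty varies step to step.
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy_latents, t).sample
    # L2 to the Gaussian noise, not to a reconstructed image.
    return F.mse_loss(pred, noise)
```

Averaging this loss over many steps (or evaluating at a fixed timestep) would give a smoother curve, but even then it measures noise prediction, not downstream light-estimation quality, which is why the authors validate via the full pipeline instead.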