Closed suhaneshivam closed 11 months ago
hi! do you use the release branch, or main? things are in flux on main branch right now, but i still wouldn't have expected this..
black images happen when the loss goes to infinity during training. i don't think you changed the vae that is in use, right?
oh, you must set `rescale_betas_zero_snr` to False for vanilla SDXL
I am not sure where exactly I need to set this parameter to False. Do I need to set it to False in the env file and then retrain, or do I have to pass it during inference? I also tried changing the vae but am still getting the same result.
```python
# inference-time scheduler override: swap in DDIM with zero-terminal-SNR
# rescaling and trailing timestep spacing
from diffusers import DDIMScheduler

pipeline.scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    rescale_betas_zero_snr=True,
    timestep_spacing="trailing",
)
```
here in this scheduler invocation you are setting `rescale_betas_zero_snr=True`. the env file seems to have it disabled:

```bash
## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
```
i would recommend keeping it disabled for SDXL unless you want to go through about 50,000 steps of training on a few million images to overhaul the whole noise schedule :D
i think you just have to set that value to False during inference time, i don't think retraining is necessary.
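something like this should do it at inference time (a sketch, assuming the same `pipeline` and `model_id` as in your snippet above):

```python
# sketch: same pipeline/model_id as above, but with zero-terminal-SNR
# rescaling disabled to match the vanilla SDXL noise schedule
from diffusers import DDIMScheduler

pipeline.scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    rescale_betas_zero_snr=False,
)
```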
I set the parameter `rescale_betas_zero_snr` to False inside the scheduler but am still getting black images with all-NaN values. However, I do get the expected images when I train the model again with `--mixed_precision=no`, keeping the other settings unchanged. I think this has to do with the warning

```
RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
```

which I get every time I run inference.
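To double-check where the NaNs come from, something like this (a sketch using the same pipeline as above) can show whether they are already present in the raw float output before the uint8 cast:

```python
import numpy as np

# sketch: ask the pipeline for raw float arrays instead of PIL images,
# so the values can be inspected before the (images * 255) uint8 cast
# that triggers the RuntimeWarning
result = pipeline(prompt="test prompt", output_type="np")
print("contains NaN:", np.isnan(result.images).any())
```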
what were your loss values during training?
you could also greatly simplify the example and not use Compel for prompt handling, instead passing the `prompt` and `negative_prompt` inputs directly. just initialise the pipeline with `pipeline = ...from_pretrained('/path/to/model')` and allow it to fully pick up the model config, its default scheduler, and everything else. that will likely use Euler and be more reliable.
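something like this is all that's needed (a sketch, assuming `StableDiffusionXLPipeline` as the pipeline class and fp16 on CUDA; adjust for your setup):

```python
import torch
from diffusers import StableDiffusionXLPipeline  # assumed pipeline class for an SDXL checkpoint

# load the full pipeline so it picks up the saved model config and
# its default scheduler (likely Euler) from the checkpoint directory
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "/path/to/model", torch_dtype=torch.float16
).to("cuda")

# plain prompt handling, no Compel embeddings
image = pipeline(
    prompt="a photo of a cat in the snow",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
).images[0]
image.save("test.png")
```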
I initially set the training epochs to 25 but checked the result after 5 epochs. When I realised it was generating blank images, I terminated the training job. Then I re-ran the training for another single epoch to save the pipeline, and ran inference using `pipeline = ...from_pretrained('/path/to/model')` without Compel. I am still getting the same result. I checked the loss and it turned out that it was NaN throughout the training.
:cry: that is never fun. do you have any debug logs for that session?
Sure! Not debug logs though. https://drive.google.com/file/d/1eC4qhH2V2lfFm-y3ZLVkbMNvm6k32NTX/view?usp=drive_link
those say i need access. if you can recreate the issue easily, please do so with `SIMPLETUNER_LOG_LEVEL=DEBUG` in your env file, run `bash train_sdxl.sh > train.log 2>&1`, and then provide train.log here, with whatever info redacted as needed.
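i.e. something like this (a sketch; the exact env file name depends on your setup, it just needs to be whichever file train_sdxl.sh sources):

```bash
# in the env file (exact filename depends on your setup):
export SIMPLETUNER_LOG_LEVEL=DEBUG

# re-run training, capturing stdout and stderr to a file you can attach:
bash train_sdxl.sh > train.log 2>&1
```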
I will provide you with logs soon.
Here are the debug logs.
https://drive.google.com/file/d/1dzynYWKaA1J5wGzavKz16yrYA5xQf-gV/view?usp=drive_link
I was able to get the expected results when I trained with the `mixed_precision=no` option. Also, it did not save the VAE cache in the cache_vae directory, so I had to tweak the script to generate it at runtime.
please try reproducing this on v0.8.0
Hi, I have trained the model for 5 epochs, and when I run inference using the saved checkpoints, all I am getting is blank images. I am using this Python code:

I am also getting this warning in the end:

The env file which I used for training:

My requirements.txt:

nvidia-smi output: