devilismyfriend / StableTuner

Finetuning SD in style.
GNU Affero General Public License v3.0

Incredibly Frustrating Bug - Training model collapses due to tkinter #93

Open. Claxiz opened this issue 1 year ago

Claxiz commented 1 year ago

With certain settings (I'm not sure which ones contribute to it), this error prints out:

Weights saved to C:/AI/StableTuner/models/1osvgA\epoch_80
Steps To Epoch: 33%|██████████████████████▎ | 4/12 [00:08<00:16, 2.05s/it]
Using FlashAttention
|█████████████████████████████████████▊ | 1032/1200 [47:35<05:32, 1.98s/it, loss=nan, lr=5e-6]
Overall Epochs: 86%|███████████████████████████████████████████████████████▉ | 86/100 [47:35<06:58, 29.89s/it]
C:\ProgramData\anaconda3\envs\ST\lib\site-packages\diffusers\pipeline_utils.py:788: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")

Training continues, with the loss going from its normal range to loss=nan, until training finishes, at which point this error appears:

bgerror failed to handle background error.
    Original error: invalid command name "1414340073536update"
    Error in bgerror: can't invoke "tk" command: application has been destroyed
bgerror failed to handle background error.
    Original error: invalid command name "1414484413504_click_animation"
    Error in bgerror: can't invoke "tk" command: application has been destroyed
bgerror failed to handle background error.
    Original error: invalid command name "1414523277696check_dpiscaling"
    Error in bgerror: can't invoke "tk" command: application has been destroyed
warning: redirecting to https://github.com/devilismyfriend/StableTuner.git/
Latest git hash: ef51982

That is the entire traceback. The training session was started using fp32, alongside the settings in the attached screenshot (Capture).

The ultimate effect of this error is that the model being trained collapses, breaking everything saved after the failed epochs. Once this happens, loading one of these models in something like the webui causes generation to fail on startup, with errors asking for "Upcast cross attention layer to float32" to be turned on in settings, along with a change to the command-line args. If the model is loaded in the webui with the settings the error requests, generations come out as black images only.
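
Since the log shows the loss turning to nan well before training ends, one way to avoid saving collapsed checkpoints is to stop at the first non-finite loss. The following is a minimal sketch, assuming the loss is available as a plain Python float at each step. `NonFiniteLossError`, `check_loss`, and `first_bad_step` are hypothetical names for illustration and are not part of StableTuner.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when the training loss becomes NaN or infinite (hypothetical helper)."""


def check_loss(loss_value: float, step: int) -> float:
    """Return the loss unchanged, or raise if it is NaN or +/-inf."""
    if not math.isfinite(loss_value):
        raise NonFiniteLossError(
            f"loss became {loss_value} at step {step}; "
            "stop here instead of saving further checkpoints"
        )
    return loss_value


def first_bad_step(losses):
    """Return the index of the first non-finite loss in a sequence, or None."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None
```

Calling `check_loss` once per step inside a training loop would end the run at the step where the loss breaks, while `first_bad_step` can be applied to a list of logged losses to narrow down which batch, and therefore which images, were being processed when it happened.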

devilismyfriend commented 1 year ago

Looks like one of your images might be corrupted. tkinter has nothing to do with this; it's unloaded during training and reloaded after training finishes or fails.
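
To act on the corrupted-image suggestion, a cheap first pass is to check that each dataset file starts and ends with the byte markers its format requires. This is only a sketch using the Python standard library: `looks_intact` and `find_suspect_images` are hypothetical names, only PNG and JPEG are handled, and passing this check does not prove an image decodes correctly (it mainly catches truncated files and files with the wrong header).

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # first 8 bytes of every PNG file
PNG_IEND = b"IEND\xaeB`\x82"           # IEND chunk type plus its fixed CRC
JPEG_SOI = b"\xff\xd8"                 # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"                 # JPEG end-of-image marker


def looks_intact(data: bytes) -> bool:
    """Check header and trailer bytes only; this does not decode the image."""
    if data.startswith(PNG_SIGNATURE):
        return data.endswith(PNG_IEND)
    if data.startswith(JPEG_SOI):
        # Tolerate trailing zero padding after the end marker.
        return data.rstrip(b"\x00").endswith(JPEG_EOI)
    return False  # unknown or mismatched format


def find_suspect_images(folder):
    """Return the names of PNG/JPEG files in `folder` that fail the quick check."""
    suspects = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
            if not looks_intact(path.read_bytes()):
                suspects.append(path.name)
    return suspects
```

Any file this flags is worth opening in an image viewer or re-exporting before the next training run. A file saved in a different format under a `.png` or `.jpg` name would also be reported, which is usually worth knowing about as well.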