d8ahazard / sd_dreambooth_extension


Training gone wrong with loss=nan, loss_avg=nan after some steps #859

Closed pcserviceburgas closed 1 year ago

pcserviceburgas commented 1 year ago

Kindly read the entire form below and fill it out with the requested information.

Please find the following lines in the console and paste them below. If you do not provide this information, your issue will be automatically closed.

Python revision: 3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
Dreambooth revision: 9f4d931a319056c537d24669cb950d146d1537b0
SD-WebUI revision: 15e89ef0f6f22f823c19592a401b9e4ee477258c

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.16rc425 installed.
[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.

Have you read the Readme? Yes
Have you completely restarted the stable-diffusion-webUI, not just reloaded the UI? Yes
Have you updated Dreambooth to the latest revision? Yes
Have you updated the Stable-Diffusion-WebUI to the latest version? Yes
No, really. Please save us both some trouble and update the SD-WebUI and Extension and restart before posting this. Reply 'OK' Below to acknowledge that you did this. OK

Describe the bug
Training starts normally. After some steps/epochs the generated preview images are black and the output shows loss=nan, loss_avg=nan.

Provide logs
Training starts normally, like this:

Steps:   1%|▌                     | 36/4200 [00:14<25:29,  2.72it/s, loss=0.00933, loss_avg=0.0995, lr=1e-6, vram_usage=9.8]

But then something goes wrong. The generated previews are black and the output is:

Steps:   4%|██                   |158/4200 [00:57<24:39,  2.73it/s, loss=nan, loss_avg=nan, lr=1e-6, vram_usage=11.8]

If a crash has occurred, please provide the entire stack trace from the log, including the last few log messages before the crash occurred. There's no crash.

Environment
What OS? Windows
If Windows - WSL or native? Native
What GPU are you using? Tesla T4 16GB

Screenshots/Config
If the issue is specific to an error while training, please provide a screenshot of training parameters or the db_config.json file from /models/dreambooth/MODELNAME/db_config.json
db_config.zip

Tried fp16 and bf16 - with no luck.

Zuxier commented 1 year ago

The xformers version that auto now installs by default is not compatible with training, and cu117 gives worse training results as well. Join the Discord if you can: d8 dreambooth discord

Be sure to have the venv active:

pip uninstall torch torchvision
pip uninstall torchaudio
pip uninstall xformers
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/torch13/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
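For anyone following along, a minimal sketch of how that sequence might look on a native Windows install, assuming the default stable-diffusion-webui venv layout (the activation path and the final version check are illustrative additions, not part of the original instructions):

```
:: run from the stable-diffusion-webui folder so pip targets the webui venv
venv\Scripts\activate

:: remove the builds the webui installed by default
pip uninstall -y torch torchvision torchaudio xformers

:: reinstall torch/torchvision built against CUDA 11.6, then the 0.0.14.dev0 xformers wheel
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/torch13/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

:: confirm what the venv will actually load
python -c "import torch, xformers; print(torch.__version__, torch.version.cuda, xformers.__version__)"
```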

pcserviceburgas commented 1 year ago

pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116

Thank you for your response. There are errors while running the command:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
open-clip-torch 2.7.0 requires protobuf==3.20.0, but you have protobuf 3.19.6 which is incompatible.
clean-fid 0.1.29 requires requests==2.25.1, but you have requests 2.28.2 which is incompatible.
Zuxier commented 1 year ago

pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116

Thank you for your response. There are errors while running the command:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
open-clip-torch 2.7.0 requires protobuf==3.20.0, but you have protobuf 3.19.6 which is incompatible.
clean-fid 0.1.29 requires requests==2.25.1, but you have requests 2.28.2 which is incompatible.

I think those warnings are unrelated; check if the installation completed.

pcserviceburgas commented 1 year ago

check if the installation was completed

Yes, it's successful. But those dependencies...

Training is now working (15% done so far) but at a slower speed. Anyway, thank you for the solution!

Zuxier commented 1 year ago

Well, if faster means NaN, I'd rather train slowly.

pcserviceburgas commented 1 year ago

if faster means NaN

Of course not! The speed was 2.72 it/s before NaN, sometimes even up to 40-50% of training. Now it's 1.64 it/s, but you're right:

rather train slowly

lynfield commented 1 year ago

The xformers version that auto now installs by default is not compatible with training, and cu117 gives worse training results as well. Join the Discord if you can: d8 dreambooth discord

Be sure to have the venv active:

pip uninstall torch torchvision
pip uninstall torchaudio
pip uninstall xformers
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/torch13/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Thanks a lot!! I've tried many times to solve this problem; I even thought it was my dataset's fault or that my graphics card had broken down...

jlee2109 commented 1 year ago

Is this purely a bug with xformers then, or does a fix need to be made to this extension in how it selects the version to install?

I was able to work around this issue for the past couple of days by switching memory attention to flash_attention, but I'll try @Zuxier's fix above. Out of curiosity, does using xformers change the training results, or is it just a memory optimization technique? I'm wondering if I should retry some of the (not too successful) training approaches after using this fix.

jlee2109 commented 1 year ago

Sorry for the spam, but I've been speculating that https://github.com/facebookresearch/xformers/issues/631 is describing this issue on the xformers side. It would seem to explain how the training starts off okay, but then reliably bombs (presumably when the backward pass starts, but I'm not knowledgeable enough to say that with confidence).

Zuxier commented 1 year ago

Sorry for the spam, but I've been speculating that facebookresearch/xformers#631 is describing this issue on the xformers side. It would seem to explain how the training starts off okay, but then reliably bombs (presumably when the backward pass starts, but I'm not knowledgeable enough to say that with confidence).

I have built a good number of xformers versions: 0.14dev0, 0.14, 0.15, 0.16. 0.14dev0 is the only one that can actually train; everything else either doesn't train or produces NaN. I think the issue is probably related to the fact that xformers is usually tested on cards that don't have consumer-card limitations, shared memory mainly, so that issue didn't really show up.
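For reference, a hedged sketch of how one of those versions might be built from source inside the activated webui venv, roughly following the xformers repository's pip-from-git install pattern (the tag name and the ninja prerequisite are assumptions; a CUDA toolchain and MSVC build tools are required):

```
:: ninja speeds up the CUDA extension build
pip install ninja

:: build and install a specific tag straight from the repository (tag is illustrative)
pip install -v -U git+https://github.com/facebookresearch/xformers.git@v0.0.16#egg=xformers
```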

pcserviceburgas commented 1 year ago

I have built a good number of xformers versions: 0.14dev0, 0.14, 0.15, 0.16.

Have you tested xformers-0.0.16 with torch+cu118?

Zuxier commented 1 year ago

I have built a good number of xformers versions: 0.14dev0, 0.14, 0.15, 0.16.

Have you tested xformers-0.0.16 with torch+cu118?

yes

sebaxakerhtc commented 1 year ago

[+] xformers version 0.0.16rc425 installed.

I have a Quadro RTX A4000 and my speed is 2.0 it/s. Today I tested xformers 0.0.16rc425 in WSL and got no NaN - it just works! And the speed...! 3.13 it/s:

Steps:  86%|████████████████ | 1815/2100 [12:42<01:30, 3.13it/s, loss=0.26, lr=1e-6]

Looks like the problem is only on Windows native.

Zuxier commented 1 year ago

[+] xformers version 0.0.16rc425 installed.

I have a Quadro RTX A4000 and my speed is 2.0 it/s. Today I tested xformers 0.0.16rc425 in WSL and got no NaN - it just works! And the speed...! 3.13 it/s: Steps:  86%|████████████████ | 1815/2100 [12:42<01:30, 3.13it/s, loss=0.26, lr=1e-6]

Looks like the problem is only on Windows native.

People have issues on Linux too; please double-check.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days

openconcerto commented 1 year ago

xformers 0.0.17.dev442 is reported as working.

rikabi89 commented 1 year ago

How do I upgrade to xformers 0.0.17.dev442 on automatic? I tried pip install 0.0.17.dev442 - it says it installed, but when I launch webui-user.bat it still shows the older version. Tried deleting the venv folder and same thing.

litaotju commented 1 year ago

It's a known issue of the xformers library on 30xx GPUs; try installing the dev version of xformers. This works on my 3060.

pip install xformers==0.0.17.dev466

See https://github.com/facebookresearch/xformers/issues/631
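Relatedly, for the earlier question about the old version still showing after an upgrade, a minimal sketch of pinning that build inside the webui's own venv on Windows (the activation path and the version check are assumptions about a default install; the version pin mirrors the one above):

```
:: activate the webui venv first, otherwise pip installs into the system Python
venv\Scripts\activate

:: the package name must be included in the pin
pip install -U xformers==0.0.17.dev466

:: check which version the venv actually reports
python -c "import xformers; print(xformers.__version__)"
```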

PeiqinSun commented 12 months ago

Is xformers==0.0.21 + cuda11.7 OK? Can anyone verify it?