pcserviceburgas closed this issue 1 year ago.
The xformers version that auto now installs by default is not compatible with training, and cu117 gives worse training results as well. Join the Discord if you can: d8 dreambooth discord
Make sure the venv is active:
pip uninstall torch torchvision
pip uninstall torchaudio
pip uninstall xformers
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/torch13/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
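After running the commands above, a quick way to confirm the intended builds actually landed in the venv (a minimal sketch; the expected tags mirror the install commands above, so adjust them if you pinned different versions):

```python
# Post-install sanity check: run inside the activated venv.
# The expected tags mirror the install commands above; adjust if
# you pinned different versions.
import importlib.metadata as md

def status(pkg: str, version, expected_tag: str) -> str:
    """Format a one-line report, e.g. 'torch 1.13.1+cu116 OK'."""
    if version is None:
        return f"{pkg}: not installed"
    if expected_tag in version:
        return f"{pkg} {version} OK"
    return f"{pkg} {version} unexpected (wanted {expected_tag})"

def check(pkg: str, expected_tag: str) -> str:
    """Look up the installed version of pkg and report whether it matches."""
    try:
        version = md.version(pkg)
    except md.PackageNotFoundError:
        version = None
    return status(pkg, version, expected_tag)

print(check("torch", "cu116"))        # expect a +cu116 build
print(check("xformers", "0.0.14"))    # expect the 0.0.14.dev0 wheel
```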
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
Thank you for your response. There are errors when running the command:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
open-clip-torch 2.7.0 requires protobuf==3.20.0, but you have protobuf 3.19.6 which is incompatible.
clean-fid 0.1.29 requires requests==2.25.1, but you have requests 2.28.2 which is incompatible.
I think those warnings are unrelated, check if the installation was completed
check if the installation was completed
Yes, it's successful. But those dependencies...
Training now works (currently at 15%), though at a slower speed; anyway, thank you for the solution!
Well, if faster means NaN, I'd rather train slowly.
if faster means NaN
Of course not! The speed was 2.72 it/s before the NaN, which sometimes appeared even 40-50% into training. Now it's 1.64 it/s, but you're right:
rather train slowly
Thanks a lot!! I've tried many times to solve this problem; I had even thought it was my dataset's fault or that my graphics card had broken down ...
Is this purely a bug with xformers then, or does a fix need to be made to this extension in how it selects the version to install?
I was able to work around this issue for the past couple of days by switching memory attention to flash_attention, but I'll try @Zuxier's fix above. Out of curiosity, does using xformers change the training results, or is it just a memory optimization technique? I'm wondering if I should retry some of the (not too successful) training approaches after using this fix.
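On the "just a memory optimization" question: memory-efficient attention is supposed to produce the same values as standard attention; it only avoids materializing the full score matrix by processing keys in chunks with a running softmax. The NaNs in this thread are a bug, not an expected trade-off. A toy pure-Python sketch of the idea (not xformers' actual kernel):

```python
import math

def attention(q, ks, vs):
    """Naive attention for one query: softmax(q . k) weighted sum of v."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(vs[0])
    return [sum(w[i] * vs[i][j] for i in range(len(vs))) / z
            for j in range(dim)]

def attention_chunked(q, ks, vs, chunk=2):
    """Same result, but keys/values are processed in chunks with a
    running (online) softmax, so all scores are never stored at once."""
    dim = len(vs[0])
    m, z, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(ks), chunk):
        kc, vc = ks[start:start + chunk], vs[start:start + chunk]
        scores = [sum(a * b for a, b in zip(q, k)) for k in kc]
        new_m = max(m, max(scores))
        scale = math.exp(m - new_m)  # rescale old accumulators to new max
        z = z * scale + sum(math.exp(s - new_m) for s in scores)
        acc = [a * scale + sum(math.exp(scores[i] - new_m) * vc[i][j]
                               for i in range(len(vc)))
               for j, a in enumerate(acc)]
        m = new_m
    return [a / z for a in acc]

q = [1.0, 0.5]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(attention(q, ks, vs))
print(attention_chunked(q, ks, vs))  # same values, chunk by chunk
```

Both paths should agree to floating-point precision, which is why a correct memory-efficient kernel should not change training results.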
Sorry for the spam, but I've been speculating that https://github.com/facebookresearch/xformers/issues/631 describes this issue on the xformers side. It would seem to explain how the training starts off okay but then reliably bombs (presumably when the backward pass starts, but I'm not knowledgeable enough to say that with confidence).
I have built a good number of xformers versions: 0.0.14dev0, 0.0.14, 0.0.15, 0.0.16. 0.0.14dev0 is the only one that can actually train; everything else either doesn't train or produces NaN. I think the issue is probably related to the fact that xformers is usually tested on cards that don't have consumer-card limitations, mainly shared memory, so the issue didn't really show up.
I have built a good number of xformers versions: 0.0.14dev0, 0.0.14, 0.0.15, 0.0.16.
Have you tested xformers-0.0.16 with torch+cu118?
yes
[+] xformers version 0.0.16rc425 installed.
I have a Quadro RTX A4000 and my speed is 2.0 it/s.
Today I tested xformers 0.0.16rc425 in WSL and got no NaN: it just works!
And the speed...! 3.13 it/s
Steps: 86%|████████████████ | 1815/2100 [12:42<01:30, 3.13it/s, loss=0.26, lr=1e-6]
Looks like the problem only occurs on native Windows.
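Since the symptom seems tied to native Windows vs WSL, a quick way to check which environment the webui's Python is actually running in (a small sketch; it relies on WSL kernels reporting "microsoft" in the release string, which holds for WSL1/WSL2 but is an assumption, not a guarantee):

```python
import platform

def runtime_env() -> str:
    """Rough classification: 'windows-native', 'wsl', or the OS name."""
    system = platform.system()
    if system == "Windows":
        return "windows-native"
    if system == "Linux" and "microsoft" in platform.release().lower():
        return "wsl"
    return system.lower()

print(runtime_env())
```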
People have issues on Linux too; please double-check.
This issue is stale because it has been open 5 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
xformers 0.0.17.dev442 is reported as working.
How do I upgrade to xformers 0.0.17.dev442 on automatic? I tried pip install xformers==0.0.17.dev442; it says it installed, but when I launch webui-user.bat it still shows the older version. I tried deleting the venv folder and the same thing happens.
It's a known issue with the xformers library on 30xx GPUs; try installing the dev version of xformers. This works on my 3060.
pip install xformers==0.0.17.dev466
See https://github.com/facebookresearch/xformers/issues/631
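A common reason pip reports success yet the webui still shows the old version is that pip wrote to a different interpreter's site-packages than the one the webui's venv uses. Running this with the same python the webui launches makes the mismatch visible (a generic check, not specific to this extension):

```python
import sys
import sysconfig

# If 'in a venv' prints False, or site-packages is not under the
# webui's venv folder, pip upgraded a different Python than the
# one the webui actually uses.
print("interpreter:   ", sys.executable)
print("site-packages: ", sysconfig.get_paths()["purelib"])
print("in a venv:     ", sys.prefix != sys.base_prefix)
```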
Is xformers==0.0.21 + CUDA 11.7 OK? Can anyone verify it?
Kindly read the entire form below and fill it out with the requested information.
Please find the following lines in the console and paste them below. If you do not provide this information, your issue will be automatically closed.
Python revision: 3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
Dreambooth revision: 9f4d931a319056c537d24669cb950d146d1537b0
SD-WebUI revision: 15e89ef0f6f22f823c19592a401b9e4ee477258c

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.16rc425 installed.
[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.
Have you read the Readme? Yes
Have you completely restarted the stable-diffusion-webUI, not just reloaded the UI? Yes
Have you updated Dreambooth to the latest revision? Yes
Have you updated the Stable-Diffusion-WebUI to the latest version? Yes
No, really. Please save us both some trouble and update the SD-WebUI and Extension and restart before posting this. Reply 'OK' Below to acknowledge that you did this. OK
Describe the bug: Training starts normally. After some steps/epochs the generated preview images are black and the output shows
loss=nan, loss_avg=nan
Provide logs: Training starts normally like this.
But then something goes wrong. The generated previews are black and the output is
If a crash has occurred, please provide the entire stack trace from the log, including the last few log messages before the crash occurred: There's no crash.
Environment
What OS? Windows
If Windows - WSL or native? Native
What GPU are you using? Tesla T4 16GB
Screenshots/Config: If the issue is specific to an error while training, please provide a screenshot of training parameters or the db_config.json file from /models/dreambooth/MODELNAME/db_config.json: db_config.zip
Tried fp16 and bf16 - with no luck.
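Separate from the xformers fix itself, a generic guard like this in any training loop fails fast once the loss goes non-finite, instead of spending the remaining steps generating black previews (a sketch, not part of this extension's code):

```python
import math

def check_finite_loss(step: int, loss: float) -> None:
    """Raise as soon as loss becomes NaN/Inf so a bad run stops early."""
    if not math.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss}")

check_finite_loss(10, 0.26)                 # healthy step, no error
try:
    check_finite_loss(1815, float("nan"))   # what this bug produces
except RuntimeError as err:
    print(err)
```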