TheLastBen / fast-stable-diffusion

fast-stable-diffusion + DreamBooth

error starting training textual inversion #1604

Open Rudy34160 opened 1 year ago

Rudy34160 commented 1 year ago

Since last night, I've been getting the error below when I launch a training run. I never had a problem until now... Any idea what I'm doing wrong?

Training at rate of 0.9 until step 1000
Preparing dataset...
50% 10/20 [00:05<00:05, 1.89it/s]
Error completing request
Arguments: ('task(dvl5eia1y9q5fxe)', 'testai01', '0.9', 10, 1, '/content/gdrive/MyDrive/sd/txt/', 'textual_inversion', 512, 512, True, 1000, 'disabled', '0.1', False, 0, 'deterministic', 20, 20, 'humansubject_filewords.txt', True, True, 'portrait', '', 35, 14, 7, -1.0, 512, 512) {}
Traceback (most recent call last):
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/textual_inversion/ui.py", line 33, in train_embedding
    embedding, filename = modules.textual_inversion.textual_inversion.train_embedding(*args)
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 413, in train_embedding
    ds = modules.textual_inversion.dataset.PersonalizedBase(data_root=data_root, width=training_width, height=training_height, repeats=shared.opts.training_image_repeats_per_epoch, placeholder_token=embedding_name, model=shared.sd_model, cond_model=shared.sd_model.cond_stage_model, device=devices.device, template_file=template_file, batch_size=batch_size, gradient_step=gradient_step, shuffle_tags=shuffle_tags, tag_drop_out=tag_drop_out, latent_sampling_method=latent_sampling_method, varsize=varsize)
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/textual_inversion/dataset.py", line 88, in __init__
    latent_dist = model.encode_first_stage(torchdata.unsqueeze(dim=0))
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "/content/gdrive/MyDrive/sd/stable-diffusion-webui/modules/sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/content/gdrive/MyDrive/sd/stablediffusion/ldm/models/diffusion/ddpm.py", line 830, in encode_first_stage
    return self.first_stage_model.encode(x)
  File "/content/gdrive/MyDrive/sd/stablediffusion/ldm/models/autoencoder.py", line 83, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/sd/stablediffusion/ldm/modules/diffusionmodules/model.py", line 526, in forward
    h = self.down[i_level].block[i_block](hs[-1], temb)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/sd/stablediffusion/ldm/modules/diffusionmodules/model.py", line 131, in forward
    h = self.norm1(h)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/normalization.py", line 273, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2528, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 14.76 GiB total capacity; 6.37 GiB already allocated; 7.27 GiB free; 6.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
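
For what it's worth, the last line of the error suggests the allocator setting max_split_size_mb. On Colab this can be tried by exporting PYTORCH_CUDA_ALLOC_CONF before the webui (and therefore PyTorch) touches the GPU. A minimal sketch of such a notebook cell, with an arbitrary example value:

```python
# Sketch of a Colab cell, assuming the webui has NOT been launched yet:
# the allocator option must be in the environment before PyTorch initializes CUDA.
# max_split_size_mb:128 is an arbitrary example value; this only helps when the
# failure is fragmentation (reserved >> allocated), not a genuine lack of VRAM.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Any process started from this notebook afterwards (e.g. the webui launcher)
# inherits the variable.
```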

TheLastBen commented 1 year ago

Out of VRAM

Rudy34160 commented 1 year ago

> Out of VRAM

I had deduced the same type of error from the message. But it's still surprising on Google Colab, right? I had never run into this kind of problem while training my textual inversions until now. Could one of the parameters I changed be set incorrectly and be consuming too much VRAM? If so, which one? 🤔

Rb-diff commented 1 year ago

> > Out of VRAM
>
> I had deduced the same type of error from the message. But it's still surprising on Google Colab, right? I had never run into this kind of problem while training my textual inversions until now. Could one of the parameters I changed be set incorrectly and be consuming too much VRAM? If so, which one? 🤔

At first glance I'd say the learning rate; 0.9 looks extremely high to me. Also... is 10 your batch size?

Rudy34160 commented 1 year ago

Rb

Yes, batch size is 10; I've never had a crash with this setting. So it could indeed come from the learning rate. Despite a lot of reading on the subject, I haven't found any details about it: is there a limit? (I thought we could play with any value between 0 and 1.) Without more information, I'm fumbling along with empirical tests...
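
For reference, and going from memory so take it with a grain of salt: there is no hard cap on the field, but textual-inversion embeddings are usually trained at around 0.005 or lower, and the webui's learning-rate box should also accept a step-based schedule instead of a single number, roughly like the example below, where each value applies until the step after the colon. The exact numbers here are only illustrative.

```
0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.001:3000, 0.0005
```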

Rb-diff commented 1 year ago

Hmm, well, Colab has been fickle for the last day or so, so I don't know. I'm also just learning, but that's what stuck out to me: I've just never seen anyone use a learning rate of 0.9. Most of the ones I see are in the triple decimals, but I don't know if this would affect the VRAM usage... unless it's like telling it to read the entirety of ten books at once and then give an oral book review.

Edit: I just tried a run myself, and for me Colab/SD consistently crashes on any batch size larger than two, and with two it's a fifty-fifty chance it nosedives into a CUDA VRAM crash.
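
On the batch-size point: the VAE encoder's activation memory grows roughly linearly with batch size, so a batch of 10 at 512x512 can easily blow past the ~15 GiB of a Colab T4. A minimal sketch for checking how much headroom the runtime actually has before launching a run (torch.cuda.mem_get_info is standard PyTorch; the 8 GiB threshold is only a guess):

```python
import torch

# Rough sanity check before launching a training run on a Colab GPU.
# torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the current device.
free_b, total_b = torch.cuda.mem_get_info()
free_gib = free_b / 2**30
total_gib = total_b / 2**30
print(f"GPU VRAM: {free_gib:.1f} GiB free of {total_gib:.1f} GiB total")

# Purely illustrative threshold: with only a few GiB free, a large training batch
# at 512x512 is very likely to end in a CUDA out-of-memory error.
if free_gib < 8:
    print("Low headroom -- consider a batch size of 1-2 instead of 10.")
```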

Rudy34160 commented 1 year ago

> [...] I have just never seen anyone use a learning rate of 0.9.

That's why I want to test it for myself. Having found nothing written on the subject, I want to see what it produces...

> Edit: I just tried a run myself, and for me Colab/SD consistently crashes on any batch size larger than two, and with two it's a fifty-fifty chance it nosedives into a CUDA VRAM crash.

I don't know Colab well enough to determine the cause of this type of crash... Perhaps someone more expert can enlighten us? It could just be a temporary technical fault.

Rb-diff commented 1 year ago

I'm guessing Colab is doing something behind the curtains, because it has been twitchy these last 24 hours.

Rudy34160 commented 1 year ago

> Edit: I just tried a run myself, and for me Colab/SD consistently crashes on any batch size larger than two, and with two it's a fifty-fifty chance it nosedives into a CUDA VRAM crash.

No luck... The same error on every one of my attempts... 😥 We'll have to wait for it to work again...

Rudy34160 commented 1 year ago

Any news from the front?

Does anyone have another Colab to recommend for training embeddings while we wait for this one to recover?

Rudy34160 commented 1 year ago

> I'm guessing Colab is doing something behind the curtains, because it has been twitchy these last 24 hours.

Do you know where I could find a good tutorial on building my own SD Colab notebook in the meantime?

TheLastBen commented 1 year ago

Try the Runpod notebooks; they might not crash due to the high VRAM requirements.

Rudy34160 commented 1 year ago

> Try the Runpod notebooks; they might not crash due to the high VRAM requirements.

Not free for my tests... 😁😉 But while waiting for things to improve, I'm running this one: https://github.com/camenduru/stable-diffusion-webui-colab. Maybe it won't crash from the high VRAM.