mistersprinklez opened this issue 1 year ago
Before the UNet error, I get this text encoder error:
Training the text encoder...
2023-01-03 05:16:10.985527: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
.
'########:'########:::::'###::::'####:'##::: ##:'####:'##::: ##::'######:::
... ##..:: ##.... ##:::'## ##:::. ##:: ###:: ##:. ##:: ###:: ##:'##... ##::
::: ##:::: ##:::: ##::'##:. ##::: ##:: ####: ##:: ##:: ####: ##: ##:::..:::
::: ##:::: ########::'##:::. ##:: ##:: ## ## ##:: ##:: ## ## ##: ##::'####:
::: ##:::: ##.. ##::: #########:: ##:: ##. ####:: ##:: ##. ####: ##::: ##::
::: ##:::: ##::. ##:: ##.... ##:: ##:: ##:. ###:: ##:: ##:. ###: ##::: ##::
::: ##:::: ##:::. ##: ##:::: ##:'####: ##::. ##:'####: ##::. ##:. ######:::
:::..:::::..:::::..::..:::::..::....::..::::..::....::..::::..:::......::::
0% 0/250 [00:00<?, ?it/s] Jabba Jabba Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 852, in
Invoked with: tensor([[[-0.2954, -0.0532, -0.3613, ..., -0.1678, 0.3162, 0.3679]],
[[-0.3757, -0.0265, 0.7148, ..., 0.3625, 0.1262, 0.2776]],
[[-0.4910, -0.2644, -0.0623, ..., -0.0681, 0.0359, 0.6270]],
...,
[[ 0.0436, -0.6108, 0.0047, ..., 0.2971, 0.4290, -0.7031]],
[[-0.0545, -0.7798, -0.5498, ..., -0.0966, 0.4048, -0.6187]],
[[-0.2437, -0.6924, -0.2314, ..., -0.1779, -0.0747, -0.7769]]],
device='cuda:0', dtype=torch.float16, requires_grad=True), tensor([[[ 0.0975, 0.3992, -0.7261, ..., -0.4883, -0.1637, -0.6479]],
[[ 0.2832, 0.9229, -0.2194, ..., 0.0740, -0.1065, -0.6523]],
[[ 0.1605, 0.6011, -0.5474, ..., -0.0182, -0.0898, -0.6641]],
...,
[[-0.5063, 0.0097, 0.0425, ..., 0.7388, -0.3315, 1.5195]],
[[-0.3940, 0.1415, -0.2974, ..., 0.2842, -0.1648, 1.1846]],
[[-0.9570, 0.2820, 0.3958, ..., 0.3896, -0.3459, 1.3447]]],
device='cuda:0', dtype=torch.float16, requires_grad=True), tensor([[[-2.1500e-02, -1.1377e-01, -2.2961e-01, ..., -5.0293e-01,
-6.4087e-02, 1.6577e-01]],
[[ 3.5718e-01, 1.1774e-01, 5.9277e-01, ..., -8.3447e-05,
-9.0637e-02, 1.6846e-01]],
[[ 3.6792e-01, 1.3452e-01, 6.2402e-01, ..., 6.7291e-03,
7.9468e-02, -2.2461e-01]],
...,
[[ 3.9575e-01, -1.2024e-01, 2.8442e-02, ..., 2.5977e-01,
4.3335e-01, -3.2544e-01]],
[[ 2.4597e-01, 9.6741e-02, 4.4824e-01, ..., 1.0078e+00,
1.1797e+00, -8.5010e-01]],
[[ 2.2937e-01, -2.7832e-01, -8.7097e-02, ..., 7.4829e-02,
-3.4570e-01, -4.7046e-01]]], device='cuda:0', dtype=torch.float16,
requires_grad=True), tensor([[[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.]],
...,
[[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.]]], device='cuda:0',
dtype=torch.float16), tensor([ 0, 64, 128, 192, 256, 320, 384, 448, 512, 576, 640, 704,
768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280], device='cuda:0',
dtype=torch.int32), tensor([ 0, 64, 128, 192, 256, 320, 384, 448, 512, 576, 640, 704,
768, 832, 896, 960, 1024, 1088, 1152, 1216, 1280], device='cuda:0',
dtype=torch.int32), 64, 64, 0.0, 0.125, False, False, False, 0, None
0% 0/250 [00:02<?, ?it/s]
Did some troubleshooting and can now see that the notebook trains fine while my GPU is set to Standard. Would love to figure out the issue so I can train on Premium again!
Are you using the latest colab?
Yes, can confirm this happens on the latest notebook.
Thank you for your response! I'm using the link from your git page.
Yes, from the latest notebook!
Any idea what may be going on? @eds123
T4 or A100?
Same here, A100, latest Dreambooth colab, v756 model.
Both Text & Unet training produce same errors on an A100 GPU - Premium Colab. Confirmed that when using Standard GPU, T4 - training works. Latest Colabs.
With the premium GPU, run this:
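# remove the prebuilt xformers wheel and rebuild it from source for the GPU in this runtime (the build can take a while)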
!pip uninstall -y -q xformers
!pip install ninja
!pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
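# zip the freshly built package and copy it to Google Drive so it can be shared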
%cd /content
!zip -r A100 /usr/local/lib/python3.8/dist-packages/xformers
!cp A100.zip /content/gdrive/MyDrive
Then send me the link to the A100.zip
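Once the rebuild finishes, you can sanity-check it before relaunching training. A minimal sketch using the public xformers op (the shapes here are arbitrary, not taken from the notebook); it runs the same fp16 memory-efficient attention forward and backward pass that was raising the TypeError:

import torch
import xformers.ops as xops

# arbitrary shapes: batch=1, seq_len=128, heads=8, head_dim=64, all fp16 on GPU
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)

out = xops.memory_efficient_attention(q, k, v)
out.sum().backward()  # the backward pass is where bwd() was raising the TypeError
print("xformers attention fwd/bwd OK on", torch.cuda.get_device_name(0))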
Here's the link to the A100.zip produced from running the script. https://drive.google.com/file/d/1NApnb3CiUrvRB7si-SVIB25X92v10mnd/view?usp=share_link
Great, thanks!
Hey! So in order to run on premium GPUs, we have to send you this A100.zip file?
He already fixed it on a previous update. I am now able to run on premium GPUs. Thank you!
Great!
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 852, in
main()
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 719, in main
accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 882, in backward
self.scaler.scale(loss).backward(*kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, args)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 414, in wrapper
outputs = fn(ctx, args)
File "/usr/local/lib/python3.8/dist-packages/xformers/ops/fmha/init.py", line 111, in backward
grads = _memory_efficient_attention_backward(
File "/usr/local/lib/python3.8/dist-packages/xformers/ops/fmha/init.py", line 381, in _memory_efficient_attention_backward
grads = op.apply(ctx, inp, grad)
File "/usr/local/lib/python3.8/dist-packages/xformers/ops/fmha/flash.py", line 339, in apply
cls.OPERATOR(
File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 442, in call
return self._op(args, **kwargs or {})
File "/usr/local/lib/python3.8/dist-packages/xformers/ops/fmha/flash.py", line 96, in _flash_bwd
_C_flashattention.bwd(
TypeError: bwd(): incompatible function arguments. The following argument types are supported:
Invoked with: tensor([[[-1.1425e-03, -1.4229e-03, 3.8314e-04, ..., -8.2922e-04, 3.0117e-03, 1.0176e-03]],
0% 0/3000 [00:02<?, ?it/s] Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--stop_text_encoder_training=250', '--image_captions_filename', '--train_only_unet', '--save_starting_step=500', '--save_n_steps=0', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/Jabba2', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/Jabba2/instance_images', '--output_dir=/content/models/Jabba2', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/Jabba2/captions', '--instance_prompt=', '--seed=574824', '--resolution=512', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=1e-05', '--lr_scheduler=polynomial', '--lr_warmup_steps=0', '--max_train_steps=3000']' returned non-zero exit status 1.
Something went wrong