cloneofsimo / lora

Using Low-rank adaptation to quickly fine-tune diffusion models.
https://arxiv.org/abs/2106.09685
Apache License 2.0

nan loss during training #208

Open magickaito opened 1 year ago

magickaito commented 1 year ago

Hi guys, I am using this Colab notebook by pedrogengo.

For unknown reasons, I keep getting nan loss during training. This happens whenever the number of training steps is higher than 500. With 500 steps it appears OK (but that is too few steps to give a usable result).

This happened on both my copy of Google Colab and a hosted RunPod PyTorch container with 24 GB of GPU memory.

These are the configurations:

PRETRAINED_MODEL="runwayml/stable-diffusion-v1-5"
PROMPT="a photo of wendy030305 man"
OUTPUT_DIR="output030305"
IMAGES_FOLDER_OPTIONAL="training_images"
RESOLUTION="512"
RESOLUTION=int(RESOLUTION)
STEPS = 1000
BATCH_SIZE = 1
FP_16 = True
LEARNING_RATE = 3e-4
TRAIN_TEXT_ENCODER = True
LEARNING_RATE_TEXT_ENCODER = 1e-5

Not much else was changed.
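For reference, fp16 overflow is a common cause of nan loss in mixed-precision fine-tuning, especially at a relatively high learning rate. Below is a minimal, hypothetical guard (not part of the notebook or of train_lora_dreambooth.py) that could be dropped into the training loop to fail fast at the first non-finite loss instead of letting nan propagate:

import torch

def check_loss_finite(loss: torch.Tensor, step: int) -> None:
    # Hypothetical helper: raise as soon as the loss stops being finite, so a
    # single fp16 overflow is caught immediately instead of propagating as nan.
    if not torch.isfinite(loss).all():
        raise RuntimeError(
            f"non-finite loss at step {step}; "
            "consider FP_16 = False or a lower LEARNING_RATE"
        )

# illustrative placement inside the loop:
#   loss = F.mse_loss(model_pred.float(), target.float())
#   check_loss_finite(loss, global_step)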

This is the output that shows the loss becoming nan during training:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:231: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:336: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Downloading (…)tokenizer/vocab.json: 100%|██| 1.06M/1.06M [00:03<00:00, 347kB/s]
Downloading (…)tokenizer/merges.txt: 100%|████| 525k/525k [00:01<00:00, 432kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████| 472/472 [00:00<00:00, 540kB/s]
Downloading (…)okenizer_config.json: 100%|██████| 806/806 [00:00<00:00, 468kB/s]
Downloading (…)_encoder/config.json: 100%|██████| 617/617 [00:00<00:00, 368kB/s]
Downloading (…)"model.safetensors";: 100%|███| 492M/492M [00:07<00:00, 62.2MB/s]
Downloading (…)_model.safetensors";: 100%|███| 335M/335M [00:05<00:00, 58.0MB/s]
Downloading (…)main/vae/config.json: 100%|██████| 547/547 [00:00<00:00, 318kB/s]
Downloading (…)_model.safetensors";: 100%|█| 3.44G/3.44G [00:57<00:00, 59.6MB/s]
Downloading (…)ain/unet/config.json: 100%|██████| 743/743 [00:00<00:00, 367kB/s]
Before training: Unet First Layer lora up tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
Before training: Unet First Layer lora down tensor([[ 0.0313,  0.0529,  0.0502,  ...,  0.0460, -0.0742,  0.0654],
        [-0.0749,  0.0173,  0.0325,  ...,  0.0723, -0.1217, -0.0258],
        [-0.0430,  0.0557,  0.0130,  ..., -0.0450, -0.0533,  0.1434],
        ...,
        [ 0.0225, -0.0323, -0.0743,  ...,  0.0159, -0.1046, -0.1281],
        [-0.0461,  0.0156, -0.0570,  ..., -0.0991, -0.0100,  0.0261],
        [-0.0122, -0.0389, -0.0491,  ..., -0.0592,  0.0051,  0.0871]])
Before training: text encoder First Layer lora up tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
Before training: text encoder First Layer lora down tensor([[-0.0043,  0.0337, -0.0224,  ..., -0.0400,  0.0368, -0.0298],
        [ 0.0145, -0.0724,  0.0391,  ..., -0.0054, -0.0377,  0.0256],
        [-0.0769,  0.1469, -0.0160,  ...,  0.0818, -0.0235, -0.0753],
        ...,
        [ 0.0431,  0.0232, -0.0489,  ..., -0.0584, -0.0682,  0.0089],
        [ 0.0007, -0.1088, -0.0459,  ...,  0.0215, -0.0274, -0.0291],
        [ 0.1224, -0.1680,  0.0102,  ...,  0.0027,  0.1284,  0.0541]])

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Downloading (…)cheduler_config.json: 100%|██████| 308/308 [00:00<00:00, 302kB/s]
***** Running training *****
  Num examples = 20
  Num batches each epoch = 20
  Num Epochs = 50
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1000
Steps:  30%|███▌        | 301/1000 [01:11<03:00,  3.87it/s, loss=nan, lr=0.0003]^C
Traceback (most recent call last):
  File "/workspace/lora/training_scripts/train_lora_dreambooth.py", line 1008, in <module>
    main(args)
  File "/workspace/lora/training_scripts/train_lora_dreambooth.py", line 843, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 489, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py", line 580, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 837, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 265, in forward
    hidden_states = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 291, in forward
    attn_output = self.attn1(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/cross_attention.py", line 205, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/cross_attention.py", line 300, in __call__
    query = attn.to_q(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lora_diffusion/lora.py", line 56, in forward
    + self.dropout(self.lora_up(self.selector(self.lora_down(input))))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
KeyboardInterrupt
Steps:  30%|███▌        | 302/1000 [01:11<02:45,  4.22it/s, loss=nan, lr=0.0003]

What could be wrong here?

And if it helps, this is the output from the first installation step:

Cloning into 'lora'...
remote: Enumerating objects: 934, done.
remote: Counting objects: 100% (473/473), done.
remote: Compressing objects: 100% (177/177), done.
remote: Total 934 (delta 339), reused 374 (delta 296), pack-reused 461
Receiving objects: 100% (934/934), 182.98 MiB | 8.52 MiB/s, done.
Resolving deltas: 100% (547/547), done.
Processing ./lora
  Preparing metadata (setup.py) ... done
Collecting diffusers>=0.11.0
  Downloading diffusers-0.13.1-py3-none-any.whl (716 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 716.4/716.4 kB 11.2 MB/s eta 0:00:00a 0:00:01
Collecting transformers>=4.25.1
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 81.8 MB/s eta 0:00:00:00:0100:01
Collecting scipy
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 73.3 MB/s eta 0:00:0000:0100:01
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 12.7 MB/s eta 0:00:00
Collecting fire
  Downloading fire-0.5.0.tar.gz (88 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.3/88.3 kB 14.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting wandb
  Downloading wandb-0.13.10-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 82.9 MB/s eta 0:00:00
Collecting safetensors
  Downloading safetensors-0.2.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 4.7 MB/s eta 0:00:0000:0100:01mm
Collecting opencv-python
  Downloading opencv_python-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (61.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.8/61.8 MB 45.5 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from lora-diffusion==0.1.7) (0.14.1+cu116)
Collecting mediapipe
  Downloading mediapipe-0.9.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (33.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.0/33.0 MB 52.7 MB/s eta 0:00:0000:0100:01
Collecting huggingface-hub>=0.10.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 kB 30.2 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from diffusers>=0.11.0->lora-diffusion==0.1.7) (1.24.2)
Collecting filelock
  Downloading filelock-3.9.0-py3-none-any.whl (9.7 kB)
Collecting regex!=2019.12.17
  Downloading regex-2022.10.31-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (770 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 770.5/770.5 kB 60.8 MB/s eta 0:00:00
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from diffusers>=0.11.0->lora-diffusion==0.1.7) (2.28.2)
Collecting importlib-metadata
  Downloading importlib_metadata-6.0.0-py3-none-any.whl (21 kB)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from diffusers>=0.11.0->lora-diffusion==0.1.7) (9.4.0)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.25.1->lora-diffusion==0.1.7) (6.0)
Collecting tqdm>=4.27
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 24.3 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.25.1->lora-diffusion==0.1.7) (23.0)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 61.3 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from fire->lora-diffusion==0.1.7) (1.14.0)
Collecting termcolor
  Downloading termcolor-2.2.0-py3-none-any.whl (6.6 kB)
Requirement already satisfied: wcwidth>=0.2.5 in /usr/local/lib/python3.10/dist-packages (from ftfy->lora-diffusion==0.1.7) (0.2.6)
Requirement already satisfied: attrs>=19.1.0 in /usr/local/lib/python3.10/dist-packages (from mediapipe->lora-diffusion==0.1.7) (22.2.0)
Collecting absl-py
  Downloading absl_py-1.4.0-py3-none-any.whl (126 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.5/126.5 kB 28.3 MB/s eta 0:00:00
Collecting protobuf<4,>=3.11
  Downloading protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 72.0 MB/s eta 0:00:00
Collecting matplotlib
  Downloading matplotlib-3.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.6/11.6 MB 66.1 MB/s eta 0:00:0000:010:01m
Collecting opencv-contrib-python
  Downloading opencv_contrib_python-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (67.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.9/67.9 MB 37.1 MB/s eta 0:00:0000:0100:01
Collecting flatbuffers>=2.0
  Downloading flatbuffers-23.1.21-py2.py3-none-any.whl (26 kB)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torchvision->lora-diffusion==0.1.7) (4.5.0)
Requirement already satisfied: torch==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torchvision->lora-diffusion==0.1.7) (1.13.1+cu116)
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.31-py3-none-any.whl (184 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 184.3/184.3 kB 31.7 MB/s eta 0:00:00
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from wandb->lora-diffusion==0.1.7) (67.3.2)
Collecting setproctitle
  Downloading setproctitle-1.3.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting Click!=8.0.0,>=7.0
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96.6/96.6 kB 24.8 MB/s eta 0:00:00
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.16.0-py2.py3-none-any.whl (184 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 184.3/184.3 kB 38.7 MB/s eta 0:00:00
Collecting appdirs>=1.4.3
  Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Requirement already satisfied: psutil>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from wandb->lora-diffusion==0.1.7) (5.9.4)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 kB 12.6 MB/s eta 0:00:00
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3/dist-packages (from requests->diffusers>=0.11.0->lora-diffusion==0.1.7) (1.25.8)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers>=0.11.0->lora-diffusion==0.1.7) (3.0.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests->diffusers>=0.11.0->lora-diffusion==0.1.7) (2019.11.28)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests->diffusers>=0.11.0->lora-diffusion==0.1.7) (2.8)
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.6/140.6 kB 26.2 MB/s eta 0:00:00
Collecting zipp>=0.5
  Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe->lora-diffusion==0.1.7) (2.8.2)
Collecting pyparsing>=2.3.1
  Downloading pyparsing-3.0.9-py3-none-any.whl (98 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.3/98.3 kB 20.5 MB/s eta 0:00:00
Collecting contourpy>=1.0.1
  Downloading contourpy-1.0.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (300 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 300.3/300.3 kB 44.2 MB/s eta 0:00:00
Collecting fonttools>=4.22.0
  Downloading fonttools-4.38.0-py3-none-any.whl (965 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 965.4/965.4 kB 48.2 MB/s eta 0:00:00
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.4.4-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 68.7 MB/s eta 0:00:00
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Building wheels for collected packages: lora-diffusion, fire, pathtools
  Building wheel for lora-diffusion (setup.py) ... done
  Created wheel for lora-diffusion: filename=lora_diffusion-0.1.7-py3-none-any.whl size=37990 sha256=11b8d0d058562b15778ed5317c6e107a473adf09cdc97061558a01300847925f
  Stored in directory: /tmp/pip-ephem-wheel-cache-9s6i78m3/wheels/4c/f3/eb/4cba90b61013bf007c3c6a92051c47a7f8cefbeb5aeec358e6
  Building wheel for fire (setup.py) ... done
  Created wheel for fire: filename=fire-0.5.0-py2.py3-none-any.whl size=116931 sha256=ea671185772f288d8eaf228d23bff8011dfc71b197a48f7f6b265361e5535136
  Stored in directory: /root/.cache/pip/wheels/90/d4/f7/9404e5db0116bd4d43e5666eaa3e70ab53723e1e3ea40c9a95
  Building wheel for pathtools (setup.py) ... done
  Created wheel for pathtools: filename=pathtools-0.1.2-py3-none-any.whl size=8791 sha256=22cc08b8c24db3532f7f926a68cd2960fb851d88eba1066b7cd1662a7c122417
  Stored in directory: /root/.cache/pip/wheels/e7/f3/22/152153d6eb222ee7a56ff8617d80ee5207207a8c00a7aab794
Successfully built lora-diffusion fire pathtools
Installing collected packages: tokenizers, safetensors, pathtools, flatbuffers, appdirs, zipp, urllib3, tqdm, termcolor, smmap, setproctitle, scipy, regex, pyparsing, protobuf, opencv-python, opencv-contrib-python, kiwisolver, ftfy, fonttools, filelock, docker-pycreds, cycler, contourpy, Click, absl-py, sentry-sdk, matplotlib, importlib-metadata, gitdb, fire, mediapipe, huggingface-hub, GitPython, wandb, transformers, diffusers, lora-diffusion
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.8
    Uninstalling urllib3-1.25.8:
      Successfully uninstalled urllib3-1.25.8
Successfully installed Click-8.1.3 GitPython-3.1.31 absl-py-1.4.0 appdirs-1.4.4 contourpy-1.0.7 cycler-0.11.0 diffusers-0.13.1 docker-pycreds-0.4.0 filelock-3.9.0 fire-0.5.0 flatbuffers-23.1.21 fonttools-4.38.0 ftfy-6.1.1 gitdb-4.0.10 huggingface-hub-0.12.1 importlib-metadata-6.0.0 kiwisolver-1.4.4 lora-diffusion-0.1.7 matplotlib-3.7.0 mediapipe-0.9.1.0 opencv-contrib-python-4.7.0.72 opencv-python-4.7.0.72 pathtools-0.1.2 protobuf-3.20.3 pyparsing-3.0.9 regex-2022.10.31 safetensors-0.2.8 scipy-1.10.1 sentry-sdk-1.16.0 setproctitle-1.3.2 smmap-5.0.0 termcolor-2.2.0 tokenizers-0.13.2 tqdm-4.64.1 transformers-4.26.1 urllib3-1.26.14 wandb-0.13.10 zipp-3.15.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: python -m pip install --upgrade pip
Collecting accelerate
  Downloading accelerate-0.16.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 kB 5.1 MB/s eta 0:00:00a 0:00:01
Collecting bitsandbytes
  Downloading bitsandbytes-0.37.0-py3-none-any.whl (76.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.3/76.3 MB 45.6 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: torch>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (1.13.1+cu116)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate) (1.24.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (23.0)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate) (6.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate) (5.9.4)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.4.0->accelerate) (4.5.0)
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.16.0 bitsandbytes-0.37.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: python -m pip install --upgrade pip
jackspp commented 1 year ago

Similar problem: when running PTI, the loss becomes nan before training.

jameskuma commented 10 months ago

Same issue here, but I use the SD v2.1 model. The weird thing is that when I use inject_trainable_lora on the SD model with target_replace_module=["CrossAttention"], it returns no parameters. For more details, my code is as follows:

from diffusers import UNet2DConditionModel
from lora_diffusion import inject_trainable_lora
unet2d = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
params_1, name = inject_trainable_lora(unet2d, {"CrossAttention"}, verbose=True, r=4, scale=1.0)
print(params_1)

Has anyone met the same problem?
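One way to check this (a minimal sketch, not from the thread) is to print the module class names that actually occur inside the UNet and compare them with the target set; in recent diffusers releases the attention block class is named Attention rather than CrossAttention, in which case {"CrossAttention"} matches nothing:

from diffusers import UNet2DConditionModel

# List attention/transformer class names present in the UNet so that
# target_replace_module can be matched against what actually exists.
unet2d = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)
names = {m.__class__.__name__ for m in unet2d.modules()}
print(sorted(n for n in names if "Attention" in n or "Transformer" in n))

If "CrossAttention" does not appear in that list, passing the name that does (for example {"Attention"}) should make inject_trainable_lora return a non-empty parameter list.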

jameskuma commented 10 months ago

Hi, everyone!

I just found that my issue was caused by the data type!

My code is as follows, and I hope it can help anyone who meets the same problem.

For more convenient usage, I rewrote the function inject_trainable_lora as

import torch
import torch.nn as nn
from typing import Set

# helpers reused from the library (lora_diffusion/lora.py)
from lora_diffusion.lora import (
    DEFAULT_TARGET_REPLACE,
    LoraInjectedLinear,
    _find_modules,
)

def inject_trainable_lora(
    model: nn.Module,
    target_replace_module: Set[str] = DEFAULT_TARGET_REPLACE,
    r: int = 4,
    loras=None,  # path to lora .pt
    verbose: bool = False,
    dropout_p: float = 0.0,
    scale: float = 1.0,
):
    """
    Inject LoRA layers into the model and return them as a torch.nn.ModuleList.
    """

    # 👉 store parameters in ModuleList
    require_grad_params = torch.nn.ModuleList()

    if loras is not None:
        loras = torch.load(loras)

    for _module, name, _child_module in _find_modules(
        model, target_replace_module, search_class=[nn.Linear]
    ):
        weight = _child_module.weight
        bias = _child_module.bias
        if verbose:
            print("LoRA Injection : injecting lora into ", name)
            print("LoRA Injection : weight shape", weight.shape)
        _tmp = LoraInjectedLinear(
            _child_module.in_features,
            _child_module.out_features,
            _child_module.bias is not None,
            r=r,
            dropout_p=dropout_p,
            scale=scale,
        )
        _tmp.linear.weight = weight
        if bias is not None:
            _tmp.linear.bias = bias

        # switch the module
        _tmp.to(_child_module.weight.device).to(_child_module.weight.dtype)
        _module._modules[name] = _tmp

        # 👉 append lora layer
        require_grad_params.append(_module._modules[name].lora_up)
        require_grad_params.append(_module._modules[name].lora_down)

        if loras is not None:
            _module._modules[name].lora_up.weight = loras.pop(0)
            _module._modules[name].lora_down.weight = loras.pop(0)

        _module._modules[name].lora_up.weight.requires_grad = True
        _module._modules[name].lora_down.weight.requires_grad = True

    return require_grad_params

In this way, we can add the LoRA parameters to the optimizer more easily:

import torch

from diffusers import UNet2DConditionModel
# note: this assumes the rewritten inject_trainable_lora above (which returns a
# ModuleList) is the version in scope
from lora_diffusion import inject_trainable_lora
unet2d = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
params_1 = inject_trainable_lora(unet2d, {"UNet2DConditionModel"}, verbose=True, r=4, scale=1.0)
optim = torch.optim.AdamW(params_1.parameters(), lr=0.0001)

If you have an issue like loss = nan, please check the data type: there might be a mixture of torch.float32 and torch.float16 in use. You need to set the data type to torch.float32!
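To make that concrete, here is a minimal sketch (a hypothetical helper, building on the rewritten inject_trainable_lora above) that reports and casts any injected LoRA weight that is not torch.float32 before the optimizer is created:

import torch

def force_lora_fp32(lora_params: torch.nn.ModuleList) -> None:
    # Cast every injected LoRA weight to float32 so fp16 and fp32 tensors are
    # not silently mixed in the same forward/backward pass.
    for name, param in lora_params.named_parameters():
        if param.dtype != torch.float32:
            print(f"casting {name}: {param.dtype} -> torch.float32")
            param.data = param.data.float()

# illustrative usage with the snippet above:
#   force_lora_fp32(params_1)
#   optim = torch.optim.AdamW(params_1.parameters(), lr=0.0001)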