nan loss during training #208

magickaito opened 1 year ago

magickaito commented 1 year ago

Hi guys, I am using this colab notebook by pedrogengo.

For unknown reasons, I keep getting nan loss during training. This happens whenever the number of training steps is higher than 500. With 500 steps it appears to be OK (but that is too few to be usable).

This happened both on my copy of the notebook on Google Colab and on a hosted RunPod PyTorch container with 24 GB of GPU memory.

These are the configurations:

PROMPT="a photo of wendy030305 man"
STEPS = 1000
FP_16 = True

Not much else was changed.

This is the output that shows the loss becoming nan during training:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/usr/local/lib/python3.10/dist-packages/accelerate/ FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
/usr/local/lib/python3.10/dist-packages/accelerate/ UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Downloading (…)tokenizer/vocab.json: 100%|██| 1.06M/1.06M [00:03<00:00, 347kB/s]
Downloading (…)tokenizer/merges.txt: 100%|████| 525k/525k [00:01<00:00, 432kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████| 472/472 [00:00<00:00, 540kB/s]
Downloading (…)okenizer_config.json: 100%|██████| 806/806 [00:00<00:00, 468kB/s]
Downloading (…)_encoder/config.json: 100%|██████| 617/617 [00:00<00:00, 368kB/s]
Downloading (…)"model.safetensors";: 100%|███| 492M/492M [00:07<00:00, 62.2MB/s]
Downloading (…)_model.safetensors";: 100%|███| 335M/335M [00:05<00:00, 58.0MB/s]
Downloading (…)main/vae/config.json: 100%|██████| 547/547 [00:00<00:00, 318kB/s]
Downloading (…)_model.safetensors";: 100%|█| 3.44G/3.44G [00:57<00:00, 59.6MB/s]
Downloading (…)ain/unet/config.json: 100%|██████| 743/743 [00:00<00:00, 367kB/s]
Before training: Unet First Layer lora up tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
Before training: Unet First Layer lora down tensor([[ 0.0313,  0.0529,  0.0502,  ...,  0.0460, -0.0742,  0.0654],
        [-0.0749,  0.0173,  0.0325,  ...,  0.0723, -0.1217, -0.0258],
        [-0.0430,  0.0557,  0.0130,  ..., -0.0450, -0.0533,  0.1434],
        [ 0.0225, -0.0323, -0.0743,  ...,  0.0159, -0.1046, -0.1281],
        [-0.0461,  0.0156, -0.0570,  ..., -0.0991, -0.0100,  0.0261],
        [-0.0122, -0.0389, -0.0491,  ..., -0.0592,  0.0051,  0.0871]])
Before training: text encoder First Layer lora up tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
Before training: text encoder First Layer lora down tensor([[-0.0043,  0.0337, -0.0224,  ..., -0.0400,  0.0368, -0.0298],
        [ 0.0145, -0.0724,  0.0391,  ..., -0.0054, -0.0377,  0.0256],
        [-0.0769,  0.1469, -0.0160,  ...,  0.0818, -0.0235, -0.0753],
        [ 0.0431,  0.0232, -0.0489,  ..., -0.0584, -0.0682,  0.0089],
        [ 0.0007, -0.1088, -0.0459,  ...,  0.0215, -0.0274, -0.0291],
        [ 0.1224, -0.1680,  0.0102,  ...,  0.0027,  0.1284,  0.0541]])

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
/usr/local/lib/python3.10/dist-packages/diffusers/ FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Downloading (…)cheduler_config.json: 100%|██████| 308/308 [00:00<00:00, 302kB/s]
***** Running training *****
  Num examples = 20
  Num batches each epoch = 20
  Num Epochs = 50
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1000
Steps:  30%|███▌        | 301/1000 [01:11<03:00,  3.87it/s, loss=nan, lr=0.0003]^C
Traceback (most recent call last):
  File "/workspace/lora/training_scripts/", line 1008, in <module>
  File "/workspace/lora/training_scripts/", line 843, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/", line 489, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/", line 580, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/", line 837, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/", line 265, in forward
    hidden_states = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/", line 291, in forward
    attn_output = self.attn1(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/", line 205, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/", line 300, in __call__
    query = attn.to_q(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lora_diffusion/", line 56, in forward
    + self.dropout(self.lora_up(self.selector(self.lora_down(input))))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/", line 114, in forward
    return F.linear(input, self.weight, self.bias)
Steps:  30%|███▌        | 302/1000 [01:11<02:45,  4.22it/s, loss=nan, lr=0.0003]

What could be wrong here?

And in case it helps, this is the output from the first installation step:

jackspp commented 1 year ago

Similar problem here. With PTI, the loss becomes nan right after the "Before training" output.

jameskuma commented 10 months ago

Same issue here, but I use the SD v2.1 model. The weird thing is that when I call inject_trainable_lora on the SD model with target_replace_module={"CrossAttention"}, it returns no parameters. For more details, my code is as follows:

from diffusers import UNet2DConditionModel
from lora_diffusion import inject_trainable_lora
unet2d = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
params_1, name = inject_trainable_lora(unet2d, {"CrossAttention"}, verbose=True, r=4, scale=1.0)

Has anyone run into the same problem?
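Since inject_trainable_lora selects modules by class name (the target_replace_module set), one way to investigate an empty result is to list which class names in the model actually own nn.Linear layers. The sketch below is only a diagnostic aid: the helper linear_parent_class_names and the toy Attention class are hypothetical, and the idea that the installed diffusers version names its attention class something other than "CrossAttention" is an assumption to verify, not something confirmed in this thread.

```python
from collections import Counter

import torch.nn as nn


def linear_parent_class_names(model: nn.Module) -> Counter:
    """Count, per class name, the modules that directly own an nn.Linear child."""
    counts = Counter()
    for module in model.modules():
        if any(isinstance(child, nn.Linear) for child in module.children()):
            counts[module.__class__.__name__] += 1
    return counts


# Toy stand-in for a UNet block; the class name "Attention" is hypothetical.
class Attention(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_q = nn.Linear(8, 8)
        self.to_k = nn.Linear(8, 8)


toy = nn.Sequential(Attention(), Attention(), nn.ReLU())
names = linear_parent_class_names(toy)
print(names)

# If the name passed as target_replace_module is missing from this listing,
# no layer can match it and nothing gets injected.
print("CrossAttention" in names)
```

Running the same helper on the real unet2d object would show which names are worth passing as target_replace_module.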

jameskuma commented 10 months ago

Hi, everyone!

I just found out that my issue is caused by the data type!

My code is as follows, and I hope it helps anyone who runs into the same problem.

For convenience, I rewrote the function inject_trainable_lora as

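from typing import Set

import torch
import torch.nn as nn

# Assumption: these helpers are importable from lora_diffusion.lora,
# the module where inject_trainable_lora is defined upstream.
from lora_diffusion.lora import (
    DEFAULT_TARGET_REPLACE,
    LoraInjectedLinear,
    _find_modules,
)


def inject_trainable_lora(
    model: nn.Module,
    target_replace_module: Set[str] = DEFAULT_TARGET_REPLACE,
    r: int = 4,
    loras=None,  # path to lora .pt
    verbose: bool = False,
    dropout_p: float = 0.0,
    scale: float = 1.0,
):
    """
    inject lora into model, and returns lora parameter groups.
    """

    # 👉 store parameters in ModuleList
    require_grad_params = torch.nn.ModuleList()

    if loras is not None:
        loras = torch.load(loras)

    for _module, name, _child_module in _find_modules(
        model, target_replace_module, search_class=[nn.Linear]
    ):
        weight = _child_module.weight
        bias = _child_module.bias
        if verbose:
            print("LoRA Injection : injecting lora into ", name)
            print("LoRA Injection : weight shape", weight.shape)
        # Assumed: every argument except the bias flag follows the upstream
        # LoraInjectedLinear constructor; only the bias flag is in the post.
        _tmp = LoraInjectedLinear(
            _child_module.in_features,
            _child_module.out_features,
            _child_module.bias is not None,
            r=r,
            dropout_p=dropout_p,
            scale=scale,
        )
        _tmp.linear.weight = weight
        if bias is not None:
            _tmp.linear.bias = bias

        # switch the module
        _module._modules[name] = _tmp

        # 👉 append lora layer
        # Assumed: the LoRA sub-layers are what gets stored in the ModuleList.
        require_grad_params.append(_module._modules[name].lora_up)
        require_grad_params.append(_module._modules[name].lora_down)

        if loras is not None:
            _module._modules[name].lora_up.weight = loras.pop(0)
            _module._modules[name].lora_down.weight = loras.pop(0)

        _module._modules[name].lora_up.weight.requires_grad = True
        _module._modules[name].lora_down.weight.requires_grad = True

    return require_grad_params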

This way, the lora parameters can be passed to the optimizer more easily:

import torch
from diffusers import UNet2DConditionModel
from lora_diffusion import inject_trainable_lora
unet2d = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
params_1 = inject_trainable_lora(unet2d, {"UNet2DConditionModel"}, verbose=True, r=4, scale=1.0)
optim = torch.optim.AdamW(params_1.parameters(), lr=0.0001)

If you run into something like loss = nan, please check the data types: there might be a mixture of torch.float32 and torch.float16 in use. In that case you need to set the data type to torch.float32!
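A minimal sketch of that check, assuming plain PyTorch: dtype_report is a hypothetical helper (not part of lora_diffusion or diffusers) that counts parameter tensors per dtype, and model.float() casts everything to torch.float32. The toy model stands in for the UNet. The last lines show one common way float16 ends up non-finite: values above its maximum of 65504 overflow to inf, which can later turn into nan.

```python
from collections import Counter

import torch
import torch.nn as nn


def dtype_report(model: nn.Module) -> Counter:
    """Count parameter tensors per dtype (hypothetical helper)."""
    return Counter(p.dtype for p in model.parameters())


# Toy model standing in for the UNet: both layers start in half precision.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2)).half()

before = dtype_report(model)
model.float()  # cast all parameters and buffers to torch.float32 in place
after = dtype_report(model)

print("before:", before)
print("after: ", after)

# float16 cannot represent 70000, so the conversion overflows to inf.
overflowed = torch.tensor(70000.0).half()
print(torch.isinf(overflowed).item())
```

If the report on the real model shows more than one dtype among the parameters being trained, that is the mixture described above.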