Failed to enable layernorm kernel

MrPeterJin commented 7 months ago

First of all, I really appreciate your work! When I tried to use the framework to reproduce your results, I noticed that the layernorm kernel does not work on my side. Here is the log:

torchrun --standalone --nproc_per_node=1 scripts/dit/train_dit.py \
     --model DiT-XL/2 \
     --batch_size 2 \
     --enable_layernorm_kernel \
     --enable_flashattn \
     --mixed_precision bf16 \
     --num_classes 10
/home/cjinag/code/playground/ColossalAI/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/cjinag/code/playground/ColossalAI/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[03/20/24 15:13:27] INFO     colossalai - colossalai - INFO: /home/cjinag/code/playground/ColossalAI/colossalai/initialize.py:67     
                             launch                                                                                                  
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1                   
[2024-03-20 15:13:27] Experiment directory created at ./outputs/018-DiT-XL-2
[2024-03-20 15:13:39] Model params: 642.77 M
No ROCm runtime is found, using ROCM_HOME='/usr/local'
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.12372970581054688 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.14196491241455078 seconds
/home/cjinag/code/playground/ColossalAI/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Files already downloaded and verified
[2024-03-20 15:13:51] Dataset contains 50,000 images (./datasets)
[2024-03-20 15:13:51] Boost model for distributed training
[2024-03-20 15:13:51] Training for 1400 epochs...
[2024-03-20 15:13:51] Beginning epoch 0...
Epoch 0:   0%|                                                                                             | 0/25000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/cjinag/code/project/multimodal/OpenDiT/scripts/dit/train_dit.py", line 324, in <module>
    main(args)
  File "/home/cjinag/code/project/multimodal/OpenDiT/scripts/dit/train_dit.py", line 245, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/respace.py", line 90, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/gaussian_diffusion.py", line 708, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/diffusion/respace.py", line 120, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/code/playground/ColossalAI/colossalai/booster/plugin/low_level_zero_plugin.py", line 65, in forward
    return super().forward(*args, **kwargs)
  File "/home/cjinag/code/playground/ColossalAI/colossalai/interface/model.py", line 25, in forward
    return self.module(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/models/dit/dit.py", line 213, in forward
    x = block(x, c)  # (N, T, D)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/models/dit/dit.py", line 58, in forward
    modulate(self.norm1, x, shift_msa, scale_msa, self.enable_modulate_kernel)
  File "/home/cjinag/code/project/multimodal/OpenDiT/opendit/modules/layers.py", line 33, in modulate
    x = norm_func(x.to(torch.float32)).to(dtype)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 323, in forward
    return fused_layer_norm(input, self.normalized_shape, self.eps, self.memory_efficient)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 203, in fused_layer_norm
    return FusedLayerNormFunction.apply(*args)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 149, in forward
    output, mean, invvar = fused_layer_norm_cuda.forward(input_, ctx.normalized_shape, ctx.eps)
RuntimeError: memory format option is only supported by strided tensors
[2024-03-20 15:13:58,980] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 155022) of binary: /home/cjinag/anaconda3/envs/opendit/bin/python
Traceback (most recent call last):
  File "/home/cjinag/anaconda3/envs/opendit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/cjinag/anaconda3/envs/opendit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/dit/train_dit.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-20_15:13:58
  host      : 191host040.mobilenet.cse.ust.hk
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 155022)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Please advise possible solutions. Thanks!

oahzxl commented 7 months ago

It's a version mismatch problem with apex; your apex version may be too old or too new. For now, you can run the code without the enable_layernorm_kernel argument.
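
One way to decide whether to pass --enable_layernorm_kernel is to run a single forward pass through apex's FusedLayerNorm before training. The sketch below is only an illustration: the helper name `fused_layernorm_works` is hypothetical, the hidden size of 1152 is an assumption about DiT-XL/2, and the non-affine, eps=1e-6 configuration is a guess at what the model uses. It reports False when torch or apex is missing, when no CUDA device is visible, or when the kernel raises an error like the one in the log above.

```python
def fused_layernorm_works(hidden_size: int = 1152) -> bool:
    """Return True only if apex's FusedLayerNorm completes a forward pass here.

    hidden_size=1152 is an assumed value, not taken from the thread.
    """
    try:
        import torch
        from apex.normalization import FusedLayerNorm
    except ImportError:
        # torch or apex is not installed in this environment.
        return False

    if not torch.cuda.is_available():
        # The fused kernel is a CUDA extension, so there is nothing to test.
        return False

    try:
        norm = FusedLayerNorm(hidden_size, eps=1e-6, elementwise_affine=False).cuda()
        x = torch.randn(2, 16, hidden_size, device="cuda", dtype=torch.float32)
        norm(x)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return True
    except (RuntimeError, ImportError):
        # RuntimeError: the kernel failed at call time, as in the traceback.
        # ImportError: the compiled extension module could not be loaded.
        return False


if __name__ == "__main__":
    if fused_layernorm_works():
        print("apex FusedLayerNorm ran; --enable_layernorm_kernel looks usable")
    else:
        print("apex FusedLayerNorm did not run; train without --enable_layernorm_kernel")
```

A passing check only shows that one small forward pass ran; it does not guarantee that the kernel behaves correctly for every shape or dtype used in training.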

MrPeterJin commented 7 months ago

When I disabled the layernorm kernel, the code ran fine for me. However, I reinstalled OpenDiT with the versions recommended in your README file, and the error still occurs. Are there any other possible reasons?

oahzxl commented 7 months ago

Sorry, no clues. I suppose it's related to your environment and apex.

MrPeterJin commented 7 months ago

Then could you share your environment settings for reference (e.g. torch version, CUDA, etc.)? Your requirements.txt does not pin these. I suspect that something changed in the newer version of PyTorch and causes this error.

oahzxl commented 7 months ago

We use CUDA 11.8 and torch 2.1.2. Good luck!

MrPeterJin commented 7 months ago

What is the cuDNN version on your platform? You can get it by calling print(torch.backends.cudnn.version()).
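
For comparing two machines, it can help to print all of the versions discussed in this thread in one go. The following is a minimal sketch; the helper name `collect_env` is hypothetical. It uses `torch.__version__`, `torch.version.cuda` and `torch.backends.cudnn.version()`, and leaves the torch-related fields as None when torch is not installed. Note that the cuDNN version is reported as an integer code rather than a dotted string.

```python
import importlib
import platform


def collect_env() -> dict:
    """Gather the Python, torch, CUDA and cuDNN versions of this environment."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "cudnn": None,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed; report only the Python version.
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    if torch.backends.cudnn.is_available():
        info["cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    for name, value in collect_env().items():
        print(f"{name}: {value}")
```

Posting this output alongside an error log makes it easier to spot a mismatch with the maintainers' setup.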

oahzxl commented 7 months ago

cuDNN 8.9.7

MrPeterJin commented 7 months ago

I noticed that following the installation guidelines in your README automatically updates torch and other dependencies to their newest versions, which causes the version mismatch. So I think you probably need to pin the versions in your environment settings.
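
Until the versions are pinned, a quick check against the reference setup can catch a silent upgrade. The sketch below is an assumption-laden illustration: `REFERENCE` and `find_mismatches` are hypothetical names, and the only pinned value is the torch 2.1.2 reported earlier in this thread. Local build suffixes such as "+cu118" are stripped before comparing.

```python
from importlib import metadata

# Reference version reported by the maintainers in this thread.
REFERENCE = {"torch": "2.1.2"}


def find_mismatches(reference, installed=None):
    """Return {package: (wanted, found)} for every package that differs.

    `installed` may be a dict of package versions (handy for testing);
    when omitted, versions are read from the current environment.
    """
    mismatches = {}
    for name, wanted in reference.items():
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None
        # Drop a local build tag such as "+cu118" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(REFERENCE)
    if not problems:
        print("Installed versions match the reference environment.")
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found}")
```

Running this right after installation shows whether the package manager pulled in a newer torch than the one the maintainers tested with.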

MrPeterJin commented 7 months ago

Thanks for providing your settings. With them, training now starts successfully.

NUS-HPC-AI-Lab / VideoSys

Failed to enable layernorm kernel #113