facebookincubator / AITemplate

AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code, specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Apache License 2.0

problems encountered when doing benchmark test #558

Open handoku opened 1 year ago

handoku commented 1 year ago

I am running benchmark tests for UNet with AIT on A100/A10/T4, etc.

The tests on T4 have finished and work well.

However, on the A100 the build process stopped during the profiling step. The logs are as follows:

```bash
2023-04-09 11:57:58,471 INFO <aitemplate.compiler.ops.conv.conv_common> generating profiler_filename='conv2d_bias_add_identity_433da49a14b3f2b9721875f8f077fac5b46b7b61_3'
2023-04-09 11:57:58,557 INFO <aitemplate.utils.environ> force_cache=False
2023-04-09 11:57:58,558 INFO <aitemplate.compiler.ops.conv.conv_common> generating profiler_filename='conv2d_bias_8dfda00a58a23814d1fe705d888dae8547117ce4_3'
2023-04-09 11:57:58,566 INFO <aitemplate.compiler.transform.profile> generated 298 profilers elapsed time: 0:00:31.797149
2023-04-09 11:57:58,566 INFO <aitemplate.backend.builder> Using 112 CPU for building
2023-04-09 11:57:58,567 INFO <aitemplate.backend.builder> combined 78 profiler sources into 78
2023-04-09 11:57:58,568 INFO <aitemplate.backend.builder> compiling 78 profiler sources
2023-04-09 11:57:58,568 INFO <aitemplate.backend.builder> linking 12 profiler executables
2023-04-09 11:59:58,296 INFO <aitemplate.compiler.transform.profile> compiled profilers elapsed time: 0:01:59.729799
2023-04-09 11:59:58,299 INFO <aitemplate.utils.environ> force_cache=False
2023-04-09 11:59:58,299 INFO <aitemplate.compiler.ops.conv.conv2d> Profile: conv2d_bias_128: NI == 2 && HI == 64 && WI == 64 && CI == 9
Traceback (most recent call last):
  File "test.py", line 61, in <module>
    compile_unet(
  File "/root/AITemplate/examples/05_stable_diffusion/src/compile_lib/compile_unet.py", line 93, in compile_unet
    compile_model(Y, target, tmp_dir, "UNet2DConditionModel", constants=params_ait)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 271, in compile_model
    compiler.transform.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/transform/profile.py", line 103, in profile
    f.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/conv/conv2d.py", line 567, in profile
    self._profile_static(workdir, devices)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/conv/conv2d.py", line 607, in _profile_static
    best_algo, workspace = self._profile_single_workload(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/conv/conv2d.py", line 466, in _profile_single_workload
    tmp_key = next(iter(self._attrs["op_instance"].keys()))
StopIteration
```

I have no idea what happened. Does it mean that although the build procedure finished, it just cannot find an implementation for this conv2d op instance? How can I fix it?
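For context on the `StopIteration` itself: it is raised because the op's `op_instance` table is empty, so there is not a single candidate kernel to profile for that conv2d shape. A minimal sketch of the failing pattern (illustrative stand-in only, not AIT's actual code):

```python
# Minimal illustration of why the profiler raises StopIteration:
# the op's "op_instance" dict is empty, i.e. no CUTLASS kernel was
# generated for this conv2d configuration. (Hypothetical stand-in.)
op_attrs = {"op_instance": {}}  # empty: no kernel matched this config

try:
    tmp_key = next(iter(op_attrs["op_instance"].keys()))
except StopIteration:
    tmp_key = None  # zero candidate kernels to profile

print(tmp_key)  # -> None
```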

On the A10, the compile fails with:

```bash
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 61, in <module>
    compile_unet(
  File "/root/AITemplate/examples/05_stable_diffusion/src/compile_lib/compile_unet.py", line 93, in compile_unet
    compile_model(Y, target, tmp_dir, "UNet2DConditionModel", constants=params_ait)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 222, in compile_model
    with target:
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/backend/cuda/target_def.py", line 156, in __enter__
    self._operators = f_gen_ops(self._arch)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/backend/cuda/utils.py", line 60, in gen_ops
    raise NotImplementedError(
NotImplementedError: Arch 86 is not supported by current cutlass lib.
```


CUTLASS C++ should support the A10 according to their docs. Is it just because pycutlass does not provide a generator for SM86?
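The `NotImplementedError` in the traceback comes from an allowlist-style arch check in the CUDA backend: any compute capability outside the supported set is rejected before codegen starts. A rough sketch of that dispatch pattern (the supported set below is hypothetical, not AIT's real list):

```python
# Hypothetical sketch of arch gating like aitemplate's gen_ops;
# SUPPORTED_ARCHS here is illustrative, not the real supported set.
SUPPORTED_ARCHS = {"75", "80"}

def gen_ops(arch: str) -> str:
    if arch not in SUPPORTED_ARCHS:
        raise NotImplementedError(
            f"Arch {arch} is not supported by current cutlass lib."
        )
    return f"ops-for-sm{arch}"

print(gen_ops("80"))  # -> ops-for-sm80
```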

And if I forcibly use Arch 80 for the A10, I get the same error as on the A100:

```bash
2023-04-09 12:25:47,331 INFO <aitemplate.compiler.ops.conv.conv_common> generating profiler_filename='conv2d_bias_8dfda00a58a23814d1fe705d888dae8547117ce4_3'
2023-04-09 12:25:47,340 INFO <aitemplate.compiler.transform.profile> generated 298 profilers elapsed time: 0:00:47.909045
2023-04-09 12:25:47,340 INFO <aitemplate.backend.builder> Using 236 CPU for building
2023-04-09 12:25:47,342 INFO <aitemplate.backend.builder> combined 78 profiler sources into 78
2023-04-09 12:25:47,343 INFO <aitemplate.backend.builder> compiling 78 profiler sources
2023-04-09 12:25:47,343 INFO <aitemplate.backend.builder> linking 12 profiler executables
2023-04-09 12:26:31,779 INFO <aitemplate.compiler.transform.profile> compiled profilers elapsed time: 0:00:44.438382
2023-04-09 12:26:31,782 INFO <aitemplate.utils.environ> force_cache=False
2023-04-09 12:26:31,782 INFO <aitemplate.compiler.ops.conv.conv2d> Profile: conv2d_bias_128: NI == 2 && HI == 64 && WI == 64 && CI == 9
Traceback (most recent call last):
  File "test.py", line 61, in <module>
    compile_unet(
  File "/root/AITemplate/examples/05_stable_diffusion/src/compile_lib/compile_unet.py", line 93, in compile_unet
    compile_model(Y, target, tmp_dir, "UNet2DConditionModel", constants=params_ait)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 271, in compile_model
    compiler.transform.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/transform/profile.py", line 103, in profile
    f.profile(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/conv/conv2d.py", line 567, in profile
    self._profile_static(workdir, devices)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/conv/conv2d.py", line 607, in _profile_static
    best_algo, workspace = self._profile_single_workload(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/ops/conv/conv2d.py", line 466, in _profile_single_workload
    tmp_key = next(iter(self._attrs["op_instance"].keys()))
StopIteration
```

Any help would be much appreciated. Based on my test results, AIT and OneFlow diffusers achieve the best performance on T4. But stable diffusion models are slow on T4, so we really need better performance on A10.

Update: the most recent version of TensorRT outperforms OneFlow. Nevertheless, I am still interested in AIT because of its good interoperability with the PyTorch runtime.

Environment

I am running the tests inside a CUDA 11.7 Docker container. The NVIDIA driver version on the A10/A100 hosts is 470.161.03.

hl475 commented 1 year ago

@terrychenism wondering if you have any insights about the problem with UNet?

terrychenism commented 1 year ago

We don't have plans to support T4/A10 GPUs, but we will support H100.

LiJChang commented 1 year ago

I also encountered this issue on an RTX 4090 when compiling the Stable Diffusion inpainting UNet, which expects 9 input channels. Any update or insight on this issue?

hlky commented 1 year ago

@LiJChang The inpaint UNet requires padding for `conv_in`.

In `__init__`:

```python
in_channels = self.in_channels + (4 - (self.in_channels % 4))
self.conv_in = nn.Conv2dBias(in_channels, block_out_channels[0], 3, 1, 1)
```

In `forward`:

```python
if self.in_channels % 4 != 0:
    channel_pad = self.in_channels + (4 - (self.in_channels % 4))
    sample = ops.pad_last_dim(4, channel_pad)(sample)

sample = self.conv_in(sample)
```

And the weight mapping:

```python
pad_by = 4 - (in_channels % 4)
params_ait["conv_in_weight"] = torch.nn.functional.pad(params_ait["conv_in_weight"], (0, pad_by))
```
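The round-up arithmetic in the snippets above can be isolated as follows; note that the `% 4 != 0` guard in `forward` is what keeps an already-aligned channel count from being padded further (stdlib-only sketch, function name is mine):

```python
def pad_channels(in_channels: int, alignment: int = 4) -> int:
    """Round in_channels up to the next multiple of `alignment`."""
    if in_channels % alignment == 0:
        return in_channels  # the guard: already aligned, no padding
    return in_channels + (alignment - in_channels % alignment)

print(pad_channels(9))  # inpaint UNet conv_in: 9 -> 12
print(pad_channels(4))  # regular UNet: 4 -> 4
```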
LiJChang commented 1 year ago

Thanks for the helpful information @hlky! It worked!

AppleAndBanana commented 1 year ago

> Thanks for the helpful information @hlky! It worked!

Could you please tell me how you solved the 9-input-channels problem? I can't see the info from hlky. Thanks!