handoku opened this issue 1 year ago
@terrychenism wondering if you have any insights about the problem with UNet?
We don't have plans to support T4/A10 GPUs, but we will support H100.
I also encounter this issue on an RTX 4090 when compiling the Stable Diffusion inpainting UNet, which expects 9 input channels. Any update or insight on this issue?
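For context on where the 9 channels come from: the inpainting UNet's input is a concatenation of three tensors along the channel dimension. A minimal sketch in plain PyTorch (shapes are illustrative, not taken from this thread):

```python
import torch

# The inpainting UNet's conv_in sees 9 channels:
#   4 noisy latents + 1 downsampled mask + 4 masked-image latents.
latents = torch.randn(1, 4, 64, 64)                # noisy latents
mask = torch.randn(1, 1, 64, 64)                   # downsampled inpainting mask
masked_image_latents = torch.randn(1, 4, 64, 64)   # VAE-encoded masked image

unet_input = torch.cat([latents, mask, masked_image_latents], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])
```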
@LiJChang The inpaint UNet requires padding for `conv_in`.

In `__init__`:

```python
in_channels = self.in_channels + (4 - (self.in_channels % 4))
self.conv_in = nn.Conv2dBias(in_channels, block_out_channels[0], 3, 1, 1)
```

In `forward`:

```python
if self.in_channels % 4 != 0:
    channel_pad = self.in_channels + (4 - (self.in_channels % 4))
    sample = ops.pad_last_dim(4, channel_pad)(sample)
sample = self.conv_in(sample)
```

and in the weight mapping:

```python
pad_by = 4 - (in_channels % 4)
params_ait["conv_in_weight"] = torch.nn.functional.pad(
    params_ait["conv_in_weight"], (0, pad_by)
)
```
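The padding trick can be sanity-checked in plain PyTorch. Note this sketch uses NCHW layout, so the channel dim is dim 1; the AIT weight mapping above pads the *last* dim instead because AIT stores conv weights channels-last:

```python
import torch
import torch.nn.functional as F

# Why padding is safe: adding zero-filled input channels to both the
# activations and the conv_in weight leaves the convolution output unchanged.
in_channels = 9                         # inpaint UNet conv_in input channels
pad_by = 4 - (in_channels % 4)          # pad 9 -> 12 (next multiple of 4)

weight = torch.randn(320, in_channels, 3, 3)   # (out, in, kH, kW)
x = torch.randn(1, in_channels, 64, 64)

# F.pad pads trailing dims first: (W_l, W_r, H_t, H_b, C_front, C_back)
weight_padded = F.pad(weight, (0, 0, 0, 0, 0, pad_by))  # -> (320, 12, 3, 3)
x_padded = F.pad(x, (0, 0, 0, 0, 0, pad_by))            # -> (1, 12, 64, 64)

out = F.conv2d(x, weight, padding=1)
out_padded = F.conv2d(x_padded, weight_padded, padding=1)
assert torch.allclose(out, out_padded, atol=1e-5)
```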
Thanks for the helpful information @hlky! It worked!
Could you please tell me how you solved this 9-input-channels problem? I can't see the info from hlky. Thanks!
I am doing benchmark tests for UNet with AIT on A100/A10/T4, etc. The tests on T4 have finished and it works well. However, the build process stopped during the profiling procedure; the logs are as follows:
I have no idea what happened. Does it mean that although the build procedure finished, it just cannot find an implementation for this conv2d op instance? How can I fix it?
After adding `return "86"` for A10 in the `_detect_cuda` function, I get this error:

```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 61, in <module>
    compile_unet(
  File "/root/AITemplate/examples/05_stable_diffusion/src/compile_lib/compile_unet.py", line 93, in compile_unet
    compile_model(Y, target, tmp_dir, "UNet2DConditionModel", constants=params_ait)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 222, in compile_model
    with target:
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/backend/cuda/target_def.py", line 156, in __enter__
    self._operators = f_gen_ops(self._arch)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/backend/cuda/utils.py", line 60, in gen_ops
    raise NotImplementedError(
NotImplementedError: Arch 86 is not supported by current cutlass lib.
```
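The arch string the error complains about is derived from the GPU's compute capability. A minimal sketch of that mapping (the helper name below is hypothetical, not AITemplate's actual API):

```python
# Hypothetical helper (not AITemplate code) mirroring what a
# _detect_cuda-style routine produces: a (major, minor) compute capability
# mapped to a cutlass arch string. An A10 / RTX 30xx card reports (8, 6),
# i.e. "86"; whether cutlass then has kernels for that arch depends on the
# cutlass version the installed AITemplate ships with.
def cutlass_arch(capability):
    major, minor = capability
    return f"{major}{minor}"

print(cutlass_arch((8, 6)))  # "86" (Ampere: A10, RTX 3090)
print(cutlass_arch((8, 0)))  # "80" (A100)
```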
Any help would be greatly appreciated. Based on my test results, AIT and the OneFlow diffusers achieve the best performance on T4. But Stable Diffusion models are slow on T4, so better performance on A10 is really needed.

Updated: the most recent version of TensorRT outperforms OneFlow. Nevertheless, I am still interested in AIT because of its good interoperability with the PyTorch runtime.
Environment
I am testing within a CUDA 11.7 docker container. The A10/A100 host NVIDIA driver version is 470.161.03.