facebookincubator / AITemplate

AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. It is specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Apache License 2.0

[Bug] nn.Conv2d is not a generic kernel, it works only for specific numbers of channels. #220

Open bes-dev opened 1 year ago

bes-dev commented 1 year ago

When I try to compile a simple convolution network, the compilation process crashes because conv2d.attrs_['op_instance'] is empty for the convolution layer. How can I fix it?

The behaviour can be reproduced with this script:

import logging
from aitemplate.compiler import ops
from aitemplate.frontend import nn
# AIT utils
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target

class MOCModel(nn.Module):
    def __init__(self, c_in=3, c_out=8):
        super().__init__()
        self.conv = nn.Conv2dBias(c_in, c_out, 3, 1, 1)

    def forward(self, x):
        x = self.conv(x)
        return x

def mark_output(y):
    if type(y) is not tuple:
        y = (y,)
    for i in range(len(y)):
        y[i]._attrs["is_output"] = True
        y[i]._attrs["name"] = "output_%d" % (i)
        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
        print("AIT output_{} shape: {}".format(i, y_shape))

def compile_moc(
        batch_size,
        input_size,
        c_in = 3,
        c_out = 8,
        use_fp16_acc = False,
        convert_conv_to_gemm = False,
        output_dir="./tmp/"
):
    ait_model = MOCModel(c_in, c_out)
    ait_input = Tensor(
        shape=[batch_size, input_size, input_size, c_in],
        name="input0",
        is_input=True,
    )
    ait_model.name_parameter_tensor()

    ait_out = ait_model(ait_input)
    mark_output(ait_out)
    target = detect_target(
        use_fp16_acc=use_fp16_acc,
        convert_conv_to_gemm=convert_conv_to_gemm
    )
    compile_model(
        ait_out,
        target,
        output_dir,
        "moc"
    )

def main():
    logging.getLogger().setLevel(logging.INFO)
    logger = logging.getLogger()
    logger.info("Compile model...")
    compile_moc(
        batch_size=1,
        input_size=256,
        use_fp16_acc=True,
        convert_conv_to_gemm=True,
    )

if __name__ == "__main__":
    main()

Output:

Traceback (most recent call last):
  File "/workspace/work/example/compile_moc.py", line 74, in <module>
    main()
  File "/workspace/work/example/compile_moc.py", line 65, in main
    compile_moc(
  File "/workspace/work/example/compile_moc.py", line 53, in compile_moc
    compile_model(
  File "/opt/conda/lib/python3.10/site-packages/aitemplate/compiler/compiler.py", line 200, in compile_model
    compiler.transform.profile(
  File "/opt/conda/lib/python3.10/site-packages/aitemplate/compiler/transform/profile.py", line 104, in profile
    f.profile(
  File "/opt/conda/lib/python3.10/site-packages/aitemplate/compiler/ops/conv/conv2d.py", line 528, in profile
    self._profile_static(workdir, devices)
  File "/opt/conda/lib/python3.10/site-packages/aitemplate/compiler/ops/conv/conv2d.py", line 572, in _profile_static
    best_algo, workspace = self._profile_single_workload(
  File "/opt/conda/lib/python3.10/site-packages/aitemplate/compiler/ops/conv/conv2d.py", line 433, in _profile_single_workload
    tmp_key = next(iter(self._attrs["op_instance"].keys()))
StopIteration

conv2d.attrs_ state:

{'name': 'conv2d_bias_0', 'depth': 0, 'nop': False, 'inputs': [{ 'check_nan_and_inf': False,
  'check_outputs': False,                                                                                                                                                                     
  'constant_folding_output_idx': None,                                                                                                                                                        
  'data': None,                                                                                                                                                                               
  'depth': 0,                                                                                                                                                                                 
  'dst_ops': ['conv2d_bias_0'],                                                                                                                                                               
  'dtype': 'float16',                                                                                                                                                                         
  'external_tensor': None,                                                                                                                                                                    
  'has_output_aliases': False,                                                                                                                                                                
  'is_input': True,                                                                                                                                                                           
  'is_output': False,                                                                                                                                                                         
  'is_param': False,
  'is_view_of': None,
  'name': 'input0',
  'nop': False,
  'offset': None,
  'shape': [ {'depth': 0, 'name': 'input0_dim_0', 'nop': False, 'values': [1]},
             {'depth': 0, 'name': 'input0_dim_1', 'nop': False, 'values': [256]},
             {'depth': 0, 'name': 'input0_dim_2', 'nop': False, 'values': [256]},
             {'depth': 0, 'name': 'input0_dim_3', 'nop': False, 'values': [3]}],
  'src_ops': [],
  'value': None}, { 'check_nan_and_inf': False,
  'check_outputs': False,
  'constant_folding_output_idx': None,
  'data': None,
  'depth': 0,
  'dst_ops': ['conv2d_bias_0'],
  'dtype': 'float16',
  'external_tensor': None,
  'has_output_aliases': False,
  'is_input': False,
  'is_output': False,
  'is_param': True,
  'is_view_of': None,
  'name': 'conv_weight',
  'nop': False,
  'offset': None,
  'shape': [ {'depth': 0, 'name': 'conv_weight_dim_0', 'nop': False, 'values': [8]},
             {'depth': 0, 'name': 'conv_weight_dim_1', 'nop': False, 'values': [3]},
             {'depth': 0, 'name': 'conv_weight_dim_2', 'nop': False, 'values': [3]},
             {'depth': 0, 'name': 'conv_weight_dim_3', 'nop': False, 'values': [3]}],
  'src_ops': [],
  'value': None}, { 'check_nan_and_inf': False,
  'check_outputs': False,
  'constant_folding_output_idx': None,
  'data': None,
  'depth': 0,
  'dst_ops': ['conv2d_bias_0'],
  'dtype': 'float16',
  'external_tensor': None,
  'has_output_aliases': False,
  'is_input': False,
  'is_output': False,
  'is_param': True,
  'is_view_of': None,
  'name': 'conv_bias',
  'nop': False,
  'offset': None,
  'shape': [ {'depth': 0, 'name': 'conv_bias_dim_0', 'nop': False, 'values': [8]}],
  'src_ops': [],
  'value': None}], 'has_profiler': True, 'op': 'conv2d_bias', 'stride': 1, 'pad': 1, 'dilate': 1, 'group': 1, 'epilogue_alignment': 8, 'epilogue': 'LinearCombination', 'workspace': 0, 'split_k': None, 'CO': 8, 'KH': 3, 'KW': 3, 'exec_path': OrderedDict([('NI == 1 && HI == 256 && WI == 256 && CI == 3', '')]), 'outputs': [{ 'check_nan_and_inf': False,
  'check_outputs': False,
  'constant_folding_output_idx': None,
  'data': None,
  'depth': 1,
  'dst_ops': [],
  'dtype': 'float16',
  'external_tensor': None,
  'has_output_aliases': False,
  'is_input': False,
  'is_output': True,
  'is_param': False,
 'is_view_of': None,
  'name': 'output_0',
  'nop': False,
  'offset': None,
  'shape': [ {'depth': 0, 'name': 'input0_dim_0', 'nop': False, 'values': [1]},
             {'depth': 0, 'name': 'output_0_dim_1', 'nop': False, 'values': [256]},
             {'depth': 0, 'name': 'output_0_dim_2', 'nop': False, 'values': [256]},
             {'depth': 0, 'name': 'output_0_dim_3', 'nop': False, 'values': [8]}],
  'src_ops': ['conv2d_bias_0'],
  'value': None}], 'original_name': 'conv2d_bias_0', 'op_instance': {}}
bes-dev commented 1 year ago

If I use the same script but with the FC model from the AITemplate docs, it works fine:

import logging
from aitemplate.compiler import ops
from aitemplate.frontend import nn
# AIT utils
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target

class AITSimpleModel(nn.Module):
  def __init__(self, hidden, eps: float = 1e-5):
    super().__init__()
    self.dense1 = nn.Linear(hidden, 4 * hidden, specialization="fast_gelu")
    self.dense2 = nn.Linear(4 * hidden, hidden)
    self.layernorm = nn.LayerNorm(hidden, eps=eps)

  def forward(self, input):
    hidden_states = self.dense1(input)
    hidden_states = self.dense2(hidden_states)
    hidden_states = hidden_states + input
    hidden_states = self.layernorm(hidden_states)
    return hidden_states

def mark_output(y):
    if type(y) is not tuple:
        y = (y,)
    for i in range(len(y)):
        y[i]._attrs["is_output"] = True
        y[i]._attrs["name"] = "output_%d" % (i)
        y_shape = [d._attrs["values"][0] for d in y[i]._attrs["shape"]]
        print("AIT output_{} shape: {}".format(i, y_shape))

def compile_moc(
        batch_size,
        input_size,
        c_in = 3,
        c_out = 8,
        use_fp16_acc = False,
        convert_conv_to_gemm = False,
        output_dir="./tmp/"
):
    ait_model = AITSimpleModel(32)
    ait_input = Tensor(
        shape=[1, 32],
        name="input0",
        is_input=True,
    )
    ait_model.name_parameter_tensor()

    ait_out = ait_model(ait_input)
    mark_output(ait_out)
    target = detect_target(
        use_fp16_acc=use_fp16_acc,
        convert_conv_to_gemm=convert_conv_to_gemm
    )
    compile_model(
        ait_out,
        target,
        output_dir,
        "moc"
    )

def main():
    logging.getLogger().setLevel(logging.INFO)
    logger = logging.getLogger()
    logger.info("Compile model...")
    compile_moc(
        batch_size=1,
        input_size=256,
        use_fp16_acc=True,
        convert_conv_to_gemm=True,
    )

if __name__ == "__main__":
    main()
bes-dev commented 1 year ago

Omg, I investigated this issue. The problem is related to the input shape of the tensor. Here we try to find a CUDA kernel for the Conv2d operation that matches the input/output channels. If there is no suitable kernel for our number of channels (for example, a small convolution with c_in=3, c_out=8), this function returns an empty op_instance dict, so we don't have any kernel that we can apply to our layer and the compilation process fails. It looks like a bug in AITemplate, because we need a default implementation for the convolution kernel!
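
A minimal sketch of just the failing pattern, independent of AITemplate and based only on the traceback above:

# _profile_single_workload effectively does this with the op's kernel map
op_instance = {}  # empty because no kernel matched c_in=3 / c_out=8
tmp_key = next(iter(op_instance.keys()))  # raises StopIteration on an empty dict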

hl475 commented 1 year ago

Thanks for reporting! @aakhundov can you please help take a look when you get a chance? (You added conv_common.py; apologies if I'm tagging the wrong person.)

aakhundov commented 1 year ago

@bes-dev Thank you for reporting and investigating the issue! I've reproduced it, and can confirm that your initial example with nn.Conv2dBias indeed results in that particular error. We'll look further into this and provide more details.

bes-dev commented 1 year ago
import logging
from aitemplate.compiler import ops
from aitemplate.frontend import nn
# AIT utils
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target

class MOCModel(nn.Module):
    def __init__(self, c_in=3, c_out=8):
        super().__init__()
        self.conv = nn.Conv2dBias(c_in, c_out, 3, 1, 1)

    def forward(self, x):
        x = self.conv(x)
        return x

I found that it can be rewritten like this:

from aitemplate.frontend import nn

class MOCModel(nn.Module):
    def __init__(self, c_in=3, c_out=8):
        super().__init__()
        self.conv = nn.Conv2dBiasFewChannels(c_in, c_out, 3, 1, 1)

    def forward(self, x):
        x = self.conv(x)
        return x

model = MOCModel(3, 64)

It works for me, but the weight tensor of model.conv has an unexpected shape of [64, 3, 3, 4] instead of [64, 3, 3, 3], so we need to pad the weights of the source model during weight mapping.
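
A minimal sketch of that weight-padding step on the PyTorch side, assuming the AIT weight layout is [C_out, KH, KW, C_in] (as the [64, 3, 3, 4] shape suggests) and that the extra input channel can be zero-filled; the helper name is hypothetical:

import torch

def pad_conv_weight_for_ait(w_oihw: torch.Tensor, target_cin: int = 4) -> torch.Tensor:
    # Zero-pad a PyTorch [C_out, C_in, KH, KW] weight to target_cin input
    # channels, then permute to the [C_out, KH, KW, C_in] layout seen above.
    c_out, c_in, kh, kw = w_oihw.shape
    padded = torch.zeros(c_out, target_cin, kh, kw, dtype=w_oihw.dtype)
    padded[:, :c_in] = w_oihw
    return padded.permute(0, 2, 3, 1).contiguous()

# e.g. a [64, 3, 3, 3] PyTorch weight becomes a [64, 3, 3, 4] AIT weight
w_ait = pad_conv_weight_for_ait(torch.randn(64, 3, 3, 3))
assert tuple(w_ait.shape) == (64, 3, 3, 4)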

aakhundov commented 1 year ago

@bes-dev So the story goes like this. By default, AIT assumes a c_in of 4 or 8 so that it can rely on the higher-performing configurations of the kernels backing the conv ops' implementation.

As you've noticed, it is possible to use c_in=3 with nn.Conv2dBiasFewChannels, as it uses the common_conv2d_few_channels op under the hood, which sets a specific kernel configuration for ch_in=3 here.

However, as you've also noticed, by default even nn.Conv2dBiasFewChannels pads the weights to 4 channels (here), again to improve performance. If you want to avoid that, you can set auto_padding=False in the nn.Conv2dBiasFewChannels constructor. The padding then won't happen and you'll get [64, 3, 3, 3]-shaped weights, but it comes at the cost of (potentially) worse performance.
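
Based on that description, the unpadded variant would look roughly like this, reusing the constructor arguments from the earlier snippet:

from aitemplate.frontend import nn

# Same arguments as in the snippet above, but with auto_padding disabled so the
# weight keeps its [64, 3, 3, 3] shape (at a potential performance cost).
conv = nn.Conv2dBiasFewChannels(3, 64, 3, 1, 1, auto_padding=False)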

fatihcelikbas commented 1 year ago

Are there any solutions being developed for this at the moment? I am trying to optimize Stable Diffusion inpainting with AIT, but the input channel count is 9, and I am getting the same StopIteration issue.

j3698w commented 1 year ago

^ bump, I have also come across this issue. I ended up padding the input and weights so that a suitable kernel could be found.
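
For illustration only, the input-padding part of that workaround could look like this on the PyTorch side, assuming NHWC inputs and zero padding of the channel dimension:

import torch
import torch.nn.functional as F

# Zero-pad an NHWC fp16 input from 3 to 4 channels so it matches weights that
# were padded the same way.
x_nhwc = torch.randn(1, 256, 256, 3).half()
x_padded = F.pad(x_nhwc, (0, 1))  # pad the last (channel) dimension: 3 -> 4
assert x_padded.shape == (1, 256, 256, 4)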

delldu commented 10 months ago

examples/08_esrgan, same situation:

$ python compile.py --model-path RealESRGAN_x4plus.pth
  File "/media//8t/Workspace/study/AITemplate/python/aitemplate/backend/cuda/conv2d/common.py", line 805, in gen_function
    emitted_instance = f_emit_instance(op_instance[value])
KeyError: ''

delldu commented 10 months ago

examples/01_resnet-50:

$ python infer_with_torch.py
  File "/media//8t/Workspace/study/AITemplate/python/aitemplate/backend/cuda/conv2d/common.py", line 805, in gen_function
    emitted_instance = f_emit_instance(op_instance[value])
KeyError: ''

delldu commented 10 months ago

examples/01_resnet-50:

$ python benchmark_ait.py
  File "/media/8t/Workspace/study/AITemplate/python/aitemplate/backend/cuda/conv2d/common.py", line 805, in gen_function
    emitted_instance = f_emit_instance(op_instance[value])
KeyError: ''