Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

Alveo U280 "Error: CU timeout" with PointPillars example #780

Closed · simonschoening closed this 2 years ago

simonschoening commented 2 years ago

I am trying to get the PointPillars example (demo/Vitis-AI-Library/samples/pointpillars) working on the Alveo U280 and Vitis AI 1.4.1.

It seems to me that the model and library were not originally designed for the DPUCAHX8H or DPUCAHX8L. The library expects the DPU to support the PFN's max pool operation with a kernel size of 100, which neither the DPUCAHX8H nor the DPUCAHX8L does.

Therefore I modified pt_pointpillars_kitti_12000_100_10.8G_1.4/code/test/models/pointpillars.py to use three smaller MaxPool2d layers instead (5 × 5 × 4 = 100), and quantized and compiled the model for the DPUCAHX8H:

import torch
from torch import nn

# change_default_args and Empty are helpers defined elsewhere in pointpillars.py

class PFNLayer(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 use_norm=True,
                 last_layer=False):
        """
        Pillar Feature Net Layer.
        The Pillar Feature Net could be composed of a series of these layers, but the PointPillars paper results only
        used a single PFNLayer. This layer performs a similar role as second.pytorch.voxelnet.VFELayer.
        :param in_channels: <int>. Number of input channels.
        :param out_channels: <int>. Number of output channels.
        :param use_norm: <bool>. Whether to include BatchNorm.
        :param last_layer: <bool>. If last_layer, there is no concatenation of features.
        """

        super().__init__()
        self.name = 'PFNLayer'
        self.last_vfe = last_layer
        if not self.last_vfe:
            out_channels = out_channels // 2
        self.units = out_channels

        # The Linear/BatchNorm1d variants below are retained from the original
        # code but are superseded by the Conv2d/BatchNorm2d pair further down.
        if use_norm:
            BatchNorm1d = change_default_args(eps=1e-3, momentum=0.01)(nn.BatchNorm1d)
            Linear = change_default_args(bias=False)(nn.Linear)
        else:
            BatchNorm1d = Empty
            Linear = change_default_args(bias=True)(nn.Linear)
        self.conv = nn.Conv2d(in_channels, self.units, 1, bias=False)
        self.bn = nn.BatchNorm2d(self.units, eps=1e-03, momentum=0.01)

        self.relu = nn.ReLU()
        # The DPU cannot implement the original 100-wide max reduction, so it
        # is decomposed into three cascaded max pools: 5 * 5 * 4 = 100.
        # self.max = functional.Max()
        self.pool1 = nn.MaxPool2d((1, 5), stride=(1, 5))
        self.pool2 = nn.MaxPool2d((1, 5), stride=(1, 5))
        self.pool3 = nn.MaxPool2d((1, 4), stride=(1, 4))

    def forward(self, inputs):

        # x = self.linear(inputs)
        x = self.conv(inputs)
        # x = self.norm(x.permute(0, 2, 1).contiguous()).permute(0, 2, 1).contiguous()
        x = self.bn(x)
        #x = F.relu(x)
        x = self.relu(x)

        # Original reduction (not supported by the DPU):
        # x_max = torch.max(x, dim=3, keepdim=True)[0]
        x = self.pool1(x)
        x = self.pool2(x)
        x_max = self.pool3(x)

        if self.last_vfe:
            return x_max
        else:
            # x_repeat = x_max.repeat(1, inputs.shape[1], 1)
            # x_concatenated = torch.cat([x, x_repeat], dim=2)
            x_repeat = x_max.repeat(1, 1, 1, inputs.shape[1])
            x_concatenated = torch.cat([x, x_repeat], dim=1)
            return x_concatenated
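
Since max is associative, the cascaded pools reproduce the original reduction exactly as long as the kernel sizes tile the point axis (5 × 5 × 4 = 100). A minimal standalone sanity check, assuming the kitti_12000_100 layout of (batch, channels, pillars, points) and a reduced pillar count to keep it quick:

import torch
from torch import nn

# Dummy PFN feature map: 1 batch, 64 channels, 1000 pillars, 100 points per pillar
x = torch.randn(1, 64, 1000, 100)

# Original reduction: a single max over the 100-point axis
single = torch.max(x, dim=3, keepdim=True)[0]

# Replacement: three cascaded max pools, width 100 -> 20 -> 4 -> 1
cascade = nn.MaxPool2d((1, 4), stride=(1, 4))(
    nn.MaxPool2d((1, 5), stride=(1, 5))(
        nn.MaxPool2d((1, 5), stride=(1, 5))(x)))

assert torch.equal(single, cascade)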

Unfortunately, the kernel size (1, 5) is only supported by the DPUCAHX8H, not by the DPUCAHX8L.

To my surprise, test_bin_pointpillars works just fine with this modified model.

However, test_performance_pointpillars crashes after a while (the time to crash seems to depend roughly on the number of threads) and generates the following error:

(vitis-ai-pytorch) Vitis-AI /workspace/demo/Vitis-AI-Library/samples/pointpillars > env XLNX_POINTPILLARS_PRE_MT=2 ./test_performance_pointpillars pointpillars_kitti_12000_0_pt  pointpillars_kitti_12000_1_pt  -t 16 -s 3000 test_performance_pointpillars.list
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0502 19:10:22.264176 21099 test_performance_pointpillars.cpp:264] writing report to <STDOUT>
I0502 19:10:34.336845 21099 test_performance_pointpillars.cpp:291] waiting for 0/3000 seconds, 16 threads running
I0502 19:10:44.336935 21099 test_performance_pointpillars.cpp:291] waiting for 10/3000 seconds, 16 threads running
Error: CU timeout 
LOAD START:16136
LOAD END  :16136
SAVE START:16096
SAVE END  :16096
CONV START:6001
CONV END  :6001
MISC START:16098
MISC END  :16098
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: CU timeout 1
Aborted (core dumped)

Can someone explain what I did wrong? Or does someone have a better idea of how to get the example working on the Alveo U280?

simonschoening commented 2 years ago

Why was this incorrectly referenced?

SuZhan0322 commented 2 years ago

Hi @simonschoening, sorry for the late reply. Would you mind sharing the quantized and compiled xmodel?

simonschoening commented 2 years ago

Lowering the clock frequency of the DPU seems to fix the issue.

Using a different DPU configuration (one DPU with only a single engine instead of the 5+5+4 configuration shipped by Xilinx) also seems to be stable. To generate such a bitstream, kernel_DPUCAHX8H_1ENGINE.xml needs to be fixed (#827).

To get the two .xmodel files, I downloaded the model source code from the model zoo and compiled the already-trained models (also provided as .xmodel) with the DPUCAHX8H as target.
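For anyone reproducing this: the compiler invocation for this target is typically along the lines of vai_c_xir -x quantized.xmodel -a <arch.json for the DPUCAHX8H on U280> -o <output dir> -n <model name>; the exact arch.json path depends on the Vitis AI installation.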

I think the PointPillars library currently only supports a batch size of one. Is this correct? If so, modifications are necessary to fully utilize the DPUCAHX8H architecture.
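
For what it's worth, the batch size a compiled xmodel actually expects can be inspected with the VART Python API. A rough sketch (the filename is a placeholder for one of the compiled models):

import xir
import vart

# Placeholder path; substitute one of the compiled xmodels
graph = xir.Graph.deserialize("pointpillars_kitti_12000_0_pt.xmodel")

# Pick the DPU subgraph(s) out of the compiled graph
dpu_subgraphs = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
                 if s.has_attr("device") and s.get_attr("device") == "DPU"]

runner = vart.Runner.create_runner(dpu_subgraphs[0], "run")
for t in runner.get_input_tensors():
    print(t.name, t.dims)  # dims[0] is the batch size the DPU processes per request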

SuZhan0322 commented 2 years ago

Hi @simonschoening, thanks for your reply. I think your approach of lowering the DPU clock frequency is correct. Due to the power limitation of the U280 Alveo card, CNN models cannot always run at the highest frequencies, so scaling down the frequency is sometimes necessary. Whether the PointPillars library can run successfully is not related to the number of DPU engines, but to whether the DPU supports the required operators. Reducing the number of engines in the DPU does, however, reduce the power consumption while the model is running.

SuZhan0322 commented 2 years ago

Hi @simonschoening, is there any update on this issue? Thanks!

SuZhan0322 commented 2 years ago

Hi @simonschoening, please reopen this issue if you have any other questions. Thanks!