Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

(vai-c-xir) Tensor dimensions change when compiling #974

Closed AdrFebles closed 1 year ago

AdrFebles commented 2 years ago

Hello! I've trained a model in PyTorch with input tensor dimensions (1, 1, 4, 6), where the order is (Batch, Channels, Height, Width). To quantize the model I generate a random input with these dimensions and define the quantizer as follows:

input = torch.rand(1, 1, 4, 6)
quantizer = torch_quantizer(quant_mode, model, input, device=device)
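
For context, here is a minimal sketch of the surrounding vai_q_pytorch calibrate-then-export flow described in the Vitis AI user guide; model, device and evaluate() are placeholders standing in for this issue's actual model and evaluation loop:

# Hypothetical sketch of the two-pass vai_q_pytorch flow.
# model, device and evaluate() are placeholders, not code from this issue.
import torch
from pytorch_nndct.apis import torch_quantizer

dummy_input = torch.rand(1, 1, 4, 6)             # B, C, H, W, as used for training

# Pass 1: calibration
quantizer = torch_quantizer("calib", model, dummy_input, device=device)
quant_model = quantizer.quant_model
evaluate(quant_model)                            # forward real data through the quantized model
quantizer.export_quant_config()

# Pass 2: test / export
quantizer = torch_quantizer("test", model, dummy_input, device=device)
quant_model = quantizer.quant_model
evaluate(quant_model)                            # a forward pass is needed before export
quantizer.export_xmodel(deploy_check=False)      # the exported xmodel is then compiled with vai_c_xir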

When I test the quantized model there is no problem with it, but when I compile the model and inspect the subgraph.png file I see that the order of the tensor dimensions has been switched: the compiled model expects an input of size (1, 4, 6, 1), with order (Batch, Height, Width, Channels). This produces abnormal behavior when the model runs on the ZCU102 board: a loss of 163, whereas the float model has a loss of about 0.001.

(attached image: cnn_aut_zcu102_xmodel subgraph)

I have the runtime function defined as follows:

import time
import logging

import numpy as np
import torch
import torch.nn as nn

logger = logging.getLogger(__name__)


def runCNNautoenc(runner: "Runner", img, cnt, img_height, batch, n_features):
    """Get tensor metadata from the DPU runner."""
    batch_size = 1
    inputTensors = runner.get_input_tensors()
    outputTensors = runner.get_output_tensors()
    input_ndim = tuple(inputTensors[0].dims)
    pre_output_size = int(outputTensors[0].get_data_size() / input_ndim[0])
    output_fixpos = outputTensors[0].get_attr("fix_point")
    output_ndim = tuple(outputTensors[0].dims)
    output_scale = 1 / (2 ** output_fixpos)
    n_of_images = len(img)
    img = img.unsqueeze(0)
    count = 0
    while count < cnt:
        runSize = 1
        """Prepare batch input/output buffers."""
        inputData = [np.empty(input_ndim, dtype=np.int8, order="C")]
        outputData = [np.empty(output_ndim, dtype=np.int8, order="C")]
        """Copy the input image into the input buffer."""
        imageRun = inputData[0]
        imageRun[0, ...] = img[(count + 0) % n_of_images].reshape(input_ndim[1:])

        time_start = time.time()
        job_id = runner.execute_async(inputData, outputData)
        runner.wait(job_id)
        time_end = time.time()
        timetotal = time_end - time_start
        FPS = 1 / timetotal
        # print('Total Time Execute_async:', timetotal, 'seconds', 'FPS:', FPS)
        """Softmax & TopK calculated with batch."""
        """Benchmark DPU FPS performance over the Vitis AI APIs execute_async() and wait()."""
        """Uncomment the following snippet to include the softmax calculation in the model's end-to-end FPS evaluation."""
        # for j in range(runSize):
        #     softmax = CPUCalcSoftmax(outputData[0], pre_output_size, output_scale)
        #     TopK(softmax, pre_output_size, "./words.txt")

        count = count + runSize
        out_result = torch.Tensor(outputData[0])
        # Test the model: compare the DPU output against the input with MSE.
        criterion = nn.MSELoss()
        with torch.no_grad():
            reference = torch.Tensor(imageRun)
            loss = criterion(out_result, reference)
            anomaly = 1 if loss > 0.017 else 0
        logger.debug(f"Loss={loss}")
        logger.debug(f"Anomaly={anomaly}")

Could you please help me understand this behavior?
Thanks

AdrFebles commented 2 years ago

Hi! I have seen in the user guide that the DPU works with the BHWC format, but I trained the network with BCHW tensor order. Is this wrong?

MichaelX99 commented 2 years ago

Every model I've tried to quantize has complained during the verify_xmodel method performed at the end that the pytorch_nndct model output shape is not the same as the XIR output shape.

I even tried the example resnet18_quant files and the error persists.

THIS IS A HUGE BUG since PyTorch is ONLY B, C, H, W and XIR is ONLY B, H, W, C! Note that I've only attempted quantizing and compiling on the CPU.

MichaelX99 commented 2 years ago

I did a bit of digging and this does not occur in 1.4.1.978 (git hash 9f3d6db). It seems to be something in the nndct_shared deploy_optimizer. In the 2.x releases, many optimizations are called over the nndct_graph, and specifically there is a function called layout_transform which does NOT change the format.

This is in stark contrast to the 1.4.1 version of get_deploy_graph_list, which calls completely different graph optimizations.

michael-person commented 2 years ago

So the solution to my problem was running data through the quantized_model during the xmodel deployment stage as well as the calibration phase... Some documentation and less finicky tools would be a great addition...

@AdrFebles are you reshaping the data to B, H, W, C before you pass it to the VART code executed in runCNNautoenc? Or are you keeping it in the PyTorch-native B, C, H, W? I believe that you will need to do everything during the quantization and compilation stages using data in B, C, H, W, but then during deployment with the final xmodel use B, H, W, C.
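
A minimal sketch of that layout change, assuming a (B, C, H, W) PyTorch tensor and the (1, 4, 6, 1) input shape reported by the DPU runner in this issue; the variable names are illustrative, not taken from the code above:

# Hypothetical example: convert a (B, C, H, W) PyTorch tensor into the
# (B, H, W, C) layout that the compiled xmodel / DPU runner expects.
import numpy as np
import torch

x_nchw = torch.rand(1, 1, 4, 6)                   # B, C, H, W (training/quantization layout)
x_nhwc = x_nchw.permute(0, 2, 3, 1).contiguous()  # B, H, W, C (layout of the compiled xmodel)

input_ndim = (1, 4, 6, 1)                         # tuple(inputTensors[0].dims) at runtime
assert tuple(x_nhwc.shape) == input_ndim

# Copy into the VART int8 input buffer (a real application would also apply
# the input tensor's fix-point scale before the implicit cast to int8).
input_buffer = np.empty(input_ndim, dtype=np.int8, order="C")
input_buffer[0, ...] = x_nhwc.numpy()[0]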

AdrFebles commented 2 years ago


Hi @michael-person! Yes, I followed the steps of this tutorial, which helped me understand how to deploy the model on the board: https://github.com/Xilinx/Vitis-AI-Tutorials/tree/1.4/Design_Tutorials/11-tf2_var_autoenc They train and quantize the model in (B, C, H, W) format, but in the runtime application it is necessary to reshape the input to the dimensions reported by the DPU runner.

michael-person commented 2 years ago

Hmm, I'm wondering if the increase in loss you're seeing when moving from the float version to the quantized version has to do with an output shape/data type issue rather than a data layout issue. If you're reshaping the input for the VART inference call, then I think you're ok.

Can you verify that the shapes of out and img are the same? Also, you may need to scale out by the output_scale value; you can see how it's used in the CPUCalcSoftmax method here.
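
For reference, a rough sketch of how such a fix-point scale is typically applied before a CPU softmax; this illustrates the idea only and is not the actual source of CPUCalcSoftmax:

import numpy as np

def cpu_softmax_with_scale(int8_output, size, scale):
    # Convert the raw int8 DPU output to float, apply the fix-point scale,
    # then compute a numerically stable softmax.
    x = np.asarray(int8_output, dtype=np.float32).reshape(-1)[:size] * scale
    e = np.exp(x - np.max(x))
    return e / np.sum(e)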

AdrFebles commented 2 years ago

Thank you @michael-person

qianglin-xlnx commented 1 year ago

Closing since no activity for more than 3 months, please open a new issue if you still have any questions, thanks.