Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

ERROR at pytorch quantize calibration step #449

Closed. HSqure closed this issue 2 years ago.

HSqure commented 3 years ago

Hello developers! I used ultralytics' yolov3 demo (PyTorch v1.4) to train the model and successfully ran the evaluation program in the latest Vitis-AI GPU docker. (Screenshot from 2021-06-15 17-20-26)

But when I try to quantize the model, it shows the message:

    [NNDCT_NOTE]: Quantization calibration process start up...
    [NNDCT_NOTE]: =>Quant Module is in 'cuda'.
    [NNDCT_NOTE]: =>Parsing Model...
    aten_op 'meshgrid' parse failed(unsupported)

And then I found out where the issue comes from (Screenshot from 2021-06-15 17-27-18):

    @staticmethod
    def _make_grid(nx=20, ny=20):
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()

I replaced it with individual torch tensor operations:

    @staticmethod
    def _make_grid(nx=20, ny=20):
        y = torch.arange(ny)
        x = torch.arange(nx)
        yv = []
        xv = []

        # Build the grid row by row with torch.cat instead of torch.meshgrid.
        for cnt, item in enumerate(y):
            if cnt == 0:
                yv = torch.full((1, nx), item)
                xv = x.view(1, nx)
            else:
                yv = torch.cat((yv, torch.full((1, nx), item)), 0)
                xv = torch.cat((xv, x.view(1, nx)), 0)

        return torch.stack((xv, yv.long()), 2).view((1, 1, ny, nx, 2)).float()

But it shows a new error message. (Screenshot from 2021-06-15 17-08-03)

I can't even find where this ImplicitTensorToNum comes from in my project! Anyway, I have no idea how to fix this problem. Looking for some help, thanks!
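For anyone hitting the same trace error: ImplicitTensorToNum most likely comes from the loop above, where torch.full((1, nx), item) receives the 0-dim tensor item, so tracing has to turn a tensor into a Python number. A minimal sketch of a loop-free grid that avoids both meshgrid and that conversion (whether the NNDCT parser accepts expand is untested here):

    import torch

    def _make_grid(nx=20, ny=20):
        # Broadcasting instead of meshgrid or a Python loop: nothing is
        # converted from a tensor to a Python number at trace time.
        yv = torch.arange(ny).view(ny, 1).expand(ny, nx)
        xv = torch.arange(nx).view(1, nx).expand(ny, nx)
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()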

Fabioni commented 3 years ago

Hey @HSqure, I have the same problem. I can't give you the solution but I can explain the problem :D The docs say the model has to pass the torch.jit.trace test, and that test fails in the make_grid function. The problem is that this function simply is not traceable.
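The failure is easy to reproduce outside the quantizer. A quick check along these lines (model stands in for your loaded network, and the input shape is just an example) shows whether a model survives tracing at all:

    import torch

    # If torch.jit.trace fails on the model, the NNDCT parser will too.
    dummy = torch.randn(1, 3, 416, 416)
    traced = torch.jit.trace(model, dummy)  # raises inside _make_grid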

I'm also interested in how Xilinx did it in the provided yolo models.

My idea is the following: if you look at where the make_grid function is used, it's only used in the three layers just called "yolo" (have a look at https://netron.app/?url=https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov3.cfg).

So this is really just post-processing and not part of the network itself, I think; we could remove it from quantization and put it back afterwards.

But I haven't been able to try that idea yet.
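In code, the idea might look roughly like this. This is only a sketch, assuming an ultralytics-style Detect head whose export flag makes it return the raw convolution maps before any grid decoding; adjust it to whatever your fork actually provides:

    import torch.nn as nn

    class QuantWrapper(nn.Module):
        """Quantize only the convolutional part; decode boxes afterwards."""
        def __init__(self, full_model):
            super().__init__()
            # Assumption: the last module is the Detect layer and its
            # `export` flag skips _make_grid and returns raw maps.
            full_model.model[-1].export = True
            self.model = full_model

        def forward(self, x):
            return self.model(x)  # raw per-scale feature maps only

    # After inference on the quantized model, run _make_grid and the box
    # decoding in plain Python on the returned maps.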

HSqure commented 3 years ago

Reply @Fabioni: Thanks for your detailed analysis! I'm curious how Xilinx fixed that problem too. Anyway, it's a really cool idea. I'll try it, thanks!

Fabioni commented 3 years ago

> Thanks for your detailed analysis! I'm curious how Xilinx fixed that problem too. Anyway, it's a really cool idea. I'll try it, thanks!

Let me know about your findings :-)

I just found this tutorial by Xilinx: https://github.com/Xilinx/Vitis-Tutorials/tree/master/Machine_Learning/Design_Tutorials/07-yolov4-tutorial

I think the same make_grid problem should be present in yolov4 (I actually hit it with yolov4-scaled), so the tutorial should cover it somehow, but I couldn't find anything about it 😬

Fabioni commented 3 years ago

@HSqure I got it to work the way I proposed. I don't know if you'd call it dirty, but it works 😅

HSqure commented 3 years ago

@Fabioni Me too! I separated out the post-processing part and it works, and I also passed the DPU compilation step. But unfortunately there's a new problem: when I run the edge app with the .xmodel, VART says there are 75 subgraphs (too many). (screenshot: 75 subgraphs)

Fabioni commented 3 years ago

@HSqure Unfortunately I'm still struggling to even get a .xmodel file after quantization, so I'm a step behind you and can't reproduce your problem right now.

How do you get the .xmodel? The export_xmodel() function only produces a .py file when I call it. I read that I first have to run quantization with "calib", then run inference once with "test", and only then can I export the xmodel. Is that the right way? How does it work exactly?


HSqure commented 3 years ago

> @HSqure Unfortunately I'm still struggling to even get a .xmodel file after quantization, so I'm a step behind you and can't reproduce your problem right now.
>
> How do you get the .xmodel? The export_xmodel() function only produces a .py file when I call it. I read that I first have to run quantization with "calib", then run inference once with "test", and only then can I export the xmodel. Is that the right way? How does it work exactly?

You have to run calib mode first. Then run the combo of "test" + --deploy. (screenshot)
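For reference, the two-pass flow sketched against the vai_q_pytorch API. Here model, device, evaluate and val_loader stand in for your own network, device, eval routine and data; in the scripts discussed here, a --deploy flag typically guards the export call:

    import torch
    from pytorch_nndct.apis import torch_quantizer

    dummy = torch.randn(1, 3, 416, 416)

    # Pass 1: "calib" collects activation statistics while the normal
    # eval loop runs, then writes the quantization config.
    quantizer = torch_quantizer("calib", model, (dummy,), device=device)
    evaluate(quantizer.quant_model, val_loader)
    quantizer.export_quant_config()

    # Pass 2: "test" evaluates with the fixed quantization, after which
    # the deployable model can be exported (the "test + --deploy" combo).
    quantizer = torch_quantizer("test", model, (dummy,), device=device)
    evaluate(quantizer.quant_model, val_loader)
    quantizer.export_xmodel(deploy_check=False)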

Fabioni commented 3 years ago

What does your quant.py look like?

HSqure commented 3 years ago

> What does your quant.py look like?

I've just uploaded the entire project (except the model files, which are over 25 MB); come have a look.

Fabioni commented 3 years ago

> @Fabioni Me too! I separated out the post-processing part and it works, and I also passed the DPU compilation step. But unfortunately there's a new problem: when I run the edge app with the .xmodel, VART says there are 75 subgraphs (too many).

So now I am having a similar problem. My compiled xmodel consists of 115 subgraphs, all of which have get_attr("device") == "DPU". Why is it not just one subgraph?
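One way to see the split, assuming the compiled file is at compiled.xmodel, is to walk the subgraphs with the xir Python bindings and print each one's device assignment:

    import xir

    graph = xir.Graph.deserialize("compiled.xmodel")
    subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
    for sg in subgraphs:
        device = sg.get_attr("device") if sg.has_attr("device") else "unknown"
        print(sg.get_name(), device)
    print(len(subgraphs), "child subgraphs in total")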

HSqure commented 3 years ago

> > @Fabioni Me too! I separated out the post-processing part and it works, and I also passed the DPU compilation step. But unfortunately there's a new problem: when I run the edge app with the .xmodel, VART says there are 75 subgraphs (too many).
>
> So now I am having a similar problem. My compiled xmodel consists of 115 subgraphs, all of which have get_attr("device") == "DPU". Why is it not just one subgraph?

After comparing with Xilinx's yolov3 .xmodel (deploy), I think the problem is in the leaky_relu ops: they make that part of the computation run on the CPU. Normally a leaky_relu op should be fused into the conv module, but in my .xmodel the leaky_relu ops are clearly separate.


Comparing (left: my .xmodel, right: Xilinx's .xmodel): (Screenshot from 2021-07-05 14-54-49)

Separated leaky_relu module in my .xmodel: (Screenshot from 2021-07-05 14-55-15)


Fabioni commented 3 years ago

Hey @HSqure,

Have a look at https://forums.xilinx.com/t5/AI-and-Vitis-AI/Mapping-LeakyReLU-to-DPU-in-PyTorch-flow/td-p/1253498

There it is written:

> This [Mapping LeakyReLU to DPU in PyTorch] should be supported in Vitis-AI 1.4, which is due out next month. For now, if you can change to relu from leakyrelu it should map to the DPU.

So apparently we should get Vitis-AI 1.4 this month (July 2021), and it should support this.


The big strange thing remains: why does the documentation say it is already supported!?

HSqure commented 3 years ago

> Hey @HSqure,
>
> Have a look at https://forums.xilinx.com/t5/AI-and-Vitis-AI/Mapping-LeakyReLU-to-DPU-in-PyTorch-flow/td-p/1253498
>
> There it is written:
>
> > This [Mapping LeakyReLU to DPU in PyTorch] should be supported in Vitis-AI 1.4, which is due out next month. For now, if you can change to relu from leakyrelu it should map to the DPU.
>
> So apparently we should get Vitis-AI 1.4 this month (July 2021), and it should support this.
>
> The big strange thing remains: why does the documentation say it is already supported!?

Yeah, it doesn't make sense, and their .xmodel was just made with the Caffe framework. For now there's only one thing to do: wait and evaluate Vitis-AI V1.4.

HSqure commented 3 years ago

@Fabioni Hey, I just updated to version 1.4 and found that they fixed everything; I didn't even have to do anything. (Screenshot from 2021-07-28 10-46-53)

Fabioni commented 3 years ago

> @Fabioni Hey, I just updated to version 1.4 and found that they fixed everything; I didn't even have to do anything.

Hey @HSqure, did you just update the Docker container (for the compilation), or also the Vitis-AI runtime on the board?

HSqure commented 3 years ago

> > @Fabioni Hey, I just updated to version 1.4 and found that they fixed everything; I didn't even have to do anything.
>
> Hey @HSqure, did you just update the Docker container (for the compilation), or also the Vitis-AI runtime on the board?

I compiled with the new V1.4 docker. But unfortunately this xmodel still has some problems when running in the WAA detection app. (screenshot)

(It's weird that none of the V1.4 WAA app demos pass the build test with the 2021.1 environment suite: the demos depend on OpenCV 3.4, but the new PetaLinux SDK uses an OpenCV 4.4 environment.)

kct890721 commented 2 years ago

Hi @HSqure, I have a few questions I'd like to ask you. Would you mind adding me on WeChat? My ID is H98798389. Thanks!

HSqure commented 2 years ago

> Hi @HSqure, I have a few questions I'd like to ask you. Would you mind adding me on WeChat? My ID is H98798389. Thanks!

Hi @kct890721, if it's related to this issue, just ask right here and I'll see what I can do.

kct890721 commented 2 years ago

Thank you so much for your help; I really need it, because my graduation project is to quantize a yolov5 model and I still haven't managed it after a long time. Do you know roughly how to quantize yolov5? Attached are my files and the problems I ran into: common.txt, Model.txt, quant_info.txt, quantize.txt (plus screenshots).

HSqure commented 2 years ago

> Thank you so much for your help; I really need it, because my graduation project is to quantize a yolov5 model and I still haven't managed it after a long time. Do you know roughly how to quantize yolov5? Attached are my files and the problems I ran into: common.txt, Model.txt, quant_info.txt, quantize.txt (plus screenshots).

Each quantizer step (calib -> test) needs the full eval flow run before and after it, from feeding data through the dataset loader to producing the mAP result. Also, when you call the quantizer for fast finetuning, you have to pass the eval function itself (its body and a reference to it) in for the quantizer to use. Make sure every eval runs correctly; in your case it looks like the eval itself is failing. Also note that you must wrap inference in with torch.no_grad(): to disable backpropagation.
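A sketch of what that looks like in code; evaluate, val_loader and the mAP computation are placeholders for your own eval flow:

    import torch

    def evaluate(model, loader):
        model.eval()
        with torch.no_grad():  # backpropagation must stay disabled
            for imgs, targets in loader:
                model(imgs)
        # ... compute and return mAP here ...

    # The fast-finetune step takes the eval callable plus its arguments,
    # so the quantizer can run it internally.
    quantizer.fast_finetune(evaluate, (quantizer.quant_model, val_loader))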

kct890721 commented 2 years ago

Hi, I'm now using your source code unchanged to quantize, with the weights-related parts switched from v3 to v5, but I run into the following problems. (Screenshot from 2021-11-19 17-33-09)

HSqure commented 2 years ago

> Hi, I'm now using your source code unchanged to quantize, with the weights-related parts switched from v3 to v5, but I run into the following problems. (Screenshot from 2021-11-19 17-33-09)

named_buffers is extra information generated after the model is trained. You can train with ultralytics' yolov3 and then use the code in my repo to load the weights and quantize. Also, when exporting, remember to export the weights only and set the options for compatibility with older PyTorch versions; the details are in the updated README.md in my repo.
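A sketch of that weights-only export, assuming an ultralytics-style checkpoint that stores the module under the "model" key; the legacy serialization flag keeps the file readable by the pre-1.6 PyTorch inside the docker:

    import torch

    ckpt = torch.load("best.pt", map_location="cpu")
    model = ckpt["model"].float()  # assumption: ultralytics-style checkpoint

    # Weights only, in the pre-1.6 serialization format.
    torch.save(model.state_dict(), "weights_only.pt",
               _use_new_zipfile_serialization=False)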

kct890721 commented 2 years ago

Thanks for the detailed answer. May I ask whether using your code to quantize yolov5 needs any other changes? Looking at them, yolov5 and yolov3 don't differ much apart from the yaml config.

kct890721 commented 2 years ago

Hey @HSqure, may I ask how your data images should be laid out, and what to do when I hit this problem? Thanks. (Screenshot from 2021-11-26 18-08-25)

HSqure commented 2 years ago

> Hey @HSqure, may I ask how your data images should be laid out, and what to do when I hit this problem? Thanks. (Screenshot from 2021-11-26 18-08-25)

The image data needs an extra directory level here; you can't point directly at the directory that contains all the image files.

kct890721 commented 2 years ago

Hey @HSqure, thanks for the reply. What do you mean by an extra directory level? I traced the source code and couldn't find it.

HSqure commented 2 years ago

> Hey @HSqure, thanks for the reply. What do you mean by an extra directory level? I traced the source code and couldn't find it.

I mean that the test-image dataset path passed into the program just needs to point to the enclosing directory.
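Presumably the expected layout is something like this (the directory names are made up), with the script receiving the top-level path:

    calib_data/            <- pass this directory to the script
        images/            <- the image files live one level down
            000001.jpg
            000002.jpg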

kct890721 commented 2 years ago

Hey @HSqure, thanks for the reply. I'd like to ask: the dataset does load (print(f'\nData Number: {len(dataset)}\n') shows it), so why do I still hit assert subset_len <= len(dataset)? (Screenshot from 2021-11-27 16-54-07)

HSqure commented 2 years ago

> Hey @HSqure, thanks for the reply. I'd like to ask: the dataset does load (print(f'\nData Number: {len(dataset)}\n') shows it), so why do I still hit assert subset_len <= len(dataset)? (Screenshot from 2021-11-27 16-54-07)

Because the different eval functions in there actually load two different test sets; it's a bit messy and hasn't been unified yet. You could go through all the dataloaders in there and check.

kct890721 commented 2 years ago

Hey @HSqure, thanks for the reply. I traced the code and couldn't find the issue you described, but I still can't get past the AssertionError. Could you point me in the right direction? Many thanks. quant_fast_finetune.txt

HSqure commented 2 years ago

> Hey @HSqure, thanks for the reply. I traced the code and couldn't find the issue you described, but I still can't get past the AssertionError. Could you point me in the right direction? Many thanks. quant_fast_finetune.txt

Trace subset_len and adjust its size until it satisfies the assert.
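For reference, the Vitis-AI sample scripts typically carve out the calibration subset along these lines, which also shows the safe way to size subset_len (the names follow the assert quoted above):

    import random
    from torch.utils.data import Subset

    # Never ask for more samples than the eval path actually loaded.
    subset_len = min(subset_len, len(dataset))
    indices = random.sample(range(len(dataset)), subset_len)
    calib_set = Subset(dataset, indices)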

kct890721 commented 2 years ago

Hey @HSqure, good afternoon. Where might this problem come from? Why are there nulls, and why the warning [VAIQ_WARN]: Node ouptut tensor is not quantized? (Screenshots from 2021-11-29 13-37-09 and 2021-11-29 13-36-43)

HSqure commented 2 years ago

> Hey @HSqure, good afternoon. Where might this problem come from? Why are there nulls, and why the warning [VAIQ_WARN]: Node ouptut tensor is not quantized? (Screenshots from 2021-11-29 13-37-09 and 2021-11-29 13-36-43)

I'm not sure about this one. All I can suggest is to make sure every eval function is invoked and runs correctly before and after each quantization step.

kct890721 commented 2 years ago

Hey @HSqure, good evening. I suspect the model I trained is the problem. Could you share the flow and source code you used to train v3?

HSqure commented 2 years ago

> Hey @HSqure, good evening. I suspect the model I trained is the problem. Could you share the flow and source code you used to train v3?

There's a link to the training code in my repo; take a look.

kct890721 commented 2 years ago

Hey @HSqure, about anchor_grid = extra_model_info['anchor_grid']: does this error mean the trained model is missing it, or something else? Looking forward to your reply. (Screenshots from 2021-11-30 17-25-11 and 2021-11-30 17-25-54)

HSqure commented 2 years ago

> Hey @HSqure, about anchor_grid = extra_model_info['anchor_grid']: does this error mean the trained model is missing it, or something else? Looking forward to your reply. (Screenshots from 2021-11-30 17-25-11 and 2021-11-30 17-25-54)

That part can't stay inside the model; if it does, quantization errors out, so it was pulled out.
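A sketch of that extraction, assuming an ultralytics-style Detect layer as the last module (detect_layer is hypothetical shorthand, and the buffer names follow the anchor_grid key above):

    import torch

    detect_layer = model.model[-1]  # assumption: last module is Detect

    # Keep the non-traceable buffers beside the model instead of inside it;
    # the post-processing code reads them back from this dict.
    extra_model_info = {
        "anchor_grid": detect_layer.anchor_grid.cpu(),
        "stride": detect_layer.stride.cpu(),
    }
    torch.save(extra_model_info, "extra_model_info.pt")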

kct890721 commented 2 years ago

Hey @HSqure, good afternoon. Now that quantization is done and I have the .xmodel, what other steps are needed to deploy it on the FPGA? Could you walk me through it?

niuxjxlnx commented 2 years ago

Please refer to the PyTorch-format yolo series model deployments in the Vitis-AI Model Zoo:
https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo#Detection

feng040107 commented 9 months ago

@kct890721 Could you please help solve the following problems? (screenshot)

feng040107 commented 9 months ago

@kct890721 Would you mind sharing the model_with_post_precess code? How did you solve KeyError: anchor_grid?

feng040107 commented 9 months ago

@HSqure When name_buffers shows up, is it the train.py code that needs to be changed?

useruser2023 commented 8 months ago

@HSqure @Fabioni Despite the fact that it has been three years, we still have this problem in Vitis AI 2.5 for yolov3 and yolov5 quantization:

    import torch
    from pytorch_nndct.apis import Inspector          # Vitis-AI quant tool
    from models.common import DetectMultiBackend      # yolov5 repo
    from utils.callbacks import Callbacks             # yolov5 repo
    from utils.torch_utils import select_device       # yolov5 repo

    batch_size = 32
    device = select_device("", batch_size=batch_size)
    dnn = None
    data = "dataset.yaml"
    half = False
    imgsz = 320
    task = "val"
    single_cls = False
    callbacks = Callbacks()
    compute_loss = None
    augment = False
    workers = 8
    is_coco = False
    quant_mode = "calib"
    target = "0x0000000000"
    weights = "best.pt"

    input = torch.randn([1, 3, 320, 320]).to(device)
    model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
    inspector = Inspector(target)
    inspector.inspect(model.model, input, device=device)

> @Fabioni Me too! I separated out the post-processing part and it works, and I also passed the DPU compilation step. But unfortunately there's a new problem: when I run the edge app with the .xmodel, VART says there are 75 subgraphs (too many).

I commented out the same code and also the method self._make_grid, but the same error still persists. Could you please assist me if you can recall anything?