Closed HSqure closed 2 years ago
Hey @HSqure, I have the same problem. I can't give you the solution, but I can explain the problem :D The documentation says that the model has to pass the torch.jit.trace test, and that test fails in the make_grid function. The problem is that this function is not traceable.
I am also interested in how Xilinx did it in the provided yolo models.
My idea is the following: if you investigate where the make_grid function is used, it is only used in the three layers called just "yolo". (have a look at https://netron.app/?url=https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov3.cfg).
So I think this is actually only post-processing and not really part of the network, and we could just remove it before the quantization and put it back afterwards.
I haven't been able to try that idea yet, though.
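To make the idea concrete, here is a minimal sketch of what the grid decoding could look like when it runs on the host instead of inside the traced model. It is plain Python and uses the original YOLOv3 box equations, which is an assumption: the ultralytics code may decode differently, and the function names (make_grid, decode_cell, decode_head) are only illustrative.

```python
import math


def make_grid(nx, ny):
    """Return the (x, y) cell offsets of an nx-by-ny grid in row-major order."""
    return [(x, y) for y in range(ny) for x in range(nx)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_cell(raw, cell, anchor, stride):
    """Decode one raw prediction (tx, ty, tw, th) for one grid cell.

    Assumes the original YOLOv3 box equations; other YOLO variants
    use different formulas.
    """
    tx, ty, tw, th = raw
    cx, cy = cell
    aw, ah = anchor
    bx = (sigmoid(tx) + cx) * stride  # box centre x in input pixels
    by = (sigmoid(ty) + cy) * stride  # box centre y in input pixels
    bw = aw * math.exp(tw)            # box width scaled from the anchor
    bh = ah * math.exp(th)            # box height scaled from the anchor
    return bx, by, bw, bh


def decode_head(raw_head, nx, ny, anchor, stride):
    """Decode a whole head; raw_head holds one (tx, ty, tw, th) per cell."""
    grid = make_grid(nx, ny)
    return [decode_cell(raw, cell, anchor, stride)
            for raw, cell in zip(raw_head, grid)]
```

With something like this outside the model, the quantizer would only see the convolutional part, and the raw head outputs would be decoded afterwards on the CPU.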
@Fabioni Thanks for your detailed analysis! I'm curious how Xilinx fixed that problem too. Anyway, it's a really cool idea. I'll try it, thanks!
let me know about your findings :-)
I just found this tutorial by Xilinx: https://github.com/Xilinx/Vitis-Tutorials/tree/master/Machine_Learning/Design_Tutorials/07-yolov4-tutorial
I think the same make_grid problem should be present in yolov4 (since I actually have it with yolov4-scaled), so it should somehow be covered in that tutorial, but I couldn't find anything about it 😬
@HSqure I got it to work like I proposed. I don't know if it would be called dirty but it works 😅
@Fabioni Me too! I separated the post-processing part out and it works, and I also passed the DPU compilation step successfully. Unfortunately there is a new problem: VART reports 75 subgraphs (too many) when I run the edge app with the .xmodel.
75 subgraphs:
@HSqure unfortunately I'm struggling with even getting a .xmodel file after the quantization. So I'm a step behind you and can't reproduce your problem right now.
How do you get the xmodel? The export_xmodel() function only produces a .py file when I call it. I read that I first have to run the quantization with "calib", then run inference once with "test", and only then can I export the xmodel. Is this the right way? How does it work exactly?
You have to run calib mode first. Then you should run the combination of "test" mode and --deploy.
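As a rough illustration of that two-pass flow, here is a sketch built around the vai_q_pytorch calls torch_quantizer, export_quant_config and export_xmodel. The wrapper run_quant_pass, its arguments and the make_quantizer hook are hypothetical and not taken from anyone's quant.py in this thread; evaluate stands for whatever function runs the forward passes on the quantized model.

```python
def run_quant_pass(model, dummy_input, evaluate, quant_mode,
                   deploy=False, make_quantizer=None):
    """Run one quantization pass: 'calib' first, then 'test' with deploy=True."""
    if quant_mode not in ("calib", "test"):
        raise ValueError("quant_mode must be 'calib' or 'test'")
    if deploy and quant_mode != "test":
        raise ValueError("xmodel export is only done in 'test' mode")
    if make_quantizer is None:
        # Imported lazily so the sketch can be read without Vitis AI installed.
        from pytorch_nndct.apis import torch_quantizer as make_quantizer

    quantizer = make_quantizer(quant_mode, model, (dummy_input,))
    # The forward passes must go through the quantized model, not the original.
    result = evaluate(quantizer.quant_model)

    if quant_mode == "calib":
        quantizer.export_quant_config()  # saves the calibration result
    elif deploy:
        quantizer.export_xmodel()        # writes the .xmodel for the compiler
    return result
```

Usage would be two separate runs: first run_quant_pass(model, x, evaluate, "calib"), then run_quant_pass(model, x, evaluate, "test", deploy=True).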
What does your quant.py look like?
Here, I just uploaded the entire project (except the model files, which are over 25MB). Come have a look.
@HSqure I am now having a similar problem: my compiled xmodel consists of 115 subgraphs, all of which have get_attr("device") == "DPU". Why is it not just one subgraph?
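For anyone who wants to check their own model, here is a small sketch that counts the child subgraphs of a compiled .xmodel per device. It assumes the xir Python package from the Vitis-AI environment; the helper names load_child_subgraphs and count_by_device are made up for illustration.

```python
from collections import Counter


def load_child_subgraphs(xmodel_path):
    """Deserialize an .xmodel and return its child subgraphs in topological order."""
    import xir  # part of the Vitis-AI environment; imported lazily

    graph = xir.Graph.deserialize(xmodel_path)
    return graph.get_root_subgraph().toposort_child_subgraph()


def count_by_device(subgraphs):
    """Count subgraphs per 'device' attribute (e.g. DPU, CPU, USER)."""
    counts = Counter()
    for sg in subgraphs:
        device = sg.get_attr("device") if sg.has_attr("device") else "UNKNOWN"
        counts[device.upper()] += 1
    return counts
```

Calling count_by_device(load_child_subgraphs("model.xmodel")) on the board or in the docker should show how many DPU and CPU subgraphs the compiler produced; the file name here is only a placeholder.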
After comparing with Xilinx's yolo3 .xmodel (deploy), I think there's a problem with the leaky_relu ops, which makes that part of the calculation run on the CPU. Normally, the leaky_relu op should be part of the conv module, but in my .xmodel the leaky_relu ops are obviously separated.
Left: my .xmodel. Right: Xilinx's .xmodel.
Hey @HSqure,
Have a look at https://forums.xilinx.com/t5/AI-and-Vitis-AI/Mapping-LeakyReLU-to-DPU-in-PyTorch-flow/td-p/1253498
There it is written:
This [Mapping LeakyReLU to DPU in PyTorch] should be supported in Vitis-AI 1.4, which is due out next month. For now, if you can change to relu from leakyrelu it should map to the DPU.
So apparently we should get Vitis-AI 1.4 this month (July 2021), and it should support this.
One big oddity remains: why does the documentation say that it is already supported!?
Yeah, it doesn't make sense, and their .xmodel was just built from the Caffe framework. For now the only option is to wait for Vitis-AI v1.4 and evaluate it.
@Fabioni Hey, I just updated to version 1.4 and found that they fixed everything. I didn't even have to do anything.
Hey @HSqure, did you only update the Docker container (for the compilation), or also the Vitis-AI runtime on the board?
I compiled with the new v1.4 docker. Unfortunately, this xmodel still has some problems when running in the WAA detection app.
(It's weird that none of the v1.4 WAA app demos pass the build test with the 2021.1 environment suite. The demos depend on OpenCV 3.4, but the new PetaLinux SDK uses an OpenCV 4.4 environment.)
Hi @HSqure, I have some questions I'd like to ask you. Would it be convenient to add you on WeChat? My ID is H98798389. Thanks.
Hi @kct890721, if it's related to this issue, just ask right here and I'll do my best to take a look.
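Thank you very much for your help. I really need your assistance, because my graduation report is about quantizing a yolov5 model and I have been stuck on it for a long time. Do you have a rough idea of how to quantize yolov5? Attached are my files and the problems I ran into: common.txt Model.txt quant_info.txt quantize.txt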
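Every quantizer step (calib -> test) needs a complete eval run before and after it, from feeding data through the dataset loader to producing the mAP result. In addition, when you call the quantizer for finetuning, you also have to pass the eval function itself in so the quantizer can use it. It's best to make sure every eval runs correctly. In your case it looks like eval is what's failing. Also note that you must use with torch.no_grad(): to disable backpropagation. You can use vim inside docker to modify the library source for debugging (it resets to the default after a restart). The source is quite shallow and easy to read.
Hello, I am now using your source code as-is for the quantization and have switched the weights-related parts from v3 to v5, but I ran into the following problems: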
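named_buffers is extra information generated after model training. You can train with ultralytics' yolov3 and use the code in my repo to load the weights and quantize. Also, when exporting, remember to export only the weights and to set the parameters for compatibility with older PyTorch versions. The details have been added to README.md in the repo.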
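Thanks for your detailed answer. If I want to use your code to quantize yolov5, do I need to change anything else? From what I've seen, yolov5 and yolov3 don't differ much apart from the yaml config.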
Hey @HSqure, may I ask how your data images are laid out, and how to handle a problem like the one I'm running into? Thank you.
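The image data needs an extra directory level here. You can't pass the directory that directly contains all the images.
Hey @HSqure, thanks for your reply. What do you mean by creating an extra directory level? I looked through the source code and couldn't trace it.
It means that the test image dataset path passed into the program only needs to point to the directory.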
Hey @HSqure, thanks for your reply. The dataset does get loaded, as print(f'\nData Number: {len(dataset)}\n') shows, so why do I still hit assert subset_len <= len(dataset)?
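That's because the different eval functions in there actually load two different test sets. It's a bit messy and not unified yet. You can go through and check all the dataloaders in there.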
Hey @HSqure, thanks for your reply. I traced the code and didn't find the problem you described, but I still can't resolve the AssertionError. Could you give me a few pointers? Thanks. quant_fast_finetune.txt
Trace subset_len and adjust its size until it satisfies the assert check.
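The assert only fires when the requested subset is larger than the dataset, so one way to satisfy it is to clamp subset_len before sampling. A minimal sketch in plain Python; pick_subset_indices is a hypothetical helper and not code from the repo:

```python
import random


def pick_subset_indices(dataset_len, subset_len, seed=None):
    """Pick random sample indices, never asking for more than the dataset holds."""
    if subset_len <= 0:
        raise ValueError("subset_len must be positive")
    # Clamp instead of asserting, so a small test set does not abort the run.
    subset_len = min(subset_len, dataset_len)
    rng = random.Random(seed)
    return rng.sample(range(dataset_len), subset_len)
```

The returned indices could then be used to build the subset of the dataset that is fed to the fast-finetune step.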
Hey @HSqure, good afternoon. Where does this problem roughly come from? Why are there null values, and why do I get the warning [VAIQ_WARN]: Node ouptut tensor is not quantized?
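I'm not sure about this one. I can only suggest making sure that every eval function is called, and runs correctly, before and after each quant step.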
Hey @HSqure, good evening. I suspect the model I trained is the problem. Could you share the flow and source code you used to train v3?
There is a link to the training code in my repo. You can take a look there.
Hey @HSqure, about anchor_grid = extra_model_info['anchor_grid']: does this problem mean that the trained model doesn't contain it, or does it mean something else? I'd appreciate a reply.
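This part can't be inside the model. If it is, quantization raises an error, so I took it out.
Hey @HSqure, good afternoon. Now that I have finished the quantization and have the .xmodel, what further steps are needed to deploy it on the FPGA? Could you explain briefly?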
Please refer to the deployment of the PyTorch-format YOLO series models in the Vitis-AI model zoo.
https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo#Detection
@kct890721 Please solve the following problems
@kct890721 Could you share the code for model_with_post_precess? How did you solve the KeyError: anchor_grid?
@HSqure For name_buffers to appear, does the code in train.py need to be changed?
@HSqure @Fabioni Even though it has been three years, we still have this problem in Vitis-AI 2.5 when quantizing yolov3 and yolov5:
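import torch
from pytorch_nndct.apis import Inspector  # Vitis-AI quantizer inspector
# The imports below assume the YOLOv5 repo layout.
from models.common import DetectMultiBackend
from utils.callbacks import Callbacks
from utils.torch_utils import select_device

batch_size = 32
device = select_device("", batch_size=batch_size)
dnn = None
data = "dataset.yaml"
half = False
imgsz = 320
task = "val"
single_cls = False
callbacks = Callbacks()
compute_loss = None
augment = False
workers = 8
is_coco = False
quant_mode = "calib"
target = "0x0000000000"
weights = "best.pt"
# Dummy input used only for tracing; renamed so it does not shadow the built-in input().
dummy_input = torch.randn([1, 3, imgsz, imgsz]).to(device)
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
inspector = Inspector(target)
inspector.inspect(model.model, dummy_input, device=device)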
I commented out the same code and also the method self._make_grid, but the same error still persists. Could you please help me if you can recall anything?
Hello developer! I used ultralytics' yolov3 demo (PyTorch v1.4 version) to train the model and successfully ran the evaluation program in the latest Vitis-AI GPU docker:
But when I try to quantize the model, it shows this message:
Then I found out where the issue comes from:
I replaced it with individual torch tensor operations:
But that produces a new error message:
I can't even find where this ImplicitTensorToNum comes from in my project! Anyway, I have no idea how to fix this problem. Looking for some help, thanks!