How to run evaluate.py correctly

Culturedcucumber commented 1 year ago

Sorry to bother you. I have trained a patch following readme.md and tried to run evaluate.py to test its performance on other detectors, but the results are not satisfactory. I have two questions. First, how can I tell which epoch's patch performs best? I evaluated the patch after 1000 epochs and the one after 300 epochs, and the 300-epoch patch is much better than the 1000-epoch one, although this may be caused by a mistake in how I run evaluate.py. Second, is there any preparation I should do before running evaluate.py? Each time I run it, I just generate labels for the clean images on the different detectors and get the mAPs on those detectors. I wonder whether something is wrong with the detector weights I downloaded from the websites listed in the provided files.

ziyannchen commented 1 year ago

There may be something wrong with how you are running it.

The number of training epochs, as a hyper-parameter, does affect the attack performance of the adversarial patch. In our experience, a patch generally performs better with epoch=1000 than with epoch=300. However, the marginal returns diminish when the epoch count is increased beyond 1000 with the other settings unchanged, mainly because the optimization has already converged by the time the learning rate has dropped to a small magnitude.
To evaluate the attack performance, you should put the label files pre-detected on clean samples in data/[your dataset]/labels/. If you are using models and datasets supported by our repo, you can use the label files provided on the cloud. Please see the documentation for details.
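
As a quick sanity check before running evaluate.py, a small script can confirm that label files are actually present under the labels directory. This is only a sketch: the helper name `count_label_files`, the one-subdirectory-per-label-source layout, and the `.txt` suffix are assumptions for illustration, not part of the repo.

```python
from pathlib import Path


def count_label_files(label_root, suffix=".txt"):
    """Return {subdirectory name: number of label files} under label_root.

    Hypothetical helper: the directory layout and the .txt suffix are
    assumptions, not something the repo guarantees.
    """
    root = Path(label_root)
    if not root.is_dir():
        raise FileNotFoundError(f"label directory not found: {root}")
    counts = {}
    # A broken symlink is not a directory, so it is skipped here and
    # simply will not show up in the result.
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[sub.name] = sum(1 for f in sub.rglob(f"*{suffix}") if f.is_file())
    return counts
```

Calling `count_label_files("data/INRIAPerson/labels")` and looking for a missing or zero-count entry is one way to spot an empty or unlinked label folder before the evaluation fails.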

Please share more of the error output so that we can figure out the problem.

Culturedcucumber commented 1 year ago

Thank you for your reply!

For the training part: I downloaded the model weights following download.sh in detlib/weights and put them in the corresponding folders. Then I ran train_optim.py without any changes to the code and trained my patch on v2 for 1000 epochs.

For the evaluation part: I ran into some problems the first time I ran evaluate.py.

It generated labels in ./data/INRIAPerson/Test/pos instead of ./data/test/.../yolov2, so I made some changes to the code, and it now generates attack_labels and des_labels where they should be. I also used gen_det_labels.py to generate labels under ./data/INRIAPerson/labels, following readme.md.
When I set test_origin and test_gt to 'True', it fails with the error 'No Ground Truth Files Found'. I couldn't figure out the reason, so I just commented out lines 145-168 in evaluate.py and only obtained det_map for evaluation. Maybe this is where my problem lies; I will keep trying to understand how the code works.

Those are all the changes I have made to this project while trying to understand how evaluate.py works. I hope I have explained my situation clearly.

ziyannchen commented 1 year ago

Can you show your label file tree here? From your description of the second point, it looks like the ground-truth label files were not put in the right place. In fact, if you are using the supported models and datasets, you don't need to generate the labels yourself. We provide both the model labels and the ground-truth labels, which you can download from the cloud links, see GoogleDrive | BaiduCloud. But if you are using a custom dataset, you do need to convert the ground-truth labels (from the annotations) into the supported format before setting test_origin/test_gt=True.

Culturedcucumber commented 1 year ago

Thanks for your reply! My label file tree is exactly the same as the one described in your readme.md, and I will explain the silly, immature mistake I discovered over the past several hours. Thank you very much for your reply and for patiently answering my questions! Because my English is limited, I wrote the rest of this comment in Chinese.

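Following your hint, I rechecked the places where I had changed the code and found that I had overlooked an important part: the soft-linking of the detection labels. Because I am running the code on Windows, the Linux-specific commands in the original project do not take effect, so the soft link failed and the pre-generated clean-image detection labels in ./data/[datasets]/labels could not be found. I have made the corresponding changes and can now generate det_mAP, gt_mAP, and ori_mAP.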
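
For anyone hitting the same problem on Windows, one portable approach is to create the link from Python and fall back to copying the directory when symlink creation is not permitted. This is a minimal sketch, not the repo's own linking code (which is not shown in this thread); the helper name `link_or_copy_dir` is hypothetical.

```python
import os
import shutil


def link_or_copy_dir(src, dst):
    """Expose the directory src at dst: symlink if possible, else copy.

    Hypothetical helper. Returns "symlink" or "copy" so the caller can
    log which path was taken.
    """
    src = os.path.abspath(src)
    if not os.path.isdir(src):
        raise FileNotFoundError(f"source directory not found: {src}")
    if os.path.lexists(dst):
        raise FileExistsError(f"destination already exists: {dst}")
    os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
    try:
        os.symlink(src, dst, target_is_directory=True)
        return "symlink"
    except (OSError, NotImplementedError):
        # On Windows, creating symlinks usually requires administrator
        # rights or Developer Mode; a plain copy avoids that requirement
        # at the cost of extra disk space.
        shutil.copytree(src, dst)
        return "copy"
```

A copy goes stale if the source labels are regenerated later, so the returned string is worth logging.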

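Second, your answer addressed my earlier confusion about performance getting worse with more training epochs: the 1000-epoch result should be far better than the 300-epoch one. So I restored the original max_epoch and obtained a patch optimized for 1000 epochs on the Yolov4tiny model. However, after running evaluate, the results (det_mAP) are still far from those in the paper. The gt_mAP results are fairly close, though on some models they are even better than in the paper. My simple guess is that there are some small differences in the optimization process. I will keep running more models and compare the results.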

Finally, a basic question: which metric do you report in the paper (det_mAP, gt_mAP, or ori_mAP)?

ziyannchen commented 1 year ago

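ori_mAP refers to the detector's performance when it is not attacked, i.e. on clean samples; this value only serves as a reference for the detector's original performance. gt_mAP is the performance when the annotations are used as ground truth during evaluation. det_mAP is the performance when the detector's own detections on clean samples are used as ground truth (that is, the detector's original mAP is treated as 1). We describe our evaluation protocol under Evaluation Metric in Section 4.1 of the paper; the numbers reported in the paper are det_mAP.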
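
The difference between the three metrics comes down to which predictions are scored against which reference boxes. The toy sketch below illustrates that with a simplified single-class AP; it is not the repo's evaluation code, and the IoU threshold of 0.5, the interpolation scheme, the function names, and the example boxes are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, refs, iou_thr=0.5):
    """Simplified single-class AP.

    preds: {image_id: [(score, box), ...]}, refs: {image_id: [box, ...]}.
    AP is the area under the monotone precision-recall envelope.
    """
    n_ref = sum(len(boxes) for boxes in refs.values())
    if n_ref == 0:
        return 0.0
    scored = sorted(
        ((s, img, box) for img, dets in preds.items() for s, box in dets),
        key=lambda t: -t[0],
    )
    used = {img: [False] * len(boxes) for img, boxes in refs.items()}
    precisions, recalls, tp = [], [], 0
    for k, (_, img, box) in enumerate(scored, 1):
        # Match each prediction to its best-overlapping reference box.
        best, best_j = 0.0, -1
        for j, ref in enumerate(refs.get(img, [])):
            overlap = iou(box, ref)
            if overlap > best:
                best, best_j = overlap, j
        if best_j >= 0 and best >= iou_thr and not used[img][best_j]:
            used[img][best_j] = True
            tp += 1
        precisions.append(tp / k)
        recalls.append(tp / n_ref)
    # Make precision non-increasing from right to left, then integrate.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def strip_scores(dets):
    """Turn detections into reference boxes by dropping the scores."""
    return {img: [box for _, box in d] for img, d in dets.items()}


# Made-up data: two annotated people, the clean detector finds one of
# them, and the patch suppresses that detection.
annotations = {"img0": [(0, 0, 10, 10), (20, 20, 30, 30)]}
clean_dets = {"img0": [(0.9, (0, 0, 10, 10))]}
attacked_dets = {"img0": []}

ori_map = average_precision(clean_dets, annotations)      # clean vs annotations
gt_map = average_precision(attacked_dets, annotations)    # attacked vs annotations
det_map = average_precision(attacked_dets, strip_scores(clean_dets))  # attacked vs clean detections
```

Scoring `clean_dets` against `strip_scores(clean_dets)` gives exactly 1, which is the sense in which det_mAP treats the detector's original performance as mAP=1; det_mAP therefore measures only the drop caused by the patch, independent of how good the detector was to begin with.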

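As for the case you mentioned where det_mAP does not drop as much as expected, you may need to check what went wrong. Did you change the training configuration or the code?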

Culturedcucumber commented 1 year ago

Hello, thank you for your answer! I am very sorry for asking such a question without reading your paper carefully!

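Regarding the training configuration: because I ran out of GPU memory when running the Yolov4 model, I changed batch_size to 1 (I did not change it for the other models, but because training takes so long I only ran 300 epochs on them, so those results cannot serve as reference data). I used a stop-sign image to initialize the patch and made no other changes. I suspect that the smaller batch_size leads to poorer convergence, or that the pixels of the initialization image (stop_sign) work against the adversarial effect (this is just my own speculation). I will try to reproduce the experiment on Yolov3 or Yolov3tiny with the same training configuration as in your paper and check the results.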

In the Yolov2 experiments, however, I did not make any changes; the training configuration is ./configs/combine/v2.yaml. I will look into the problems there step by step.

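Thank you again for your patient answers! My questions have been well resolved with your help, and I will run more experiments to find and fix any remaining problems.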

ziyannchen commented 1 year ago

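Batch size is a hyper-parameter with a relatively large impact, and changing it will indeed affect performance to some extent. You could try training again with the experimental settings described in the paper. Good luck!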

VDIGPKU / T-SEA