Hi, thank you for the excellent work and congratulations on the ECCV oral.
I would like to ask some questions.
Can we say that the model's ability to generalize across categories comes from the aggregated training and the STN?
From the ablation studies in Table 5, adding the STN slightly decreases pixel-level performance on both datasets, but it brings a large improvement on MVTec AD and MPDD, especially MPDD. The aggregated training style also shows a large improvement in Table 3. So can we assume that aggregated training provides more samples to the model and also correlates positively with the STN, which learns the transformations and improves the model's representation?
How does your registration strategy work with multiple objects?
In ordinary computer vision tasks, image registration aims to find corresponding points and exchange or share information between an image pair. You apply it in this model, but can you explain how registration works when multiple objects need to be transformed? For example, the samples in the tubes category of MPDD contain multiple tubes in one image; how does the model know which tube should be mapped to which?
In the testing phase, how do you set up the model inputs that are supposed to come from the same target category when there is only one test image (e.g., MVTec AD carpet)? Or do you simply discard one branch of the model and only keep the conv_block + STN + encoder + predictor for testing?
For the few-shot annotation k=2,4,8 in the individual setting, I think it means picking 2, 4, or 8 images for each category; are these images selected randomly or fixed? And for the aggregated training phase, when k=2, does it mean picking 2 images from each class for training? Are those also random or fixed?
Does your affine transformation work in image space or in feature space? You use a 1x1 kernel for the conv layer, but I saw that the outputs of the convolutional blocks have different shapes ([32, 64, 56, 56], [32, 128, 28, 28], [32, 256, 14, 14]), so does the transform operate in feature space?
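To illustrate what I mean by transforming in feature space, here is a minimal sketch (assuming PyTorch; the shapes are from my reading of the code, but the snippet itself is my own, not your repo's):

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration: apply a 2x3 affine matrix directly to a feature map
# of shape [B, C, H, W], i.e. "transform in feature space".
feat = torch.randn(32, 64, 56, 56)                      # e.g. output of the first conv block
theta = torch.eye(2, 3).unsqueeze(0).repeat(32, 1, 1)   # [B, 2, 3] affine parameters (identity here)

# affine_grid / grid_sample accept any [B, C, H, W] tensor,
# so the same warp could be applied to features instead of raw images.
grid = F.affine_grid(theta, feat.size(), align_corners=False)
warped_feat = F.grid_sample(feat, grid, align_corners=False)
print(warped_feat.shape)  # torch.Size([32, 64, 56, 56])
```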
I have read your paper completely and skimmed the code; the questions may contain some misunderstandings of your concepts, so please tell me if I have understood anything wrongly~ Thanks!
The STN makes it possible to apply large-scale transformations. Otherwise, registration can possibly only be achieved within the limited receptive field of the neural network. On MVTec, where most objects are centered, the STN may bring limited improvement, but on MPDD the STN is more important since large-scale transformations are often needed.
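For reference, a minimal spatial transformer sketch in PyTorch (illustrative only; the layer sizes and structure are assumptions, not the exact configuration in the repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Minimal spatial transformer: predict a 2x3 affine matrix and warp the input."""
    def __init__(self, in_channels):
        super().__init__()
        # Localization network (placeholder sizes, not the repo's configuration).
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 6),
        )
        # Start from the identity transform so early training does not warp wildly.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                  # per-sample affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # large-scale warps are possible here
```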
A better solution may be to detect each object first, and then apply registration to each object separately. Naively registering the whole image may still work if it can align part of the objects, compared with no registration at all.
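A rough sketch of that idea, assuming some detector already provides object boxes (`detect_objects` and `register` are hypothetical placeholders, not part of the released code):

```python
import torch

def register_per_object(image, support_image, detect_objects, register):
    """Sketch only: register each detected object crop separately, then paste it back.
    Matching which object in `image` corresponds to which in `support_image`
    is itself non-trivial and is glossed over here."""
    out = image.clone()
    for (x1, y1, x2, y2) in detect_objects(image):           # list of box coordinates
        crop = image[:, :, y1:y2, x1:x2]
        support_crop = support_image[:, :, y1:y2, x1:x2]     # naive: same location in the support image
        out[:, :, y1:y2, x1:x2] = register(crop, support_crop)
    return out
```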
Augmentations could be used, and then we have many images even though k=1.
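For example, with torchvision (the specific transforms here are illustrative, not necessarily the ones used in the paper):

```python
from PIL import Image
from torchvision import transforms

# Expand a single support image (k=1) into many augmented views.
augment = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])

support_image = Image.open("support.png").convert("RGB")   # placeholder path
support_set = [augment(support_image) for _ in range(32)]  # many views from one image
```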
No. For aggregated training, experiments are conducted in a leave-one-out setting. For each experiment, we only test on the target category, so all images from the other categories can be used for training, the same as the baseline methods TDG+ and DiffNet+. The few-shot number k only affects how many images we can take from the target category. For the individual setting, fixed images are selected for fair comparison.
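To make the split concrete, roughly (the category names and the `load_category` helper are illustrative, not from the released code):

```python
categories = ["carpet", "grid", "leather", "tile", "wood"]   # subset, for illustration

target = "carpet"   # leave-one-out: the category we test on
k = 2               # few-shot number for the target category

# Aggregated training: all images from every *other* category...
train_images = [img for c in categories if c != target for img in load_category(c)]
# ...plus only k fixed support images from the target category.
support_images = load_category(target)[:k]
```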