NVlabs / DIODE

Official PyTorch implementation of Data-free Knowledge Distillation for Object Detection, WACV 2021.
https://openaccess.thecvf.com/content/WACV2021/html/Chawla_Data-Free_Knowledge_Distillation_for_Object_Detection_WACV_2021_paper.html

Confusion about the one box dataset #4

Closed merlinarer closed 2 years ago

merlinarer commented 2 years ago

Thank you for sharing! I have a question about your code. I noticed that the code loads images from the onebox dataset and mixes them with the randomly initialized tensor, using the result as the initial tensor for generating images: `init = (args.real_mixin_alpha)*imgs + (1.0-args.real_mixin_alpha)*init`. However, the onebox dataset is the one we want to generate, isn't it?

akshaychawla commented 2 years ago

Thank you for your interest in our work! You are correct: we want to generate the onebox dataset with this inversion process.

By default, we set `args.real_mixin_alpha=0.0` here: https://github.com/NVlabs/DIODE/blob/80a396d5772528d4c393a301b0a1390eb7e7e039/main_yolo.py#L256 This ensures that the initialization consists of only random noise, i.e. the expression becomes `init = 0.0*imgs + (1.0 - 0.0)*init`.

This argument was added because we were curious to explore what the model inversion process generates when initialized with something other than random noise. An example of such an initialization would be using bounding boxes and corresponding images from a real dataset such as COCO or Pascal VOC0712. In this scenario, `args.real_mixin_alpha` controls how close the initialization is to the original image.
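A minimal sketch of that blend (the names `imgs`, `init`, and `real_mixin_alpha` follow the snippet above; the shapes and surrounding setup are illustrative assumptions, not the repository's code):

```python
import torch

def mix_init(imgs: torch.Tensor, init: torch.Tensor,
             real_mixin_alpha: float = 0.0) -> torch.Tensor:
    """Blend real images into a (typically random-noise) init tensor.

    With real_mixin_alpha=0.0 (the default), the result is the untouched
    noise tensor `init`, so generation starts from pure noise.
    """
    return real_mixin_alpha * imgs + (1.0 - real_mixin_alpha) * init

# Example: start from random noise the same shape as a batch of images.
imgs = torch.zeros(2, 3, 256, 256)   # placeholder "real" batch
init = torch.randn_like(imgs)        # random-noise initialization
blended = mix_init(imgs, init, real_mixin_alpha=0.0)
assert torch.equal(blended, init)    # alpha=0.0 leaves the noise unchanged
```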

merlinarer commented 2 years ago


Thank you for your detailed reply, I got it now. So, are the evaluation results in your paper based on an initialization with a real dataset or not? Another question is about Table 3. The top and bottom rows show the results using original and generated images & labels, respectively, and both are quite clear. My confusion is about the middle row. You mentioned that it uses synthetic images conditioned on MS-COCO labels. However, MS-COCO labels contain multiple objects per image, so how did you use these labels to generate the corresponding images? Looking forward to your reply!

akshaychawla commented 2 years ago
  1. The evaluation results in our paper are based on using random noise as the initialization for the DIODE generation process, i.e. `args.real_mixin_alpha` was always set to 0.0.

  2. For the middle row of Table 3, you are correct that we use multiple object labels per image when we sample labels from COCO. This is because our object detection network, Yolo-V3, and its loss function allow predicting multiple objects per image. Hence, during the inversion process, we can condition on multiple bboxes for every image.

In fact, we use this ability to condition on multiple bboxes as part of a unique bbox sampling procedure called false positive sampling (FP sampling). In FP sampling, we observed that during the image generation process the network constantly tries to add context to the image: if we condition on a road bike, we often see a human generated close to it. To exploit this, we aggregate the high-confidence false positive detections that appear during the generation process, which leads to more realistic initialization bboxes and generated images. See Sections 3.1 and 5.2, and Figure 3.
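A hedged sketch of that FP-sampling idea (this is not the repository's implementation; the box format, thresholds, and function names below are illustrative assumptions):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def fp_sample(conditioned, detections, conf_thresh=0.9, iou_thresh=0.5):
    """Aggregate high-confidence detections that match no conditioned box.

    `conditioned` is a list of (bbox, class_id) pairs we are inverting
    against; `detections` is a list of (bbox, class_id, confidence) tuples
    produced by the detector during generation. Detections that overlap no
    conditioned box are "false positives" w.r.t. the current labels, and are
    kept as extra context (e.g. the human generated next to a road bike).
    """
    augmented = list(conditioned)
    for bbox, class_id, conf in detections:
        if conf < conf_thresh:
            continue  # only aggregate high-confidence detections
        if all(iou(bbox, b) < iou_thresh for b, _ in augmented):
            augmented.append((bbox, class_id))
    return augmented
```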

merlinarer commented 2 years ago

Very clear, thanks for your reply!