Open stoneshi-1999 opened 1 year ago
Yes, our method operates only on images at the inference stage, like neural style transfer methods. DINO-ViT is used only during inference, and the OceanBag dataset is used only for testing. If you want to apply this method to other datasets, note that both the diffusion model and the ViT are pre-trained on ImageNet, so the inference images should not have overly complex backgrounds. Because diffusion sampling is random, you need to select appropriate negative labels and hyperparameters, and you may need to run it a few times to get a good mask. We have tested it on other clothing types, and the code will be updated in the future.
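For the "run it a few times" part, here is a rough sketch of a seed sweep. It is not the repository's code: `fake_generate_mask` is a hypothetical stand-in for one inference run, and the foreground-fraction filter (`min_fg`, `max_fg`) is an assumed sanity check that only discards nearly empty or nearly full masks. Choosing the best of the remaining masks would still be done by eye.

```python
import random
from typing import Callable, Dict, Iterable, List

Mask = List[List[int]]  # binary mask, 1 = foreground pixel


def fake_generate_mask(seed: int, size: int = 8) -> Mask:
    """Hypothetical stand-in for one diffusion inference run (NOT the real API)."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    return [[1 if rng.random() < 0.3 else 0 for _ in range(size)]
            for _ in range(size)]


def foreground_fraction(mask: Mask) -> float:
    """Share of pixels marked as foreground; 0.0 for an empty mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total


def sweep_seeds(generate: Callable[[int], Mask],
                seeds: Iterable[int],
                min_fg: float = 0.05,
                max_fg: float = 0.95) -> Dict[int, Mask]:
    """Run once per seed and keep masks that are neither nearly empty nor nearly full."""
    kept: Dict[int, Mask] = {}
    for seed in seeds:
        mask = generate(seed)
        if min_fg <= foreground_fraction(mask) <= max_fg:
            kept[seed] = mask
    return kept


# Try five seeds and list which ones produced a plausible mask.
candidates = sweep_seeds(fake_generate_mask, seeds=range(5))
print(sorted(candidates))
```

In practice you would replace `fake_generate_mask` with a call into the actual inference script, passing the seed along with your chosen negative labels and hyperparameters, and then inspect the surviving masks manually.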
Combining the two flowcharts above, is my understanding correct that the method is only tuned in the inference phase?
That is, during denoising at the inference stage, is only DINO-ViT (VIT_LOSS) used to compute the structure loss and the appearance loss?
So the OceanBag Dataset is actually only used for evaluation, not for training?
If I want the method to work better on other datasets, how should I fine-tune it?
Looking forward to your reply, thank you so much~