Tsingularity / dift

[NeurIPS'23] Emergent Correspondence from Image Diffusion
https://diffusionfeatures.github.io
MIT License

Batch Inference #14

Closed AmeenAli closed 11 months ago

AmeenAli commented 1 year ago

Thanks for sharing the code!

A question regarding the demo: does the code support batch inference?

It's written that the input should be a single image tensor and a single text sequence:

```
Args:
    img_tensor: should be a single torch tensor in the shape of [1, C, H, W] or [C, H, W]
    prompt: the prompt to use, a string
    t: the time step to use, should be an int in the range of [0, 1000]
    up_ft_index: which upsampling block of the U-Net to extract features from, you can choose [0, 1, 2, 3]
    ensemble_size: the number of repeated images used in the batch to extract features
Return:
```

so I was wondering how to do batch inference

Thanks

Tsingularity commented 1 year ago

Hi, thanks for your interest in our work!

Currently the codebase doesn't support batch inference, i.e., getting multiple images' feature maps within one forward pass. We didn't implement that because we want to do ensemble inference for each single image to get a more stable feature map, which already takes most of the GPU memory.

But batch inference for multiple images is possible and should be quite straightforward to implement. Our ensemble forward is basically a batch forward, just with the same image repeated in every slot.
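To make the ensemble-vs-batch relationship concrete, here is a minimal sketch. The `dummy_unet_features` function is a hypothetical stand-in for the real U-Net feature extraction (it is not DIFT's actual code); the point is only the batching pattern: ensemble mode repeats one image along the batch dimension and averages, while batch mode puts a different image in each slot and skips the averaging.

```python
import torch

# Hypothetical stand-in for the U-Net forward pass: any function mapping a
# batch [B, C, H, W] to feature maps [B, C', H', W'] works for this sketch.
def dummy_unet_features(x):
    return x.mean(dim=1, keepdim=True)  # toy "feature map"

def ensemble_features(img, ensemble_size=4):
    # DIFT-style ensemble: repeat ONE image along the batch dimension,
    # run a single batched forward, then average for a stabler feature map.
    batch = img.unsqueeze(0).repeat(ensemble_size, 1, 1, 1)  # [E, C, H, W]
    ft = dummy_unet_features(batch)                          # [E, 1, H, W]
    return ft.mean(dim=0, keepdim=True)                      # [1, 1, H, W]

def batch_features(imgs):
    # Batch inference: the same batched forward, but each batch slot now
    # holds a DIFFERENT image and no averaging happens afterwards.
    return dummy_unet_features(imgs)                         # [B, 1, H, W]

imgs = torch.randn(3, 3, 8, 8)        # three distinct toy images
single = ensemble_features(imgs[0])   # feature map for one image
batched = batch_features(imgs)        # feature maps for all three at once
```

In the real codebase the trade-off is GPU memory: batching B images effectively multiplies the ensemble batch by B, so you would likely reduce `ensemble_size` accordingly.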

Feel free to let me know if you have any more questions!

AmeenAli commented 1 year ago

Thanks for your response! I have implemented it and it seems to be working. One more question: where can I find the code for reproducing the experimental results? I can only see the demo code. @Tsingularity

Tsingularity commented 11 months ago

@AmeenAli We are still in the process of organizing the evaluation code, and I'll keep you updated when it's ready. In the meantime, here are a few things you can refer to for your own implementation:

  1. For semantic correspondence evaluations, we build on multiple previous works' codebases, including CATs and NeuralCongealing. For Spair, we made our own evaluation pipeline for the following reason: many existing methods take both images as input, so they have to evaluate image pairs one by one independently. This is inefficient in our case (especially since Spair has 11k test pairs) because DIFT only needs to extract each image's features once; so we first save all images' features, then run the semantic correspondence matching on the cached features.

  2. For geometric correspondence, we modify the code from CAPS.

  3. For temporal correspondence, we modify the code from DINO (for DAVIS) and VideoWalk (for JHMDB).
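The "save features once, then match" pipeline from point 1 can be sketched as follows. Names here are illustrative, not the repo's actual API: `extract_feature` stands in for one DIFT forward pass per image, and matching uses cosine-similarity nearest neighbors in feature space.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature extractor; in the real pipeline this would be one
# DIFT forward pass per image, cached to disk rather than kept in memory.
def extract_feature(img):
    return img.mean(dim=0, keepdim=True).unsqueeze(0)  # [1, C', H, W]

def match_points(ft_src, ft_tgt, pts):
    # pts: list of (x, y) coords in the source feature map. For each source
    # point, take its feature vector and find the most cosine-similar
    # location in the target feature map.
    _, C, H, W = ft_tgt.shape
    out = []
    for x, y in pts:
        q = ft_src[0, :, y, x]                          # [C]
        sim = F.cosine_similarity(
            q[None, :, None, None], ft_tgt, dim=1)      # [1, H, W]
        idx = sim.view(-1).argmax().item()
        out.append((idx % W, idx // W))                 # back to (x, y)
    return out

# Extract once per image, then evaluate every test pair from the cache,
# instead of re-running the model for each of the 11k pairs.
images = {name: torch.randn(3, 8, 8) for name in ["a", "b"]}
cache = {name: extract_feature(img) for name, img in images.items()}
matches = match_points(cache["a"], cache["b"], [(2, 3)])
```

Since each image appears in many test pairs, caching turns O(pairs) forward passes into O(images), which is where the efficiency gain comes from.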

Hope this helps!

Maram-Helmy commented 8 months ago

@AmeenAli

Hello, can you please share how you managed to pass a batch? Thank you so much in advance.