facebookresearch / ImageBind

ImageBind: One Embedding Space To Bind Them All

Using Depth Embeddings in NYUv2 Zero-Shot Classification #107

Open Leeinsu1 opened 9 months ago

Leeinsu1 commented 9 months ago

Thank you for your exceptional work and the code you've provided. I have a question regarding the use of depth embeddings for NYUv2 zero-shot classification. To convert depth to disparity, I am using a focal length of 518.857901 and a baseline of 0.075. However, the accuracy I am achieving is only 45%, about 10 points lower than the result reported in the paper.

Could you advise on any additional steps that might be necessary? Currently, I am converting depth to disparity, resizing, center cropping, and normalizing, using a mean of 0.0418 and a standard deviation of 0.0295 for the normalization. I also attempted to apply DepthNorm again after converting to disparity, but it did not improve the results.
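For reference, here is a minimal sketch of my current pipeline (my own code, not the official ImageBind depth transform, which is not in the public repo; the bicubic interpolation and the depth clamp are my guesses):

```python
import torch
from torchvision import transforms

# NYUv2 Kinect focal length (pixels) and the baseline I am assuming (meters).
FOCAL_LENGTH = 518.857901
BASELINE = 0.075

def depth_to_disparity(depth_m: torch.Tensor) -> torch.Tensor:
    """Convert a (1, H, W) metric depth map in meters to disparity."""
    return FOCAL_LENGTH * BASELINE / depth_m.clamp(min=1e-3)

# Resize + center crop + normalize, mirroring ImageBind's vision transform.
depth_transform = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.Normalize(mean=[0.0418], std=[0.0295]),
])

# depth = torch.from_numpy(nyu_depth).unsqueeze(0)  # (1, H, W), meters
# model_input = depth_transform(depth_to_disparity(depth))
```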

For the 10th class, I have tried both approaches: labeling it as 'others', and assigning it the class with the highest cosine similarity among the 18 specified in the paper.

Your guidance on this matter would be greatly appreciated. Thank you.

zhang-ziang commented 9 months ago

@Leeinsu1 I encountered a similar problem. Could you please share the code you used, so we can discuss? :)

jbrownkramer commented 8 months ago

I am trying to get embeddings for depth images, but I am also struggling, since I have to guess at the normalization process.

@Leeinsu1 Have you tried using a baseline of 75? If you look at the example disparity file from the Omnivore repo, you'll see that the average value is around 16, which suggests a disparity formula like 518.857901 * 75 / d, where d is depth in mm. I think you then want to apply DepthNorm before normalizing with mean 0.0418 and std 0.0295, since that matches the Omnivore pipeline (see the sketch after the link below).

That said, the mean of disparity followed by DepthNorm as defined above is probably about 10x larger than 0.0418, so I don't know where that value comes from.

https://github.com/facebookresearch/omnivore/blob/1d55abdc8dfc7bd5cbf69316841ab804d0acf1ca/inference_tutorial.ipynb#L560
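Concretely, the ordering I mean is the following. This is only a sketch: `depth_norm` is my reading of Omnivore's DepthNorm transform (clamp, then scale by a maximum value), and `max_depth=75.0` is taken from the tutorial linked above, so treat both as assumptions.

```python
import torch

def depth_norm(disparity: torch.Tensor, max_depth: float = 75.0) -> torch.Tensor:
    # My reading of Omnivore's DepthNorm: clamp negatives, cap at
    # max_depth, then scale into [0, 1].
    return disparity.clamp(min=0.0, max=max_depth) / max_depth

def preprocess(depth_mm: torch.Tensor) -> torch.Tensor:
    # Disparity with a baseline of 75 and depth in millimeters.
    disparity = 518.857901 * 75.0 / depth_mm.clamp(min=1.0)
    x = depth_norm(disparity)
    # Then the channel normalization with the stats quoted above.
    return (x - 0.0418) / 0.0295
```

With a mean disparity around 16 and max_depth = 75, the post-DepthNorm mean would be roughly 16/75 ≈ 0.21, far larger than 0.0418, which is why that figure puzzles me.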

StanLei52 commented 8 months ago

Hi there, I recommend checking out our project ViT-Lens. In our depth experiments, we obtained better performance than ImageBind on the same test data. Hope that helps.

jbrownkramer commented 8 months ago

@StanLei52 Oh, that looks great! I looked at your paper and code; it seems to follow the same data-normalization pipeline as Omnivore and ImageBind. One missing piece of information is the scale used in the conversion from depth to disparity: the ViT-Lens code starts from precomputed disparity maps, so that information is not present there.

Do you know if disparity is 518.857901 * 75 / depth, or 518.857901 * 0.075 / depth, or something else?
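For what it's worth, a quick unit check (my own arithmetic, assuming a typical indoor depth of 3 m):

```python
depth_m = 3.0                          # a typical NYUv2 indoor depth, meters
d1 = 518.857901 * 75 / depth_m         # ~12971
d2 = 518.857901 * 0.075 / depth_m      # ~12.97
# Only d2 lands near the ~16 mean in the Omnivore example file --
# though d1 gives the same ~12.97 if depth is in millimeters instead.
print(d1, d2)
```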