3dlg-hcvc / DuoduoCLIP


Question Regarding Feature Dimension Alignment Between ViT-bigG-14 and ViT-B-32 During Training #6

Closed · sumuru789 closed this 1 month ago

sumuru789 commented 1 month ago

Hello,

I am very interested in your project and have a question regarding the training process. I noticed that the text and image features used in your work were extracted by OpenShape using ViT-bigG-14 (laion2b_s39b_b160k), which has a feature dimension of 1280. However, for training you used ViT-B-32 (laion2B-s34B-b79K), where the multi-view feature dimension is 512. Could you explain how you aligned or matched these different feature dimensions during training?

Thank you once again for sharing such outstanding work!

hanhung commented 1 month ago

Hi,

For training DuoduoCLIP, we use ViT-B-32 (laion2B-s34B-b79K) to encode the text and image features. The image renders come from Zero123, plus some we rendered ourselves. The raw text descriptions for each shape come from OpenShape. These raw images and texts are processed with ViT-B-32 to obtain embeddings of size 512. ViT-bigG-14 (laion2b_s39b_b160k) embeddings are not used in the training of DuoduoCLIP.
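For reference, the extraction step roughly looks like the minimal sketch below using OpenCLIP's ViT-B-32 checkpoint (pretrained tag `laion2b_s34b_b79k`). This is not the exact code used in DuoduoCLIP; the image path and caption are placeholders.

```python
# Minimal sketch: 512-d image/text embeddings from OpenCLIP ViT-B-32.
# The render path and caption below are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("render_view_00.png")).unsqueeze(0)  # placeholder render
text = tokenizer(["a render of a wooden chair"])                   # placeholder caption

with torch.no_grad():
    image_feat = model.encode_image(image)  # shape (1, 512)
    text_feat = model.encode_text(text)     # shape (1, 512)
```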

There are some comparisons in the paper where we used ViT-bigG-14 (laion2b_s39b_b160k), for example in Figure 3, where we evaluate the zero-shot capabilities of that model. Our reimplemented OpenShape model in Table 2, indicated by OpenShape†, was also trained with ViT-bigG-14 embeddings to match the settings in their paper.
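For context, zero-shot evaluation follows the usual CLIP-style recipe of comparing embeddings against encoded class prompts. A generic sketch over precomputed embeddings is below; the tensors and helper are placeholders, not the paper's evaluation code.

```python
# Generic CLIP-style zero-shot classification over precomputed embeddings.
# `shape_embeddings` (N, D) and `text_embeddings` (C, D) are placeholders;
# with ViT-bigG-14 (laion2b_s39b_b160k), D would be 1280.
import torch
import torch.nn.functional as F

def zero_shot_classify(shape_embeddings: torch.Tensor,
                       text_embeddings: torch.Tensor) -> torch.Tensor:
    """Return, for each shape, the index of the most similar class prompt."""
    shape_embeddings = F.normalize(shape_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = shape_embeddings @ text_embeddings.T  # cosine similarities (N, C)
    return logits.argmax(dim=-1)

preds = zero_shot_classify(torch.randn(8, 1280), torch.randn(40, 1280))  # (8,)
```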

I hope that clears things up!

sumuru789 commented 1 month ago

Thank you so much for your detailed explanation; it was incredibly helpful and has clarified many of my questions.

I have one more question: since Zero123 provides renders from many different angles, how do you determine which specific image is used for extracting the image features? Do you select a particular angle, or is it chosen randomly?

Thank you once again for your assistance!

hanhung commented 1 month ago

Zero123 renders objects from random angles via spherical sampling of the cameras. This fits our purpose, since we want the model to be robust to any viewing angle. We have 12 rendered views for each object, each at a random angle, and during training the views fed to the model are selected randomly.

This makes our model robust to different viewing angles of the object during inference. Also note that the poses of the images are not used at all during training, which makes the model pose-free as well.
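Roughly, the per-object view selection looks like the sketch below. The tensor shapes and the number of sampled views are placeholders, not the exact training configuration.

```python
# Sketch of per-object random view selection during training.
# Shapes and `n_sample` are placeholder values for illustration.
import torch

def sample_views(views: torch.Tensor, n_sample: int) -> torch.Tensor:
    """Randomly pick `n_sample` of an object's rendered views (poses unused)."""
    idx = torch.randperm(views.shape[0])[:n_sample]
    return views[idx]

views = torch.randn(12, 3, 224, 224)        # 12 renders of one object at random angles
selected = sample_views(views, n_sample=4)  # (4, 3, 224, 224)
```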

sumuru789 commented 1 month ago

Thank you so much for your helpful response! I have one final question: why did you decide to use ViT-B/32 instead of ViT-G/14? I previously tried ViT-G/14, but it consumed a significant amount of GPU memory. Did you encounter similar issues, and did that influence your choice of ViT-B/32?

Thank you again!

hanhung commented 1 month ago

Yes, it was mainly due to our GPU memory limitations. While we didn't try ViT-G/14 specifically, we did try ViT-L/14, and it was already too large for our four A40s. We also tried ViT-B/16, which gave slightly better results, but not by much, and it used more memory.
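For anyone weighing the same trade-off, a quick sketch for checking the embedding widths of these backbones is below; it assumes open_clip exposes `get_model_config` and downloads no weights.

```python
# Compare embedding widths of candidate OpenCLIP backbones (no weights loaded).
import open_clip

for name in ["ViT-B-32", "ViT-B-16", "ViT-L-14", "ViT-bigG-14"]:
    cfg = open_clip.get_model_config(name)
    print(f"{name}: embed_dim = {cfg['embed_dim']}")
# Expected widths: 512, 512, 768, 1280
```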