3dlg-hcvc / DuoduoCLIP


Is it necessary to download the images from Zero123? I notice that the dataset on huggingface already includes training data. If I skip the image preprocessing and directly download the data from huggingface, do I still need to download the rendered images provided by Zero123? #7

Open RongkunYang opened 1 week ago

hanhung commented 1 week ago

Hi,

The images provided on huggingface only contain rendered images for the Objaverse-LVIS split. So if you only plan on running evaluation, that is enough. However, if you want to do training, then downloading the full Zero123 images is needed.

Hope that helps!

RongkunYang commented 1 week ago

OK, thank you very much. I would like to use the model to test CAD retrieval; the embeddings have been saved on huggingface, right?

hanhung commented 1 week ago

Yes, if you only want to retrieve objects based on text or images, the embeddings are available on huggingface. You can see text_retrieval.py for an idea on how to do this.
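Roughly, the search boils down to encoding the text query and doing a cosine-similarity lookup against the saved shape embeddings. A minimal sketch (not the exact code in text_retrieval.py; the h5 key names and the use of the ViT-B-32 / laion2b_s34b_b79k text encoder here are assumptions):

```python
# Minimal text-to-shape retrieval sketch; key names inside the h5 are assumed.
import h5py
import numpy as np
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Encode the text query and L2-normalize it.
with torch.no_grad():
    text_emb = model.encode_text(tokenizer(["an office chair"]))
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1).numpy()

# Load the precomputed shape embeddings (dataset key names are assumptions).
with h5py.File("data/objaverse_embeddings/Four_1to6F_bs1600_LT6/shape_emb_objaverse.h5", "r") as f:
    shape_emb = f["shape_embeddings"][:]
    model_ids = [m.decode() for m in f["model_ids"][:]]

shape_emb = shape_emb / np.linalg.norm(shape_emb, axis=1, keepdims=True)
scores = shape_emb @ text_emb[0]          # cosine similarity against every shape
for i in np.argsort(-scores)[:5]:         # top-5 retrieved models
    print(model_ids[i], float(scores[i]))
```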

RongkunYang commented 1 week ago

OK, I have tried the text retrieval and it performs well. However, I notice that text_retrieval.py only works on the Objaverse dataset, and other datasets such as 3D Future and ABO are not taken into account. Could you share the shape_embed files for the other datasets, or explain how we can generate the shape_embed files ourselves?

RongkunYang commented 1 week ago

And I have another question: can the DuoduoCLIP model retrieve 3D models from an image query? How could I implement this?

hanhung commented 6 days ago

For image-to-shape retrieval you can follow the Example in the readme to generate embeddings for single or multi-view images and use those as the query. Then you can plug that query into the text_retrieval.py script as the query embedding and search over the objaverse embeddings.
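The search step itself is the same dot product as in text_retrieval.py. A minimal sketch, assuming you already have an L2-normalized query embedding from the README Example and that the h5 key names below match the file (they may differ):

```python
# Image-to-shape search sketch; `query_emb` comes from encoding the query view(s)
# with DuoduoCLIP as shown in the README Example. Key names in the h5 are assumed.
import h5py
import numpy as np

def retrieve(query_emb,
             emb_path="data/objaverse_embeddings/Four_1to6F_bs1600_LT6/shape_emb_objaverse.h5",
             k=5):
    with h5py.File(emb_path, "r") as f:
        shape_emb = f["shape_embeddings"][:]                 # assumed key name
        model_ids = [m.decode() for m in f["model_ids"][:]]  # assumed key name
    shape_emb = shape_emb / np.linalg.norm(shape_emb, axis=1, keepdims=True)
    scores = shape_emb @ query_emb                           # cosine similarity
    return [(model_ids[i], float(scores[i])) for i in np.argsort(-scores)[:k]]
```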

I'm currently busy with a few deadlines, but I'll also include the 3D Future, ABO, and ShapeNet embeddings in a future release for the retrieval part. In the meantime, the renderings for 3D Future, ABO, and ShapeNet are already in the huggingface repo, so it is possible to just take those and get the embeddings with the Example.

RongkunYang commented 1 day ago

OK, got it. The rendered images of 3D Future, ABO, and ShapeNet are stored in supplement_mv_images.h5, right?

RongkunYang commented 23 hours ago

Also, I downloaded the data at "dataset/data/ViT-B-32_laion2b_s34b_b79k/image_embeddings.h5" and "dataset/data/ViT-B-32_laion2b_s34b_b79k/text_embeddings.npy". May I ask how the image_embeddings and text_embeddings can be used to construct the shape embeddings? In other words, text_retrieval.py uses "data/objaverse_embeddings/Four_1to6F_bs1600_LT6/shape_emb_objaverse.h5" as the embedding database; how can we obtain that shape embedding from the image_embeddings and text_embeddings above?

hanhung commented 18 hours ago

You should be able to obtain the shape embeddings with just the supplement_mv_images.h5 and supplement_model_to_idx.json (model identifiers to h5 indices) files. The Example shows how to encode the raw multi-view images into a shape embedding.
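A minimal sketch of that pipeline, assuming supplement_model_to_idx.json maps a model id to a row index in an "images" dataset inside supplement_mv_images.h5 (the real key names may differ), and using a placeholder encode_multiview function to stand in for the encoding step shown in the Example:

```python
# Sketch of turning the supplement renderings into a shape embedding.
# "images" and the json mapping layout are assumptions; encode_multiview is a
# hypothetical stand-in for the DuoduoCLIP encoding step from the README Example.
import json
import h5py
import numpy as np

with open("supplement_model_to_idx.json") as f:
    model_to_idx = json.load(f)

def shape_embedding(model_id, encode_multiview):
    with h5py.File("supplement_mv_images.h5", "r") as f:
        views = f["images"][model_to_idx[model_id]]  # assumed: (num_views, H, W, 3) uint8
    emb = encode_multiview(views)                    # multi-view images -> one shape embedding
    return emb / np.linalg.norm(emb)                 # normalize for cosine retrieval
```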

The text and image embeddings under dataset/data/ViT-B-32_laion2b_s34b_b79k are not shape embeddings. They are embeddings obtained with the pretrained CLIP model ViT-B-32_laion2b_s34b_b79k, which is used as the teacher model when training DuoduoCLIP. The image embeddings there are produced separately by that pretrained CLIP model, not by the DuoduoCLIP model.
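For reference, those teacher embeddings are presumably produced by running each rendered view and each caption through the frozen open_clip model, roughly like this sketch (the image path and caption are illustrative):

```python
# Sketch of how the teacher embeddings under dataset/data/ViT-B-32_laion2b_s34b_b79k
# could be produced with the frozen open_clip ViT-B-32 / laion2b_s34b_b79k model.
# These are per-view CLIP embeddings used as training targets, not DuoduoCLIP shape embeddings.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

with torch.no_grad():
    img_emb = model.encode_image(preprocess(Image.open("view_000.png")).unsqueeze(0))
    txt_emb = model.encode_text(tokenizer(["a 3d model of a chair"]))
    img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
    txt_emb = torch.nn.functional.normalize(txt_emb, dim=-1)
```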

RongkunYang commented 13 hours ago

OK, thank you very much. So the shape embedding is encoded by the DuoduoCLIP image encoder.