baaivision / tokenize-anything

[ECCV 2024] Tokenize Anything via Prompting
Apache License 2.0

question about Model D training #11

Closed: jetyingjia closed this 2 months ago

jetyingjia commented 5 months ago

Awesome work, congratulations! I have some questions about the Model D training.

1. In this model, you pre-train with [Mask, Concept]. Does "concept" mean the text embeddings (2560 categories)? If so, how do you obtain concepts for 1B masks?
2. The paper mentions 2.25TB of image embeddings. How is this data used?

PhyscalX commented 5 months ago

Hi, @jetyingjia

  1. Each mask has a pre-computed image embedding, which is used to encode the logit target via `encode_tgt(...)`.
  2. The 2.25TB image embedding database contains 1B embeddings for the 1B masks, used in step 1.

BTW, it would take about 60 days to compute 1B EVA-CLIP-E embeddings using 8 NVIDIA A100s 😅.
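To make the scale concrete, here is a minimal sketch of such an embedding database: one pre-computed fp16 image embedding per mask, memory-mapped so training can look up a batch of mask ids without loading 2.25TB into RAM. All names (`build_db`, `lookup`, the path, and the embedding dimension) are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

EMBED_DIM = 1024  # assumed dimension for illustration; EVA-CLIP-E is larger

def build_db(path, num_masks, dim=EMBED_DIM):
    """Create a memory-mapped fp16 embedding database (one row per mask)."""
    return np.memmap(path, dtype=np.float16, mode="w+", shape=(num_masks, dim))

def lookup(db, mask_ids):
    """Fetch pre-computed embeddings for a batch of mask ids as fp32."""
    return np.asarray(db[mask_ids], dtype=np.float32)

if __name__ == "__main__":
    # Tiny toy database: 100 masks instead of 1B.
    db = build_db("/tmp/mask_embeds.f16", num_masks=100)
    db[:] = np.random.randn(100, EMBED_DIM).astype(np.float16)
    batch = lookup(db, [3, 7, 42])
    print(batch.shape)  # (3, 1024)
```

At 1B masks, fp16 rows of this size come to roughly 2TB, which is consistent with the ~2.25TB figure quoted above.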

jetyingjia commented 5 months ago

Hi, @PhyscalX

1. Does this mean Model D's classification branch target is the concept distribution (the image embedding projected to 2560-dimensional distribution logits), rather than region pseudo-labels (which many papers use, e.g. OWL)?
2. Is the idea of learning a concept distribution used in other papers you would recommend?

Thank you!

PhyscalX commented 5 months ago

  1. Yes. We clarified in Sec. 3.1 that we use a KL divergence loss.
  2. This method is used by many CLIP-based distillation papers (e.g. RegionCLIP, a modified Faster R-CNN for open-vocabulary classification). However, it is challenging to integrate this method into SAM with 1B masks.
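The distillation target described above can be sketched as follows: the soft target is the distribution obtained by matching a mask's pre-computed CLIP image embedding against the 2560 concept text embeddings, and the classification branch is trained to match it. This is a minimal NumPy sketch under assumed shapes and names (`concept_kl_loss`, `tau`, etc. are illustrative, not the repo's actual code).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concept_kl_loss(student_logits, clip_embed, text_embeds, tau=1.0):
    """Distillation loss against a soft concept distribution.

    student_logits: (B, 2560) predictions from the classification branch.
    clip_embed:     (B, D) pre-computed CLIP image embedding per mask.
    text_embeds:    (2560, D) concept text embeddings.
    """
    # Cosine similarity between each mask embedding and every concept.
    clip_embed = clip_embed / np.linalg.norm(clip_embed, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    sim = clip_embed @ text_embeds.T                  # (B, 2560)
    target = softmax(sim / tau)                       # soft concept distribution
    log_pred = np.log(softmax(student_logits / tau) + 1e-9)
    # Cross-entropy term of KL(target || pred); the target-entropy term
    # is constant w.r.t. the student, so gradients are identical.
    return float(-(target * log_pred).sum(axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, D, C = 4, 64, 2560
    loss = concept_kl_loss(rng.normal(size=(B, C)),
                           rng.normal(size=(B, D)),
                           rng.normal(size=(C, D)))
    print(loss)
```

Note this never needs per-mask category labels: the 2560-way target is derived entirely from the pre-computed image embedding, which is what makes it applicable to 1B unlabeled masks.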
jetyingjia commented 5 months ago

@PhyscalX Good idea. Do you plan to release the full project (including training code)? I would like to fine-tune this model on my own datasets.

PhyscalX commented 5 months ago

Refer to issue #5: currently, we have no plan to release the full training code. Instead, we have released the visual prompter and the losses for pre-training and fine-tuning.