ZiqinZhou66 / ZegCLIP

Official implementation of CVPR2023 ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
MIT License

Specification request about the definition of the inductive setting #17

Open cuttle-fish-my opened 11 months ago

cuttle-fish-my commented 11 months ago

Thanks for your marvelous work on open-vocabulary segmentation; I'm very interested in this project. However, I am confused about the setting of inductive open-vocabulary segmentation.

In "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", "inductive" is defined as "training images do not contain any unseen objects even if they are unannotated", meaning both the pixels and the text of unseen objects are forbidden during training. In this work, however, "inductive" means "the names of unseen classes in inference are unavailable while training", and I cannot find the corresponding code that filters out the unseen pixels, so I am a little confused.

To summarize my question with an example: if "Human" is defined as a seen class and "Dog" is an unseen class, can an image containing a man and a dog be used for training? Thanks in advance, and I hope for your reply!
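To make the question concrete, here is a minimal sketch (hypothetical; not taken from the ZegCLIP codebase) of the kind of pixel filtering I could not find in the code, where pixels annotated with unseen classes are remapped to an ignore index so they contribute nothing to the training loss:

```python
import numpy as np

IGNORE_INDEX = 255  # common mmseg/PyTorch convention for pixels excluded from the loss


def mask_unseen_pixels(label_map, unseen_class_ids):
    """Return a copy of the label map with unseen-class pixels set to IGNORE_INDEX.

    This is a hypothetical illustration of what "sieving off unseen pixels"
    could look like, not code from this repository.
    """
    masked = label_map.copy()
    masked[np.isin(masked, unseen_class_ids)] = IGNORE_INDEX
    return masked


# Toy example: class 0 = "Human" (seen), class 1 = "Dog" (unseen)
labels = np.array([[0, 0, 1],
                   [1, 0, 0]])
print(mask_unseen_pixels(labels, unseen_class_ids=[1]))
# The "Dog" pixels become 255 and would be ignored by a loss with ignore_index=255
```

Is something equivalent to this applied during training, or are such images dropped entirely?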