AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Difficulty Extracting Features from YOLO-WORLD Model #86

Open n9s8a opened 7 months ago

n9s8a commented 7 months ago

Hi Team,

I'm encountering challenges while attempting to extract features from the YOLO-WORLD model for downstream tasks. Despite efforts to disable object detection or modify the model architecture, I'm unable to isolate features due to the model's complexity and dependencies.

Seeking guidance or suggestions on how to effectively extract features from the YOLO-WORLD model.

Thank you.

wondervictor commented 7 months ago

Hi @n9s8a, thanks for your interest in YOLO-World. Could you provide more details about which features you need to extract? You can try outputting the features in yolo_world/models/detectors/yolo_world.py:


@MODELS.register_module()
class YOLOWorldDetector(YOLODetector):
    """Implementation of YOLOW Series"""
    def __init__(self,
                 *args,
                 mm_neck: bool = False,
                 num_train_classes=80,
                 num_test_classes=80,
                 **kwargs) -> None:
        self.mm_neck = mm_neck
        self.num_train_classes = num_train_classes
        self.num_test_classes = num_test_classes
        super().__init__(*args, **kwargs)

    def loss(self, batch_inputs: Tensor,
             batch_data_samples: SampleList) -> Union[dict, list]:
        """Calculate losses from a batch of inputs and data samples."""
        self.bbox_head.num_classes = self.num_train_classes
        img_feats, txt_feats = self.extract_feat(batch_inputs,
                                                 batch_data_samples)
        losses = self.bbox_head.loss(img_feats, txt_feats, batch_data_samples)
        return losses

    def predict(self,
                batch_inputs: Tensor,
                batch_data_samples: SampleList,
                rescale: bool = True) -> SampleList:
        """Predict results from a batch of inputs and data samples with post-
        processing.
        """

        img_feats, txt_feats = self.extract_feat(batch_inputs,
                                                 batch_data_samples)
        # OUTPUT IMAGE FEATURES
        # self.bbox_head.num_classes = self.num_test_classes
        self.bbox_head.num_classes = txt_feats[0].shape[0]
        results_list = self.bbox_head.predict(img_feats,
                                              txt_feats,
                                              batch_data_samples,
                                              rescale=rescale)
        # OUTPUT OBJECT FEATURES
        batch_data_samples = self.add_pred_to_datasample(
            batch_data_samples, results_list)
        return batch_data_samples

n9s8a commented 7 months ago

Hi @wondervictor, Thanks for your response.

I'm currently working on an object retrieval task where I need to extract features from intermediate layers of a pre-trained model. Specifically, I need features from the middle layers of the backbone. However, the model's complex architecture makes this difficult.

Problems:

Obtaining Model Summary: There is no single file or piece of documentation that comprehensively describes the model's architecture. Without this information, I'm unable to identify the names or indices of the intermediate layers required for feature extraction.

Loading Pre-trained Model: I attempted to load the pre-trained model using model.load_state_dict("model_checkpoint.pth"), but this raises an error because my model definition is not correct. As a result, I'm unable to obtain a summary of the model, which is crucial for identifying the intermediate layers.
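Note on the loading step: in PyTorch, load_state_dict expects a mapping of parameter names to tensors, not a file path, so the checkpoint file has to be read with torch.load first. The sketch below shows that pattern on a toy model. The names build_toy_model and load_weights are hypothetical, and the nested "state_dict" key is an assumption about how the checkpoint was saved; the real detector would presumably have to be built from its config before its weights can be loaded.

```python
import os
import tempfile

import torch
from torch import nn


def build_toy_model() -> nn.Module:
    # Hypothetical stand-in for the real detector; it only keeps the sketch runnable.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=3, padding=1),
    )


def load_weights(model: nn.Module, checkpoint_path: str) -> nn.Module:
    # torch.load reads the file from disk; load_state_dict only accepts
    # the resulting mapping, never the path itself.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Assumption: the weights may be nested under a "state_dict" key.
    # If they are not, the loaded object is used directly.
    state_dict = checkpoint.get("state_dict", checkpoint)
    model.load_state_dict(state_dict)
    return model


# Round trip: save a checkpoint from one model, restore it into a fresh one.
source = build_toy_model()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_checkpoint.pth")
    torch.save({"state_dict": source.state_dict()}, path)
    restored = load_weights(build_toy_model(), path)
```

If the architecture passed to load_weights does not match the checkpoint, load_state_dict raises an error listing the missing and unexpected keys, which is usually the quickest way to see where the model definition differs.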

Request for Assistance: I need guidance on how to:

Obtain a comprehensive overview of the model architecture to identify the intermediate layers.

Extract features from these intermediate layers for use in my object retrieval task.

Additional Information:

The model's complex architecture complicates manual inspection. Accessing the intermediate layers directly after loading the model would greatly facilitate feature extraction.

Any assistance or suggestions on how to approach this issue would be greatly appreciated. Thank you!
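
One generic PyTorch approach to both requests, independent of this repository, is to list the submodule names with named_modules and then attach forward hooks to the layers of interest. The sketch below uses a toy backbone. ToyBackbone, list_layers, and capture_features are hypothetical names and are not part of YOLO-World; the layer names in the real model will differ and would have to be read from its own named_modules output.

```python
from typing import Dict, List

import torch
from torch import Tensor, nn


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for a detector backbone with several stages."""

    def __init__(self) -> None:
        super().__init__()
        self.stem = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.stage1 = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

    def forward(self, x: Tensor) -> Tensor:
        return self.stage2(self.stage1(self.stem(x)))


def list_layers(model: nn.Module) -> List[str]:
    # named_modules yields (name, module) pairs; the root module has an empty name.
    return [name for name, _ in model.named_modules() if name]


def capture_features(model: nn.Module, layer_names: List[str],
                     inputs: Tensor) -> Dict[str, Tensor]:
    features: Dict[str, Tensor] = {}
    modules = dict(model.named_modules())
    handles = []
    for name in layer_names:
        # Bind the layer name as a default argument so each hook keeps its own key.
        def hook(_module, _inputs, output, name=name):
            features[name] = output.detach()
        handles.append(modules[name].register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Remove the hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return features


model = ToyBackbone().eval()
layer_names = list_layers(model)
feats = capture_features(model, ["stage1"], torch.randn(1, 3, 64, 64))
```

Printing the model with print(model) gives a nested summary of the same modules, which can help when choosing which names to hook.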