Feature extraction with transformers (DINO) #483

Closed blazejdolicki closed 2 years ago

blazejdolicki commented 2 years ago

  1. full code you wrote or full changes you made (git diff)
    <put code or diff here>

    I didn't write any code.

  2. what exact command you run: I'm running this command:

    python3 tools/run_distributed_engines.py \
    hydra.verbose=true \
    config.DATA.TRAIN.DATA_SOURCES=[synthetic] \
    config.DATA.TRAIN.LABEL_SOURCES=[synthetic] \
    config.DATA.TEST.DATA_SOURCES=[synthetic] \
    config.DATA.TEST.LABEL_SOURCES=[synthetic] \

    $CONFIG_PATH leads to a .yaml config file with the following content:

    # @package _global_
    TEST_MODEL: True
      MMAP_MODE: False
      - name: Resize
        size: 256
      - name: CenterCrop
        size: 224
      - name: ToTensor
      - mean:
        - 0.485
        - 0.456
        - 0.406
        name: Normalize
        std:
        - 0.229
        - 0.224
        - 0.225
      MMAP_MODE: False
      - imagenet1k_folder
      - disk_folder
      - name: Resize
        size: 256
      - name: CenterCrop
        size: 224
      - name: ToTensor
      - mean:
        - 0.485
        - 0.456
        - 0.406
        name: Normalize
        std:
        - 0.229
        - 0.224
        - 0.225
    BACKEND: nccl
    INIT_METHOD: tcp
    NCCL_DEBUG: true
    NUM_NODES: 1
    RUN_ID: auto
    DEVICE: gpu
      EVAL_MODE_ON: true
      NAME: vision_transformer
        CLASSIFIER: token
        DROPOUT_RATE: 0
        DROP_PATH_RATE: 0.1
        HIDDEN_DIM: 384
        IMAGE_SIZE: 224
        MLP_DIM: 1532
        NUM_HEADS: 6
        NUM_LAYERS: 12
        PATCH_SIZE: 16
        QKV_BIAS: true
      PARAM_FILE: {}
    engine_name: extract_features

    while $MODEL_WEIGHTS refers to a .torch file with a DINO model pretrained with VISSL.

  3. full logs you observed:
## Expected behavior:

I would expect the script to extract features from the trunk output of the pretrained model, something like this: https://vissl.readthedocs.io/en/v0.1.6/evaluations/feature_extraction.html#extract-features-of-the-trunk-output but instead I'm getting an AssertionError.

## Additional information aka "what I found out so far"
From the assertion error it's clear that the script expects the `features` variable to be a list. As far as I understand, if we perform [feature extraction on multiple layers of the trunk](https://vissl.readthedocs.io/en/v0.1.6/evaluations/feature_extraction.html#extract-features-from-several-layers-of-the-trunk), the output is a list of tensors where every tensor corresponds to one specified layer. I want to get features from a single layer instead of multiple layers, so `features` should be a list containing one tensor. Instead, during debugging, I found out that `features` in my case is a Tensor of shape (1, 64, 384), where 64 is the batch size and 384 is the hidden size of the feature vectors.

After some digging, the first thing that seems off is the following. During initialization of `BaseSSLMultiInputOutputModel`, the method [_get_trunk()](https://github.com/facebookresearch/vissl/blob/484cdecd1a71cb457d8ea74942603b907a23d39d/vissl/models/base_ssl_model.py#L240) is called. This method consists of an if statement:

    if is_feature_extractor_model(self.model_config):
        self.eval_mode = True
        return FeatureExtractorModel(self.model_config)
    else:
        self.eval_mode = False
        trunk_name = self.model_config.TRUNK.NAME
        return get_model_trunk(trunk_name)(self.model_config, trunk_name)

The condition calls the method [is_feature_extractor_model](https://github.com/facebookresearch/vissl/blob/aa3f7cc33b3b7806e15593083aedc383d85e4a53/vissl/models/model_helpers.py#L52), which looks like so:


The important part is the last line. Following the [docs](https://vissl.readthedocs.io/en/v0.1.6/evaluations/feature_extraction.html#extract-features-of-the-trunk-output), I do not specify `FEATURE_EVAL_SETTINGS.LINEAR_EVAL_FEAT_POOL_OPS_MAP` in my config file, so it defaults to an empty list. Therefore, `len(model_config.FEATURE_EVAL_SETTINGS.LINEAR_EVAL_FEAT_POOL_OPS_MAP) > 0` is False, and so is `is_feature_extractor_model()`. As a result, `_get_trunk()` returns `get_model_trunk()` instead of `FeatureExtractorModel()`. The former returns a model whose forward pass returns a tensor (while `FeatureExtractorModel` returns a list of tensors), which leads to the AssertionError.

So all this would indicate that, in contrast to the documentation, we always need to provide the `model_config.FEATURE_EVAL_SETTINGS.LINEAR_EVAL_FEAT_POOL_OPS_MAP` argument; therefore, afaik, this part of the docs needs to be updated. I can see an example of how to do it for the trunk only for the ResNet architecture [here](https://github.com/facebookresearch/vissl/blob/main/configs/config/feature_extraction/trunk_only/rn50_res5.yaml) (by using `Identity`), but I'm not sure which layer to specify for a vision transformer. Could you help me out with that?
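
For readers following along, the length check described above can be sketched in isolation. This is a simplified, hypothetical stand-in and not VISSL's implementation: the function name `uses_feature_extractor` and the plain-dict config are assumptions made for illustration, and any other conditions in the real function are left out. Only the final `len(...) > 0` check is taken from the issue text.

```python
# Hypothetical, simplified stand-in for the check discussed above.
# NOT VISSL's code: only the final length check comes from the issue text;
# any other conditions of the real function are omitted here.

def uses_feature_extractor(feature_eval_settings: dict) -> bool:
    """Return True when at least one (layer, pooling op) entry is configured."""
    pool_ops_map = feature_eval_settings.get("LINEAR_EVAL_FEAT_POOL_OPS_MAP", [])
    return len(pool_ops_map) > 0


# Settings with no layers listed, so the option falls back to an empty list.
no_layers = {"LINEAR_EVAL_FEAT_POOL_OPS_MAP": []}

# Settings with one entry, written in the same [layer, [op, args]] form
# as the linked ResNet config (which uses Identity).
one_layer = {"LINEAR_EVAL_FEAT_POOL_OPS_MAP": [["res5avg", ["Identity", []]]]}

print(uses_feature_extractor(no_layers))  # False
print(uses_feature_extractor(one_layer))  # True
```

Under this reading, an empty map means the plain trunk is returned, and its forward pass yields a tensor rather than the list that the extraction script expects.
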
iseessel commented 2 years ago

Hey @blazejdolicki, have you looked at this tutorial and cross-referenced your config with it? `LINEAR_EVAL_FEAT_POOL_OPS_MAP` should be a list of the features that you want to extract. That said, I'm not sure these options are supported for ViT.

One workaround you could try is extracting the head features (https://vissl.ai/tutorials/Feature_Extraction_V0_1_6#Extract-the-Output-of-the-Model-Head) and either 1) creating an identity head or 2) not setting the head in the model. You should also be able to hack around https://github.com/facebookresearch/vissl/blob/main/vissl/models/trunks/feature_extractor.py to save what you need. I'd recommend taking a debugger and walking through this code.

blazejdolicki commented 2 years ago

Thanks for your reply. Here's what I did. In this config file the Identity module is used to return the input without changing it.

        ["conv1", ["AvgPool2d", [[10, 10], 10, 4]]],
        ["res2", ["AvgPool2d", [[16, 16], 8, 0]]],
        ["res3", ["AvgPool2d", [[13, 13], 5, 0]]],
        ["res4", ["AvgPool2d", [[8, 8], 3, 0]]],
        ["res5", ["AvgPool2d", [[6, 6], 1, 0]]],
        ["res5avg", ["Identity", []]],

I suppose "res5avg" is the last layer in the CNN trunk. So I tried to replicate it for transformers, where the last layer in the trunk is `norm`, by adding the following to my config:

        ["norm", ["Identity", []]],

Do you think this solution will return correct features? Running with this config does not lead to any errors, the number of returned features corresponds to the number of images in the supplied dataset, and the shape of the features is correct. But I'm still thinking about how I can verify that the values of the features are correct. So far, the only way I have come up with to confirm that is to load the model in plain PyTorch and see if its returned features match those returned by VISSL. Is there a better way to do it?
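
A minimal sketch of the comparison step in that plan, assuming both sets of features have already been loaded as arrays of shape (num_images, feature_dim) in the same image order. The helper name `compare_features` and the tolerance values are assumptions chosen for illustration, and the arrays below are synthetic stand-ins rather than real VISSL or PyTorch outputs.

```python
import numpy as np


def compare_features(vissl_feats, reference_feats, rtol=1e-4, atol=1e-5):
    """Compare two (num_images, feature_dim) arrays element by element.

    Returns (all_close, max_abs_diff). Tolerances are illustrative guesses.
    """
    vissl_feats = np.asarray(vissl_feats, dtype=np.float64)
    reference_feats = np.asarray(reference_feats, dtype=np.float64)
    if vissl_feats.shape != reference_feats.shape:
        raise ValueError(
            f"shape mismatch: {vissl_feats.shape} vs {reference_feats.shape}"
        )
    max_abs_diff = float(np.max(np.abs(vissl_feats - reference_feats)))
    all_close = bool(np.allclose(vissl_feats, reference_feats, rtol=rtol, atol=atol))
    return all_close, max_abs_diff


# Synthetic stand-ins: 64 images with 384-dimensional features.
rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 384))
nearly_same = reference + 1e-7   # simulates tiny numerical noise
wrong_order = reference[::-1]    # same values, but rows in reversed order

ok_same, diff_same = compare_features(nearly_same, reference)
ok_order, diff_order = compare_features(wrong_order, reference)
print(ok_same, ok_order)
```

The reversed-order case is included because a mismatch in image ordering between the two pipelines would look like wrong feature values even when the model output itself is identical.
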

iseessel commented 2 years ago

That sounds like a good plan to me!

You could also step through the VISSL code with a debugger to make sure it's returning the right thing. I will try to quickly validate this later as well.

If you wanted to contribute a config in a PR with these options so we have this use-case documented that would be amazing!

blazejdolicki commented 2 years ago

Hi @iseessel, took me some time but I added a pull request with my config. Just signed the CLA, so that should be updated soon. Thanks for all the help!