Using STAN for Action Recognition

Hi, using the model for retrieval metrics works. However, your paper mentions using STAN for action recognition, which I am interested in for a study, but it does not describe the pipeline changes required. So far, I have set the 'task' parameter to 'recognition' and the val/test evaluator to 'AccMetric'. My model config is now:

model = dict(
    type='CLIPSimilarity_split', 
    visual_encoder=dict(type='VITCLIPPretrained_STAN', pretrained_model=pretrained_model),
    text_encoder=dict(type='CLIPTextPretrained', pretrained_model=pretrained_model),
    to_float32=True,
    frozen_layers=False,
    task='recognition',
    data_preprocessor=dict(
        type='MultiModalDataPreprocessor',
        preprocessors=dict(
            imgs=dict(
                type='ActionDataPreprocessor',
                mean=[122.771, 116.746, 104.093],
                std=[68.500, 66.632, 70.323],
                format_shape='NCHW'),
            text=dict(type='ActionDataPreprocessor', to_float32=False))),
    tau=0.01,
    adapter=None)

I also changed the evaluator from

val_evaluator = dict(type='RetrievalMetric')

to

val_evaluator = dict(type='AccMetric')

However, when I run training, I get:

Traceback (most recent call last):
  File "/home/fransh/STAN/STAN/tools/train.py", line 160, in <module>
    main()
  File "/home/fransh/STAN/STAN/tools/train.py", line 156, in main
    runner.train()
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
    optim_wrapper.update_params(parsed_loss)
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 196, in update_params
    self.backward(loss)
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/mmengine/optim/optimizer/amp_optimizer_wrapper.py", line 125, in backward
    self.loss_scaler.scale(loss).backward(**kwargs)
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 193, in scale
    return apply_scale(outputs)
  File "/home/fransh/miniconda3/envs/STAN/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 191, in apply_scale
    raise ValueError("outputs must be a Tensor or an iterable of Tensors")
ValueError: outputs must be a Tensor or an iterable of Tensors
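
One way this error could arise, as far as I can tell: the traceback fails in the backward pass of the training step, so the aggregated loss is apparently not a tensor at that point. Below is a minimal sketch of that idea in plain Python, with floats standing in for tensors. It assumes the runner adds up only the output entries whose key contains 'loss'; `parse_losses_sketch` and the dictionary keys are hypothetical names for illustration, and I have not verified what CLIPSimilarity_split actually returns for task='recognition'.

```python
def parse_losses_sketch(losses):
    """Hypothetical stand-in for the runner's loss aggregation.

    Assumption: only entries whose key contains 'loss' are summed.
    """
    return sum(value for key, value in losses.items() if 'loss' in key)


# Floats stand in for tensors; the key names are made up for illustration.
retrieval_out = {'loss_sim': 0.7, 'acc': 0.5}
recognition_out = {'logits': 1.3}  # hypothetical output with no loss entry

total_ok = parse_losses_sketch(retrieval_out)
total_bad = parse_losses_sketch(recognition_out)

print(type(total_ok).__name__)   # type of the summed loss entry
print(type(total_bad).__name__)  # sum() over nothing falls back to int 0
```

If the real output dictionary has no loss entry in recognition mode, the aggregated value would be a plain Python integer, which a gradient scaler would reject as not being a tensor.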

I suspect a mismatch between the metric and the model output when task='recognition'. The forward method of CLIPSimilarity_split (mmaction/models/recognizers/clip_similarity.py) takes a 'mode' argument, which I assume has to be set for validation. But since the MMAction Runner abstracts this away, how can I change it to 'predict' for validation/testing?
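
For reference, my understanding of the usual MMEngine pattern is that the step methods pick the mode themselves: the training step calls forward with mode='loss', and the validation/test steps call it with mode='predict'. The toy class below sketches that dispatch without importing MMEngine. `ToyRecognizer` and its return values are hypothetical, and I am only assuming that CLIPSimilarity_split follows the same pattern.

```python
class ToyRecognizer:
    """Hypothetical stand-in for a recognizer with a mode-switched forward."""

    def forward(self, inputs, mode='tensor'):
        if mode == 'loss':
            # Training: return a dict of named losses.
            return {'loss_cls': sum(inputs) / len(inputs)}
        if mode == 'predict':
            # Validation/testing: return one prediction record per sample.
            return [{'pred_score': x} for x in inputs]
        if mode == 'tensor':
            # Raw outputs without any post-processing.
            return list(inputs)
        raise ValueError(f'Unsupported mode: {mode}')

    def train_step(self, inputs):
        # The step method, not the user, selects the mode.
        return self.forward(inputs, mode='loss')

    def val_step(self, inputs):
        return self.forward(inputs, mode='predict')


model = ToyRecognizer()
train_out = model.train_step([0.2, 0.4])
val_out = model.val_step([0.2, 0.4])
print(sorted(train_out))  # keys returned in loss mode
print(len(val_out))       # one prediction record per input sample
```

If STAN works this way, the open question would be what the 'loss' and 'predict' branches return when task='recognition', rather than how to set the mode by hand.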

Suggestions and/or an example action recognition config would be very much appreciated. Thank you.
