clamsproject / app-swt-detection

CLAMS app for detecting scenes with text from video input
Apache License 2.0

new implementation for positional encoding #103

Closed · keighrim closed 4 months ago

keighrim commented 5 months ago

New Feature Summary

ATM we have implemented three positional encoding schemes in the trainer code, but the way they are implemented for training, as well as shipped in the app package (as checkpoint files), is very confusing for both devs and users. I'd like to address this problem by re-implementing positional encoding in a simpler, more robust way.

Current status (https://github.com/clamsproject/app-swt-detection/pull/26) and problems

  1. we have three schemes (fractional, sinusoidal-concat, sinusoidal-add) for positional encoding, but each behaves differently, which adds
    1. confusion for model developers in terms of the different dimensions of the vector representation (illustrated in the sketch after this list),
      • fractional adds 1 more dimension to the backbone CNN vectors
      • sinusoidal-concat adds an arbitrary number of dimensions (some power of 2, usually 256 or 512) to the backbone CNN vectors
      • sinusoidal-add does not add any dimensions
    2. confusion between the concepts of "absolute" and "relative" encoding. Sinusoidal encodings are inspired by the transformer paper, where the original notion of "position" was token position in natural language sentences. That made sense, since the main purpose of positional encoding in the transformer architecture is to capture the relative difference "between tokens" and thereby proxy a sequential representation. However, the current CNN-based SWT classification model is not at all sequential, hence the absoluteness of the sinusoidal encoding makes little sense here.
    3. difficulties in grid search and result probing, simply from adding too many variables to the equation.
    4. unnecessary limitations on the input data format (#100)
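
For illustration only, here is a minimal PyTorch-style sketch of how the three schemes differ in output dimensionality. The function names, signatures, and the default encoding width are made up for this sketch and are not the actual trainer code:

```python
import torch

def sinusoidal(pos: float, dim: int) -> torch.Tensor:
    # standard transformer-style sinusoidal vector of length `dim` (dim assumed even)
    i = torch.arange(dim // 2)
    angles = pos / torch.pow(10000, 2 * i / dim)
    return torch.stack([angles.sin(), angles.cos()], dim=1).flatten()

def fractional(feats: torch.Tensor, pos: int, total: int) -> torch.Tensor:
    # appends one extra dimension holding the frame's fractional position
    frac = torch.full((feats.shape[0], 1), pos / total)
    return torch.cat([feats, frac], dim=1)                # (N, D) -> (N, D+1)

def sinusoidal_concat(feats: torch.Tensor, pos: int, enc_dim: int = 256) -> torch.Tensor:
    # concatenates an encoding of a fixed, separately chosen width
    enc = sinusoidal(pos, enc_dim).expand(feats.shape[0], -1)
    return torch.cat([feats, enc], dim=1)                 # (N, D) -> (N, D+enc_dim)

def sinusoidal_add(feats: torch.Tensor, pos: int) -> torch.Tensor:
    # adds the encoding element-wise; dimensionality is unchanged
    return feats + sinusoidal(pos, feats.shape[1])        # (N, D) -> (N, D)
```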

Proposal

I'm proposing

  1. we always use "sinusoidal-add": this will not change the dimension of the feature vectors, and we will keep the CNN-backbone dims regardless of whether positional encoding is used
  2. we always use "relative" position: the absolute difference between time 00:00:07 and 00:00:08 is not relevant information for the current isolated image-based classifier.
  3. we expose an "on-off" switch for positional encoding as a runtime parameter (see the sketch after this list): this implies that we always train two models for a given set of hyperparams, one with sinusoidal-add and another without, and delegate the evaluation to a separate evaluation module, possibly as a part of https://github.com/clamsproject/aapb-evaluations/issues/43
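
A minimal sketch of what the proposed combination might look like; the function name, the `use_pos_enc` switch, and the `coeff` scaling factor are assumptions for illustration, not the app's actual API:

```python
import torch

def add_rel_position(feats: torch.Tensor, t_msec: int, total_msec: int,
                     use_pos_enc: bool = True, coeff: float = 1.0) -> torch.Tensor:
    """Sinusoidal-add encoding of the *relative* position (fraction of the
    video length), keeping the backbone CNN dimensions intact."""
    if not use_pos_enc:            # the proposed runtime on-off switch
        return feats
    dim = feats.shape[-1]          # assumed even
    rel = t_msec / total_msec      # relative position in [0, 1]
    i = torch.arange(dim // 2)
    angles = rel / torch.pow(10000, 2 * i / dim)
    enc = torch.stack([angles.sin(), angles.cos()], dim=1).flatten()
    return feats + coeff * enc     # output shape == input shape
```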

Related

No response

Alternatives

No response

Additional context

No response

keighrim commented 4 months ago

As we can approximate pos_enc_name="none" by using pos_enc_coeff=0, we decided to completely drop the pos_enc_name config key.
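
In terms of the hypothetical sketch above, a coefficient of 0 makes the added term vanish, which is why a separate "none" scheme is redundant:

```python
frame_feats = torch.randn(1, 512)
# coeff=0 zeroes out the encoding term, leaving the features unchanged
out = add_rel_position(frame_feats, t_msec=7000, total_msec=60000, coeff=0.0)
assert torch.allclose(out, frame_feats)
```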