clamsproject / app-swt-detection

CLAMS app for detecting scenes with text from video input
Apache License 2.0

new implementation for positional encoding #103

Closed · keighrim closed 4 months ago

keighrim commented 5 months ago

New Feature Summary

ATM we have implemented three positional encoding schemes in the trainer code, but the way they are implemented for training, as well as shipped in the app package (as checkpoint files), is very confusing for both devs and users. I'd like to address this problem by re-implementing positional encoding in a simpler, more robust way.

Current status (https://github.com/clamsproject/app-swt-detection/pull/26) and problems

  1. we have three schemes (fractional, sinusoidal-concat, sinusoidal-add) for positional encoding, but each behaves differently, which adds
    1. confusion for model developers in terms of the different dimensions of the vector representation (illustrated in the sketch after this list),
      • fractional adds 1 more dimension to the backbone CNN vectors
      • sinusoidal-concat adds an arbitrary number of dimensions (some power of 2, usually 256 or 512) to the backbone CNN vectors
      • sinusoidal-add does not add any dimensions
    2. confusion between the concepts of "absolute" and "relative" encoding. Sinusoidal encodings are inspired by the transformer paper, where the original notion of "position" was token position in natural language sentences. That made sense, since the main purpose of positional encoding in the transformer architecture is to capture the relative difference "between tokens" and thereby proxy a sequential representation. However, the current CNN-based SWT classification model is not at all sequential, hence the absoluteness of the sinusoidal encoding makes little sense here.
    3. difficulties in grid search and result probing, simply from adding too many variables to the equation.
    4. unnecessary limitations on the input data format (#100)
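
For illustration only, here is a minimal PyTorch-style sketch of how the three schemes differ in output dimensionality. The function names, signatures, and the default encoding width are made up for this sketch and are not the actual trainer code:

```python
import torch

def sinusoidal(pos: float, dim: int) -> torch.Tensor:
    # standard transformer-style sinusoidal vector of length `dim` (dim assumed even)
    i = torch.arange(dim // 2)
    angles = pos / torch.pow(10000, 2 * i / dim)
    return torch.stack([angles.sin(), angles.cos()], dim=1).flatten()

def fractional(feats: torch.Tensor, pos: int, total: int) -> torch.Tensor:
    # appends one extra dimension holding the frame's fractional position
    frac = torch.full((feats.shape[0], 1), pos / total)
    return torch.cat([feats, frac], dim=1)                # (N, D) -> (N, D+1)

def sinusoidal_concat(feats: torch.Tensor, pos: int, enc_dim: int = 256) -> torch.Tensor:
    # concatenates an encoding of a fixed, separately chosen width
    enc = sinusoidal(pos, enc_dim).expand(feats.shape[0], -1)
    return torch.cat([feats, enc], dim=1)                 # (N, D) -> (N, D+enc_dim)

def sinusoidal_add(feats: torch.Tensor, pos: int) -> torch.Tensor:
    # adds the encoding element-wise; dimensionality is unchanged
    return feats + sinusoidal(pos, feats.shape[1])        # (N, D) -> (N, D)
```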

Proposal

I'm proposing

  1. we always use "sinusoidal-add": this will not change the dimension of the feature vectors, and we will keep the CNN-backbone dims regardless of whether positional encoding is used
  2. we always use "relative" position: the absolute difference between time 00:00:07 and 00:00:08 is not relevant information for the current isolated image-based classifier.
  3. we expose an "on-off" switch for positional encoding as a runtime parameter (see the sketch after this list): this implies that we always train two models for a given set of hyperparams, one with sinusoidal-add and another without, and delegate the evaluation to a separate evaluation module, possibly as a part of https://github.com/clamsproject/aapb-evaluations/issues/43
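
A minimal sketch of what the proposed combination might look like; the function name, the `use_pos_enc` switch, and the `coeff` scaling factor are assumptions for illustration, not the app's actual API:

```python
import torch

def add_rel_position(feats: torch.Tensor, t_msec: int, total_msec: int,
                     use_pos_enc: bool = True, coeff: float = 1.0) -> torch.Tensor:
    """Sinusoidal-add encoding of the *relative* position (fraction of the
    video length), keeping the backbone CNN dimensions intact."""
    if not use_pos_enc:            # the proposed runtime on-off switch
        return feats
    dim = feats.shape[-1]          # assumed even
    rel = t_msec / total_msec      # relative position in [0, 1]
    i = torch.arange(dim // 2)
    angles = rel / torch.pow(10000, 2 * i / dim)
    enc = torch.stack([angles.sin(), angles.cos()], dim=1).flatten()
    return feats + coeff * enc     # output shape == input shape
```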

Related

No response

Alternatives

No response

Additional context

No response

keighrim commented 4 months ago

As we can approximate pos_enc_name="none" by using pos_enc_coeff=0, we decided to completely drop the pos_enc_name config key.
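
In terms of the hypothetical sketch above, a coefficient of 0 makes the added term vanish, which is why a separate "none" scheme is redundant:

```python
frame_feats = torch.randn(1, 512)
# coeff=0 zeroes out the encoding term, leaving the features unchanged
out = add_rel_position(frame_feats, t_msec=7000, total_msec=60000, coeff=0.0)
assert torch.allclose(out, frame_feats)
```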