New Feature Summary

ATM we have three positional encoding schemes implemented in the trainer code, but the way they are implemented for training, as well as shipped in the app package (as checkpoint files), is very confusing for both devs and users. I'd like to address this problem by re-implementing positional encoding in a simpler, more robust way.

Current status (https://github.com/clamsproject/app-swt-detection/pull/26) and problems
confusion for model developers in terms of the different dimensions of the vector representations (a sketch illustrating these dimension effects follows this list):
fractional adds 1 more dimension to the backbone CNN vectors
sinusoidal-concat adds an arbitrary number of dimensions (some power of 2, usually 256 or 512) to the backbone CNN vectors
sinusoidal-add does not add any additional dimensions
confusion between the concepts of "absolute" and "relative" encoding. Sinusoidal encodings are inspired by the transformer paper, where the original notion of "position" was token position in natural-language sentences. That made sense there, since the main purpose of positional encoding in the transformer architecture is to provide relative differences "between tokens" and to proxy a sequential representation. However, the current CNN-based SWT classification model is not sequential at all, so the absoluteness of sinusoidal encodings makes little sense here.
difficulties in grid search and result probing, simply because the three schemes add too much variation to the equation
unnecessary limitation on the input data format (#100)
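To make the dimension problem concrete, below is a minimal sketch of how each scheme changes a feature vector. This is not the actual trainer code: the backbone size (4096) and all position values are made up for illustration, and `sinusoidal_table` assumes an even `dim`.

```python
import math

import torch


def sinusoidal_table(max_pos: int, dim: int) -> torch.Tensor:
    """Transformer-style table: PE(pos, 2i)   = sin(pos / 10000^(2i/dim)),
                                PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))."""
    pos = torch.arange(max_pos, dtype=torch.float).unsqueeze(1)   # (max_pos, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
    table = torch.zeros(max_pos, dim)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table


feat = torch.randn(4096)           # stand-in for a CNN-backbone feature vector

# "fractional": append one scalar (current position / total length) -> dim + 1
fractional = torch.cat([feat, torch.tensor([0.25])])
print(fractional.shape)            # torch.Size([4097])

# "sinusoidal-concat": append a k-dim sinusoidal vector (k = 256 or 512) -> dim + k
pe = sinusoidal_table(max_pos=10000, dim=512)
concatenated = torch.cat([feat, pe[1500]])
print(concatenated.shape)          # torch.Size([4608])

# "sinusoidal-add": add a dim-sized sinusoidal vector elementwise -> dim unchanged
pe = sinusoidal_table(max_pos=10000, dim=feat.shape[0])
added = feat + pe[1500]
print(added.shape)                 # torch.Size([4096])
```

The docstring of `sinusoidal_table` also spells out the absolute, token-position-style formula from the transformer paper that the "absolute" vs. "relative" point above refers to.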
Proposal
I'm proposing the following:
we always use "sinusoidal-add": this does not change the dimension of the feature vectors, so we keep the CNN-backbone dims regardless of whether positional encoding is used
we always use "relative" position: the absolute difference between time 00:00:07 and 00:00:08 is not relevant information for the current classifier, which works on isolated images
we expose an "on-off" switch for positional encoding as a runtime parameter: this implies that we always train two models for a given set of hyperparameters - one with sinusoidal-add, and another without - and delegate the evaluation to a separate evaluation module, possibly as a part of https://github.com/clamsproject/aapb-evaluations/issues/43 (a sketch of what this could look like follows this list)
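As a rough sketch of what the simplified interface could look like (every name here, e.g. `encode_position`, `use_pos_enc`, and `granularity`, is hypothetical, and an even backbone dimension is assumed):

```python
import math

import torch


def encode_position(feat: torch.Tensor, cur_sec: float, total_sec: float,
                    use_pos_enc: bool = True, granularity: int = 1000) -> torch.Tensor:
    """Sinusoidal-add over the *relative* position (cur_sec / total_sec),
    quantized into `granularity` buckets; the backbone dimension never changes.
    `use_pos_enc` is the proposed runtime on/off switch."""
    if not use_pos_enc:
        return feat
    dim = feat.shape[0]
    rel_pos = round(cur_sec / total_sec * (granularity - 1))     # 0 .. granularity - 1
    i = torch.arange(0, dim, 2, dtype=torch.float)
    div = torch.exp(i * (-math.log(10000.0) / dim))
    enc = torch.zeros(dim)
    enc[0::2] = torch.sin(rel_pos * div)
    enc[1::2] = torch.cos(rel_pos * div)
    return feat + enc


# the two training runs per hyperparameter set differ only in the switch
feat = torch.randn(4096)
with_pe = encode_position(feat, cur_sec=7.0, total_sec=1800.0, use_pos_enc=True)
without_pe = encode_position(feat, cur_sec=7.0, total_sec=1800.0, use_pos_enc=False)
assert with_pe.shape == without_pe.shape == feat.shape           # dims always identical
```

Since both variants produce vectors of the same shape, checkpoint files would no longer need to carry scheme-specific dimension bookkeeping, and the two models can be compared head-to-head by the evaluation module.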
Related
No response
Alternatives
No response
Additional context
No response