The attention models require that `attention_in_features` be divisible by `n_heads`. This is problematic for the genetic tuning, as it does not always choose these arguments so that they satisfy that requirement. It would be much nicer if all reasonable argument settings were valid. `attention_in_features` should be replaced with another argument that is used to compute `attention_in_features` by multiplying it by `n_heads`.
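
For illustration, a minimal sketch of what the replacement could look like, assuming a PyTorch-style attention module; the argument name `head_features` and the class shown are hypothetical, not the project's actual API:

```python
import torch.nn as nn


class Attention(nn.Module):
    # `head_features` is a hypothetical per-head size argument; the total
    # width is derived from it, so every (head_features, n_heads) pair the
    # genetic tuner picks is valid by construction.
    def __init__(self, head_features: int, n_heads: int):
        super().__init__()
        attention_in_features = head_features * n_heads
        self.mha = nn.MultiheadAttention(attention_in_features, n_heads)
```

With this parameterization the divisibility constraint can never be violated, because `attention_in_features` is always an exact multiple of `n_heads`.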