center-for-humans-and-machines / transformer-heads

Toolkit for attaching, training, saving and loading of new heads for transformer models
https://transformer-heads.readthedocs.io/en/latest/

Question regarding output_activation="linear" #13

Closed by ArchchanaKugathasan 1 week ago

ArchchanaKugathasan commented 1 week ago

```python
HeadConfig(
    name=f"num_tokens_regression",
    layer_hook=-7,
    hidden_size=128,  # MLP hidden size
    num_layers=3,  # 2 hidden layers in MLP
    in_size=hidden_size,
    output_activation="linear",
    is_causal_lm=False,
    pred_for_sequence=False,
    loss_fct="mse",
    num_outputs=1,
    is_regression=True,
    loss_weight=0.0002,
)
```

In the above configuration from one of your scripts (joint_multitask_learning.ipynb), the head is defined with num_layers=3, output_activation="linear", and num_outputs=1.

  1. Does output_activation ("linear") apply only to the last (output) layer of the transformer head, or is the linear activation applied to all three layers, including the hidden layers?

  2. Where is the linear activation defined in the transformer head code?

yannikkellerde commented 1 week ago

"Linear" commonly means "no activation". It is defined in the activation_map, which is used in MLPHead.

The output activation only specifies the activation after the last layer, i.e. it determines your output range (linear for any real number, sigmoid for 0-1, ReLU for positive numbers).

The hidden activation is currently hard-coded to ReLU; I do not know of many situations where one benefits significantly from using something else.
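As an illustration only (this is not the library's actual MLPHead code; activation_map and build_mlp_head here are hypothetical stand-ins), a head with num_layers=3 and output_activation="linear" would be structured roughly like this:

```python
import torch.nn as nn

# Hypothetical mapping from output_activation strings to modules;
# "linear" maps to an identity, i.e. no nonlinearity after the last layer.
activation_map = {
    "linear": nn.Identity,
    "sigmoid": nn.Sigmoid,
    "relu": nn.ReLU,
}

def build_mlp_head(in_size, hidden_size, num_layers, num_outputs, output_activation):
    """Rough sketch: ReLU between hidden layers (hard-coded),
    output_activation applied only after the final layer."""
    layers = []
    size = in_size
    for _ in range(num_layers - 1):
        layers += [nn.Linear(size, hidden_size), nn.ReLU()]  # hidden activation: ReLU
        size = hidden_size
    layers += [nn.Linear(size, num_outputs), activation_map[output_activation]()]
    return nn.Sequential(*layers)

# The config above: num_layers=3, hidden_size=128, num_outputs=1, linear output.
# 4096 is just a placeholder for the base model's hidden size (in_size).
head = build_mlp_head(in_size=4096, hidden_size=128, num_layers=3,
                      num_outputs=1, output_activation="linear")
```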

ArchchanaKugathasan commented 1 week ago

Thank you very much for the clarification :)

I have these follow-up questions too.

  1. If we haven't defined num_layers in the head_configs, as shown below, does it mean that this transformer head has only one layer, which is a linear layer, and that the activation for this layer can be set with the output_activation variable?

  2. If num_layers defaults to 1, the hard-coded ReLU activation will not be applied to this layer, because it is the output layer. Correct?

```python
head_configs = [
    HeadConfig(
        name="mean_regression",
        layer_hook=-11,
        in_size=hidden_size,
        output_activation="linear",
        is_causal_lm=False,
        pred_for_sequence=True,
        loss_fct="mse",
        num_outputs=1,  # Single value (mean)
        is_regression=True,
        loss_weight=0.002,
    ),
]
```

yannikkellerde commented 1 week ago

Yes, exactly correct.
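As a minimal sketch of what that implies (illustrative only, not the library's code, and assuming the default num_layers=1), the head collapses to a single linear projection with no ReLU anywhere:

```python
import torch.nn as nn

hidden_size = 4096  # placeholder for the base model's hidden size (in_size)

# With num_layers=1 there are no hidden layers, so the hard-coded ReLU never
# appears; output_activation="linear" adds no nonlinearity after the projection.
single_layer_head = nn.Sequential(
    nn.Linear(hidden_size, 1),  # in_size -> num_outputs
    nn.Identity(),              # "linear" output activation == no activation
)
```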

ArchchanaKugathasan commented 1 week ago

Thank you :)