Closed: ArchchanaKugathasan closed this 1 week ago
"linear" commonly means "no activation". It is defined in the activation_map that is used in MLPHead.
output_activation only specifies the activation after the last layer, i.e. it specifies your output format (linear for any real number, sigmoid for values in 0-1, and ReLU for positive numbers).
The hidden activation is currently hard-coded to ReLU. I guess I do not know many situations where one benefits significantly from using something else.
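For illustration, here is a minimal sketch of that pattern (the contents of activation_map and the internals of MLPHead shown here are assumptions for clarity, not the repo's exact code):

```python
import torch.nn as nn

# Assumed mapping: "linear" resolves to a no-op (identity), i.e. no activation.
activation_map = {
    "linear": nn.Identity,
    "sigmoid": nn.Sigmoid,
    "relu": nn.ReLU,
}

class MLPHead(nn.Module):
    """Sketch of an MLP head: hidden layers use a hard-coded ReLU,
    the output layer uses the configured output_activation."""

    def __init__(self, in_size, hidden_size, num_outputs,
                 num_layers=1, output_activation="linear"):
        super().__init__()
        layers = []
        size = in_size
        # num_layers - 1 hidden layers, each followed by the hard-coded ReLU
        for _ in range(num_layers - 1):
            layers += [nn.Linear(size, hidden_size), nn.ReLU()]
            size = hidden_size
        # final (output) layer, followed only by the configured output activation
        layers += [nn.Linear(size, num_outputs), activation_map[output_activation]()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```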
Thank you very much for the clarification :)
I have some follow-up questions too.
If we haven't defined num_layers in head_configs, as shown below, does it mean that this transformer head has only one layer, which is a linear layer? And can the activation for this layer be defined with the output_activation variable?
If num_layers defaults to 1, then the hard-coded ReLU activation will not be applied to this layer, because it is the output layer. Correct?
```python
head_configs = [
    HeadConfig(
        name="mean_regression",
        layer_hook=-11,
        in_size=hidden_size,
        output_activation="linear",
        is_causal_lm=False,
        pred_for_sequence=True,
        loss_fct="mse",
        num_outputs=1,  # Single value (mean)
        is_regression=True,
        loss_weight=0.002,
    ),
]
```
Yes, exactly correct.
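To make that concrete with the sketch above (again, an assumed structure, not the repo's exact code): with num_layers=1 the hidden-layer loop never runs, so the head is just one Linear layer followed by the output activation, and the hard-coded ReLU never appears. The in_size of 768 below is an arbitrary placeholder.

```python
# With num_layers=1 (the assumed default), the head reduces to:
#   nn.Linear(in_size, num_outputs) followed by activation_map["linear"]() == nn.Identity()
# so no ReLU is applied anywhere.
head = MLPHead(in_size=768, hidden_size=128, num_outputs=1,
               num_layers=1, output_activation="linear")
print(head.net)
# Sequential(
#   (0): Linear(in_features=768, out_features=1, bias=True)
#   (1): Identity()
# )
```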
Thank you :)
```python
HeadConfig(
    name=f"num_tokens_regression",
    layer_hook=-7,
    hidden_size=128,  # MLP hidden size
    num_layers=3,     # 2 hidden layers in the MLP
    in_size=hidden_size,
    output_activation="linear",
    is_causal_lm=False,
    pred_for_sequence=False,
    loss_fct="mse",
    num_outputs=1,
    is_regression=True,
    loss_weight=0.0002,
)
```
In the above configuration from one of your scripts (joint_multitask_learning.ipynb), it is defined with num_layers=3, output_activation="linear", and num_outputs=1.
Here, does output_activation ("linear") apply only to the last layer (the output layer) of the transformer head? Or is the linear activation applied to all three layers, including the hidden layers of the transformer head?
Where do you define the linear activation in the transformer head code?
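For reference, plugging the num_layers=3 configuration into the sketch from earlier in the thread would give a stack like the one below (again an assumed structure, not the repo's actual code). It illustrates what was explained above: the "linear" (identity) activation appears only after the final layer, while the hidden layers keep the hard-coded ReLU. The in_size of 768 is an arbitrary placeholder.

```python
head = MLPHead(in_size=768, hidden_size=128, num_outputs=1,
               num_layers=3, output_activation="linear")
print(head.net)
# Sequential(
#   (0): Linear(in_features=768, out_features=128, bias=True)
#   (1): ReLU()
#   (2): Linear(in_features=128, out_features=128, bias=True)
#   (3): ReLU()
#   (4): Linear(in_features=128, out_features=1, bias=True)
#   (5): Identity()
# )
```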