Open bqm1111 opened 3 months ago
Yes, we used the same dimension and the same number of learnable tokens in every stage. We ran ablation studies on both the dimension and the number of learnable tokens, as shown in Table 6. We did not vary either of them across stages.
I don't see any mention of a learning rate scheduler. Did you keep the learning rate fixed throughout training? Could you elaborate on your training process? How did you initialize the learnable tokens?
Yes, we kept the learning rate fixed (no scheduler), and we initialized the learnable tokens randomly.
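For readers who want to see what that could look like in practice, here is a minimal sketch, not the authors' code. It assumes a zero-mean Gaussian initialization with a hypothetical standard deviation of 0.02, a plain SGD update, and made-up sizes (`NUM_STAGES`, `NUM_TOKENS`, `DIM`); the thread only confirms that the initialization is random, the learning rate is fixed, and the token count and dimension are the same in every stage.

```python
import random


def init_tokens(num_tokens, dim, std=0.02, seed=0):
    """Draw a (num_tokens x dim) table of tokens from N(0, std^2).

    The Gaussian and std=0.02 are assumptions; the thread only says "randomly".
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_tokens)]


def constant_lr(step, base_lr=1e-4):
    """Fixed learning rate: the step index is ignored, so there is no scheduler."""
    return base_lr


def sgd_step(tokens, grads, lr):
    """Plain SGD update, applied in place to every token entry (assumed optimizer)."""
    for tok, grad in zip(tokens, grads):
        for j in range(len(tok)):
            tok[j] -= lr * grad[j]


# Hypothetical sizes; the same count and dimension are reused for every stage.
NUM_STAGES, NUM_TOKENS, DIM = 4, 8, 16
stage_tokens = [init_tokens(NUM_TOKENS, DIM, seed=s) for s in range(NUM_STAGES)]
```

Each stage gets its own independently drawn token table of identical shape, and `constant_lr` returns the same value at every step, which is all "fixed learning rate" means here.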
Did you use the same dimension and number of learnable tokens in every stage? Did you conduct any ablation study on how varying these hyperparameters affects overall performance?