chijames / KERPLE

Question regarding the range of bias_p #3

Closed: zijwang closed this issue 8 months ago

zijwang commented 1 year ago

Thanks for the great work! Can you elaborate on why bias_p should be initialized within [0, 2] rather than with some other maximum value? https://github.com/chijames/KERPLE/blob/73f1ee2d24722cdbe277d473899321218f5cbd00/megatron/mpu/layers.py#L226

chijames commented 1 year ago

Hi,

Thank you for your interest in our work!

According to Corollary 1(a) of our paper, we need $0 < p \leq 2$ so that the resulting kernel is conditionally positive definite (CPD). For the log variant in Corollary 1(b), we envision that $b$ (equivalent to the self.bias_p you linked) plays a role similar to that of $p$ in 1(a), which is why we initialize it between 0 and 2. Since this is not a strict requirement, we do not restrict its range after initialization.
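For illustration, here is a minimal PyTorch sketch of such an initialization; the names and shapes are hypothetical and this is not the exact code at the linked layers.py line:

```python
import torch
import torch.nn as nn

# Illustrative sketch only (not the Megatron code at layers.py#L226):
# draw the initial per-head value of bias_p uniformly from [0, 2),
# mirroring the 0 < p <= 2 condition of Corollary 1(a).
num_heads = 12  # hypothetical number of attention heads
bias_p = nn.Parameter(torch.rand(num_heads, 1, 1) * 2.0)
```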

Thanks.

zijwang commented 1 year ago

Thanks @chijames! I also have a somewhat related question: in KERPLE the lr seems to decay to 0 (https://github.com/chijames/KERPLE/blob/main/kerple_configs/train.yml#L27), whereas in gpt-neox the lr decays to 10% of the max lr (https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#LL48C4-L48C10). Is there a reason not to decay to 10% of the max lr as other work does? I am seeing some negative bias_a and bias_p values after training the KERPLE-log variant, and I am using the 10% max-lr decay. Do you think there is any relationship between the two?

chijames commented 1 year ago

Hi,

Regarding the lr decay question: the config file you cited is for the 20B model training. We can only afford to train 125M models and, occasionally, 1.3B models. For 125M models, the original config file also does not set the min lr to 10% of the max lr.
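For concreteness, here is a hedged Python sketch of a standard cosine schedule with a min-lr floor, showing the difference between decaying to 0 and decaying to 10% of the max lr; the helper and its numbers are illustrative, not the gpt-neox implementation:

```python
import math

def cosine_lr(step, max_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Linear warmup followed by cosine decay from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * min(1.0, progress)))

# At the final step: decaying to zero vs. decaying to 10% of the max lr.
print(cosine_lr(100_000, 100_000, 6e-4, min_lr=0.0))   # -> 0.0
print(cosine_lr(100_000, 100_000, 6e-4, min_lr=6e-5))  # -> 6e-05
```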

Regarding the negative bias_a and bias_p values: they should be protected by the clamp operations. Could you please verify that those clamp operations are doing their job?

zijwang commented 1 year ago

Yes, I played with clamp and it seems to work fine. The only (?) difference is that I am using PyTorch 1.12 whereas you are using 1.8, but I don't think that should matter. The clamp is done in forward, but after backward there is no further clamping, so the model ckpt can end up with negative bias_a/bias_p values?
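To make this concrete, here is a toy PyTorch example (not the KERPLE code): the clamped value is what the forward pass sees, but the optimizer updates the raw parameter, which can go negative and is what gets written to the checkpoint.

```python
import torch
import torch.nn as nn

# Toy example, not the KERPLE code: clamp only where the value is *used*.
p = nn.Parameter(torch.tensor(0.01))
opt = torch.optim.SGD([p], lr=1.0)

for _ in range(3):
    eff = p.clamp(min=1e-6)   # value used in the forward pass, always >= 1e-6
    loss = eff                # toy loss that pushes the raw parameter downward
    opt.zero_grad()
    loss.backward()
    opt.step()

print(p.item())  # the raw parameter (what lands in the ckpt) is now negative
```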

chijames commented 1 year ago

Yes, you are right about the forward/backward part. May I know what their numerical values are? If they are tiny, I think it is safe to set them to 0 as long as doing so does not impact performance. Also, did you observe deteriorated extrapolation performance due to the negative bias_a/bias_p values?
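If one did want to zero them out post hoc, a rough sketch could look like the following; it assumes a flat state_dict with hypothetical file names, whereas the real Megatron checkpoint layout is nested, so the key access would need adjusting:

```python
import torch

# Hypothetical post-hoc fix (not part of the KERPLE repo): clamp slightly
# negative bias_a / bias_p entries back to 0 in a checkpoint and re-save.
state_dict = torch.load("checkpoint.pt", map_location="cpu")
for name, tensor in state_dict.items():
    if "bias_a" in name or "bias_p" in name:
        state_dict[name] = tensor.clamp(min=0.0)
torch.save(state_dict, "checkpoint_clamped.pt")
```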