chenxingphh / text-classification-pytorch

The most common text classification models, implemented in PyTorch
MIT License

A question about the Transformer-XL model #1

Closed: whoisltd closed this issue 1 year ago

whoisltd commented 2 years ago

Hi @chenxingphh, I don't see the memory input of the previous hidden state in your model. Why don't you use it? In the Transformer-XL paper, I see that the authors use it.

(image attached)

Is it not needed for the text-classification task? Can you explain this to me? Thanks!

chenxingphh commented 2 years ago

> Hi @chenxingphh, I don't see the memory input of the previous hidden state in your model. Why don't you use it? In the Transformer-XL paper, I see that the authors use it.
>
> (image attached)
>
> Is it not needed for the text-classification task? Can you explain this to me? Thanks!

The reason Transformer-XL uses the previous hidden states (the memory) is that its inputs are usually long, for example an entire paragraph, so caching the hidden states of the previous segment extends the effective context length. In this text-classification setting the inputs are all short, so the memory is not needed.
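
For reference, here is a minimal sketch (my own illustration, not taken from this repository) of how the segment-level recurrence works conceptually: the cached hidden states of the previous segment are concatenated to the current segment's hidden states to form the keys and values, while the queries come only from the current segment. All names below are illustrative:

```python
import torch

def concat_memory(hidden, mems):
    """Concatenate the cached hidden states of the previous segment with the
    current segment's hidden states along the time axis.

    hidden: (cur_len, batch, d_model) current segment
    mems:   (mem_len, batch, d_model) cached memory; detached so no gradient
            flows back into the previous segment, as in the Transformer-XL paper
    """
    if mems is None:
        return hidden
    return torch.cat([mems.detach(), hidden], dim=0)

# Illustrative usage: keys/values see memory + current tokens, queries do not.
cur_len, mem_len, batch, d_model = 8, 16, 2, 32
hidden = torch.randn(cur_len, batch, d_model)
mems = torch.randn(mem_len, batch, d_model)

kv_input = concat_memory(hidden, mems)   # (mem_len + cur_len, batch, d_model)
q_input = hidden                         # (cur_len, batch, d_model)
print(kv_input.shape, q_input.shape)
```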

whoisltd commented 2 years ago

Thanks for your answer 🙌

whoisltd commented 2 years ago

Hello again, @chenxingphh! I see that the relative position embedding in your Transformer-XL is different from the author's code. Also, when I use the memory from the previous segment, the loss does not decrease. Can you explain this? I'm running into trouble building this model in TensorFlow :( Thanks!

chenxingphh commented 2 years ago

> Hello again, @chenxingphh! I see that the relative position embedding in your Transformer-XL is different from the author's code. Also, when I use the memory from the previous segment, the loss does not decrease. Can you explain this? I'm running into trouble building this model in TensorFlow :( Thanks!

I'm not familiar with TF. As for why the loss does not decrease, I would first check whether the learning rate is set too small, and whether the model's parameters are actually being updated during training. For the relative position embedding implemented here, you can check the code against the feed-forward calculation formula of the relative position embedding shown below (the core code is lines 131 to 139 in model_transformer_xl.py). I hope this reply helps.

(formula image attached)
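
As a point of comparison, here is a minimal single-head sketch (my own illustration, not the code from model_transformer_xl.py) of the four-term relative attention decomposition from the Transformer-XL paper, including the left-shift trick used in the official implementation. All tensor names and sizes are illustrative:

```python
import torch

def rel_shift(x):
    """Left-shift trick from Transformer-XL: turns scores indexed by the
    absolute index of the relative embedding into scores indexed by the
    relative distance (i - j). x: (qlen, klen)."""
    qlen, klen = x.size()
    zero_pad = torch.zeros(qlen, 1, dtype=x.dtype)
    x_padded = torch.cat([zero_pad, x], dim=1)   # (qlen, klen + 1)
    x_padded = x_padded.view(klen + 1, qlen)     # reinterpret the buffer
    return x_padded[1:].view(qlen, klen)         # drop the padding row

# Toy sizes; single head, no batch, just to show the decomposition.
qlen, klen, d = 4, 6, 8           # klen = mem_len + qlen in the real model
q = torch.randn(qlen, d)          # W_q E_x: queries from the current segment
k = torch.randn(klen, d)          # W_{k,E} E_x: content keys (memory + current)
r = torch.randn(klen, d)          # W_{k,R} R: projected relative position embeddings
u = torch.randn(d)                # global content bias
v = torch.randn(d)                # global position bias

# Terms (a) + (c): content-based score plus global content bias
AC = (q + u) @ k.t()              # (qlen, klen)
# Terms (b) + (d): position-based score plus global position bias, then shift
BD = rel_shift((q + v) @ r.t())   # (qlen, klen)

attn_score = (AC + BD) / d ** 0.5
attn_prob = torch.softmax(attn_score, dim=-1)
print(attn_prob.shape)            # torch.Size([4, 6])
```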