Closed whoisltd closed 1 year ago
Hi @chenxingphh, I don't see the memory input of the previous hidden state in your model. Why don't you use it? In the Transformer-XL paper, I see the authors used it.
Is it not needed for the text-classification task? Can you explain this to me? Thanks.
The reason Transformer-XL uses the previous hidden state is that the inputs in machine translation datasets are usually long, for example a whole paragraph, so the effective context is extended by reusing the previous hidden state as memory. Here the inputs are all short, so it is not needed.
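To make the mechanism concrete, here is a minimal PyTorch sketch (not code from this repo) of what the segment-level memory does: the previous segment's hidden states are cached without gradients and prepended to the current segment, so attention can reach back beyond the current input. The names `hidden`, `memory`, and `mem_len` are illustrative assumptions.

```python
import torch

def concat_memory(hidden, memory):
    """Prepend the cached hidden states of the previous segment (the "memory")
    to the current segment along the time dimension."""
    if memory is None:
        return hidden
    # hidden: (cur_len, batch, d_model), memory: (mem_len, batch, d_model)
    return torch.cat([memory.detach(), hidden], dim=0)  # no gradient flows into the cache

def update_memory(hidden, memory, mem_len):
    """Keep the last `mem_len` time steps as the memory for the next segment."""
    with torch.no_grad():
        full = hidden if memory is None else torch.cat([memory, hidden], dim=0)
        return full[-mem_len:]
```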
Thanks for your answer 🙌
Hello again, @chenxingphh! I see that the relative position embedding in your Transformer-XL differs from the authors' code. Also, when I use the memory from the previous segment, the loss does not decrease. Can you explain this? I've been running into trouble building this model with TensorFlow :( Thanks!!!
I'm not familiar with TF. As for why the loss does not decrease, I think you can check whether the learning rate is set too small, and whether the model's parameters are actually being updated during training. For the implemented relative position embedding, I suggest reading the implementation code together with the feed-forward calculation formula of relative position embedding as follows (the core code is lines 131 to 139 in model_transformer_xl.py). Hope this reply helps you.
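For reference, here is a minimal PyTorch sketch of that formula, i.e. the (a)+(b)+(c)+(d) decomposition of the relative-position attention score from the Transformer-XL paper. It is a simplified illustration rather than a copy of lines 131 to 139 of model_transformer_xl.py; the names `r` (projected relative position embeddings) and `u`, `v` (the two learned global biases) are assumptions.

```python
import torch

def rel_shift(x):
    # The standard "relative shift" trick: pad one column of zeros, reinterpret
    # the memory layout, and drop the first row, so that row i of the score
    # matrix is shifted to line up R_{i-j} with key position j.
    q_len, k_len, b, n = x.size()
    zero_pad = torch.zeros(q_len, 1, b, n, dtype=x.dtype, device=x.device)
    x_padded = torch.cat([zero_pad, x], dim=1)        # (q_len, k_len + 1, b, n)
    x_padded = x_padded.view(k_len + 1, q_len, b, n)  # reinterpret the layout
    return x_padded[1:].view(q_len, k_len, b, n)      # drop the shifted-in zeros

def rel_attention_score(q, k, r, u, v):
    # q: (q_len, batch, n_head, d_head)  queries of the current segment
    # k: (k_len, batch, n_head, d_head)  keys over memory + current segment
    # r: (k_len, n_head, d_head)         projected relative position embeddings
    # u, v: (n_head, d_head)             learned biases shared across positions
    AC = torch.einsum('ibnd,jbnd->ijbn', q + u, k)  # content terms (a) + (c)
    BD = torch.einsum('ibnd,jnd->ijbn', q + v, r)   # position terms (b) + (d)
    BD = rel_shift(BD)                              # convert to relative offsets
    return (AC + BD) / (q.size(-1) ** 0.5)          # scale by sqrt(d_head)
```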