Closed zachares closed 1 year ago
Hi, you are right. Ideally, there should be a mask that removes the effect of padding tokens in the loss. However, we experimented with both settings (masking and no masking) and found no significant difference in performance. This version of the code omits the mask to keep the code less cluttered.
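For anyone following along, a minimal sketch of what such a mask looks like in PyTorch: `F.cross_entropy` with `ignore_index` drops padded positions from the loss. The pad id and tensor shapes here are illustrative, not taken from the repo.

```python
import torch
import torch.nn.functional as F

# Hypothetical pad id; in the actual code it would come from the tokenizer.
PAD_ID = 0

def masked_lm_loss(logits, target):
    """Cross-entropy over the sequence that skips padding positions."""
    # logits: (batch, seq_len, vocab), target: (batch, seq_len)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target.view(-1),
        ignore_index=PAD_ID,  # padded positions contribute nothing to the loss
    )

torch.manual_seed(0)
logits = torch.randn(2, 5, 10)
target = torch.randint(1, 10, (2, 5))
target[:, 3:] = PAD_ID  # simulate padding at the tail of each sequence

full = F.cross_entropy(logits.view(-1, 10), target.view(-1))
masked = masked_lm_loss(logits, target)
print(full.item(), masked.item())
```

The masked version averages only over real tokens, so with long padded batches the two losses can diverge noticeably even if final metrics end up similar.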
Thank you for the information, that is useful to know and an interesting result! I think I will continue to have questions about the code / work; do you think GitHub issues is the best place to post them?
Two other questions I have are: 1) in your models you do not expand the vocabulary size of the language model to incorporate prediction of the new tokens you add to the tokenizer. Is this intentional? 2) have you ever had to clip gradients in order to make model training stable?
Hi, great questions!
Yes, this was intentional. The reason was to keep the pre-trained T5 model as is. However, it might be a good experiment to check how the model would perform if you add `<__no_node__>`, etc. to the vocabulary and fine-tune, for example, on WebNLG. I suspect the difference would be small, but I might be wrong.
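For reference, with Hugging Face Transformers this is usually `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`. Below is a pure-PyTorch sketch of what that resize does under the hood; the table sizes and the extra token names (`<__no_node__>`, a second hypothetical special token) are illustrative, not from the repo.

```python
import torch
import torch.nn as nn

def resize_embedding(old_emb: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
    """Grow an embedding table, keeping the pre-trained rows intact
    (roughly what model.resize_token_embeddings does in transformers)."""
    new_emb = nn.Embedding(new_num_tokens, old_emb.embedding_dim)
    with torch.no_grad():
        # Copy pre-trained weights; the appended rows keep their random init
        # and are learned during fine-tuning.
        new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    return new_emb

# Illustrative sizes: a T5-like table of 32100 tokens, plus 2 new special tokens.
old = nn.Embedding(32100, 16)
new = resize_embedding(old, 32100 + 2)
print(new.weight.shape)  # torch.Size([32102, 16])
```

Whether the small number of newly initialized rows changes WebNLG results much is exactly the open experiment mentioned above.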
No, I have not clipped any gradients, but since the code is based on PyTorch Lightning, you can just add `--gradient_clip_val value` and see how it affects the training.
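If it helps to see what that flag does: by default Lightning clips by global norm, which corresponds to `torch.nn.utils.clip_grad_norm_`. A standalone sketch (the toy model and the threshold of 1.0 are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
# Deliberately large inputs so the gradients blow up.
loss = model(torch.randn(4, 8) * 100).pow(2).mean()
loss.backward()

# This is what gradient_clip_val does each step (norm-based clipping):
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)  # the pre-clipping gradient norm

clipped = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(clipped)  # rescaled to <= max_norm when the original norm exceeded it
```

So trying a few values of `--gradient_clip_val` is cheap, and the pre-clipping norm that `clip_grad_norm_` returns is also a useful thing to log when diagnosing instability.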
Thank you for all the information!
I think the loss on this [line](https://github.com/IBM/Grapher/blob/main/misc/utils.py#L19) should ignore the padding tokens in the sequence, because they are only there due to variable sequence lengths. Let me know what you think.