IBM / Grapher

Code that implements efficient knowledge graph extraction from textual descriptions
Apache License 2.0

bug(loss not ignoring padding token in sequence) #14

Closed zachares closed 1 year ago

zachares commented 1 year ago

I think the loss on this [line](https://github.com/IBM/Grapher/blob/main/misc/utils.py#L19) should ignore the padding tokens in the sequence, since they are only there to handle variable sequence lengths. Let me know what you think.

imelnyk commented 1 year ago

Hi, you are right. Ideally, there should be a mask that removes the effect of padding tokens in the loss. However, we experimented with both options (masking and no masking) and found no significant difference in performance. This version of the code uses no masking to keep the code less cluttered.
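
For reference, a minimal sketch of what such masking typically looks like with a standard PyTorch cross-entropy loss; the function name and tensor shapes are illustrative, not the exact code in `misc/utils.py`:

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits: torch.Tensor, labels: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Cross-entropy that skips padding positions.

    logits: (batch, seq_len, vocab_size), labels: (batch, seq_len).
    Positions whose label equals pad_token_id contribute nothing to the loss or its gradient.
    """
    labels = labels.masked_fill(labels == pad_token_id, -100)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # -100 is the conventional "ignore" label id in PyTorch / Hugging Face
    )
```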

zachares commented 1 year ago

Thank you for the information; that is useful to know and an interesting result! I expect I will have more questions about the code and the paper. Do you think GitHub issues are the best place to post them?

zachares commented 1 year ago

Two other questions I have are:

  1. In your models you do not expand the vocabulary size of the language model to incorporate prediction of the new tokens you add to the tokenizer. Is this intentional?

  2. Have you ever had to clip gradients in order to make model training stable?

imelnyk commented 1 year ago

Hi, great questions!

  1. Yes, this was intentional. The reason was to keep the pre-trained T5 model as is. However, it might be a good experiment to check how the model would perform if you add <__no_node__>, etc., to the vocabulary and fine-tune, for example, on WebNLG (see the first sketch after this list). I suspect the difference would be small, but I might be wrong.

  2. No, I have not clipped any gradients, but since the code is based on PyTorch Lightning, you can just add `--gradient_clip_val <value>` and see how it affects training (see the second sketch after this list).
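
On question 1, a minimal sketch of what that experiment could look like with the Hugging Face `transformers` API; the token strings and checkpoint name are illustrative, not necessarily the ones this repo uses:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Register the graph-specific markers as additional tokens
# (illustrative names; check the repo's tokenizer setup for the real ones).
tokenizer.add_tokens(["<__no_node__>", "<__no_edge__>"], special_tokens=True)

# Grow the embedding matrix (and the tied output head) so the new ids get rows;
# the new rows are randomly initialised and would be learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```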
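
On question 2, the same setting can also be passed directly to the Lightning `Trainer` if you construct it in code rather than through the CLI flag; the value below is just an example starting point:

```python
import pytorch_lightning as pl

# Clips gradient norms to 1.0 before each optimiser step;
# equivalent to passing --gradient_clip_val 1.0 on the command line.
trainer = pl.Trainer(gradient_clip_val=1.0)
```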

zachares commented 1 year ago

Thank you for all the information!