Open liu-jc opened 4 years ago
It may be surprising that DistMult ends up so large even though it scales linearly. The issue is that knowledge graphs can be large and DistMult's parameter count scales with the size of the knowledge graph, while the convolution and the projection matrix in ConvE scale independently of it.
If you run the models, the parameter count is printed, but let me recalculate it by hand for some numbers in the paper to convince you of this claim. In the paper, I claim that an embedding size of 128 for DistMult and 96 for ConvE is roughly equivalent in parameters for FB15k-237 (14541 entities and 237 relationships):
DistMult: (14541+237)*128 = 1891584 ~ 1.89M
ConvE: (14541+237)*96 + 3*3*8 + 4224*96 = 1824264 ~ 1.82M
For ConvE I did not include the bias terms and I used a 2D embedding of size 12x8 which is stacked to 12x16 via [e1;rel]. 3*3*8 are the convolution parameters and 4224*96 the output projection parameters. Note that the output matrix is just the transpose of the entity embedding matrix and does not add any new parameters.
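The hand calculation above can be reproduced with a short script. This is only a sketch of the arithmetic as stated in this reply, not the counting code from the repository: the function names `distmult_params` and `conve_params` are hypothetical, bias terms are left out as in the reply, and the values 3*3*8 and 4224 are taken verbatim from the text.

```python
# Dataset sizes quoted in the thread for FB15k-237.
NUM_ENTITIES = 14541
NUM_RELATIONS = 237


def distmult_params(dim):
    """Embedding parameters only: one vector per entity and per relation."""
    return (NUM_ENTITIES + NUM_RELATIONS) * dim


def conve_params(dim, conv_weights=3 * 3 * 8, hidden_size=4224):
    """Embeddings + convolution weights + output projection (no biases).

    conv_weights and hidden_size are the figures quoted in the reply;
    they are not derived here.
    """
    embeddings = (NUM_ENTITIES + NUM_RELATIONS) * dim
    projection = hidden_size * dim
    return embeddings + conv_weights + projection


distmult_total = distmult_params(128)
conve_total = conve_params(96)

print(f"DistMult (dim=128): {distmult_total} ~ {distmult_total / 1e6:.2f}M")
print(f"ConvE    (dim=96):  {conve_total} ~ {conve_total / 1e6:.2f}M")
```

Changing `dim` in either call shows how the embedding term dominates both totals, which is why the size of the knowledge graph matters more than the convolution weights.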
Really appreciate your reply! I got it. So the point is that ConvE can achieve performance similar to DistMult with fewer parameters, right?
I still have a question about the 2D conv. When a 3x3 filter is applied to the left part of the 12x16 matrix, like [:,:3], it cannot model the interaction between e_s and r_r, as the data here only contains the information of e_s. So the 3x3 filter only works for modelling the interaction around the intersection of the concatenation. Is that correct? Can you provide some insights about this? Thanks!
Yes, that is correct! I also tried an alternating checkerboard pattern between both embeddings, which takes the idea to the extreme, but this did not help more than just concatenating. My intuition is that sometimes you just want to model an entity or relationship on its own, meaning you want to model information that is relationship/entity independent. Having a separate region for this could help with modeling this kind of information.
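To make the point about the filter positions concrete, here is a small sketch that counts which 3x3 windows over the 12x16 stacked matrix see both embeddings. It assumes the layout implied by the question (columns 0-7 come from e_s, columns 8-15 from r_r) and a stride of 1 with no padding; the stride and padding are assumptions, not something stated in this thread. All names in the code are hypothetical.

```python
HEIGHT, WIDTH = 12, 16   # stacked input [e1;rel], as described in the thread
BOUNDARY = 8             # assumed: columns 0..7 are e_s, columns 8..15 are r_r
KERNEL = 3               # 3x3 filter; stride 1 and no padding are assumed


def window_sources(col_start):
    """Return which embeddings a window starting at col_start overlaps."""
    sources = set()
    for col in range(col_start, col_start + KERNEL):
        sources.add("e_s" if col < BOUNDARY else "r_r")
    return sources


col_starts = range(WIDTH - KERNEL + 1)    # valid horizontal positions
row_positions = HEIGHT - KERNEL + 1       # valid vertical positions

entity_only = [c for c in col_starts if window_sources(c) == {"e_s"}]
relation_only = [c for c in col_starts if window_sources(c) == {"r_r"}]
mixed = [c for c in col_starts if len(window_sources(c)) == 2]

print("entity-only column starts:  ", entity_only)
print("relation-only column starts:", relation_only)
print("mixed column starts:        ", mixed)
print("mixed windows in total:     ", len(mixed) * row_positions,
      "of", len(col_starts) * row_positions)
```

Under these assumptions only the windows that straddle the concatenation boundary combine e_s and r_r inside the convolution itself; the remaining windows cover a single embedding, which matches the separate-region intuition described above.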
In your paper, you claim ConvE uses fewer parameters than DistMult. But I think in your code DistMult only uses O(num_entities*embedding_dim + num_rels*embedding_dim) parameters, and ConvE uses more. I am a bit confused by your claim and am afraid I missed something. Can you point out how to verify this claim? Thanks!