TimDettmers / ConvE

Convolutional 2D Knowledge Graph Embeddings resources

Parameter Efficiency of ConvE #64

Open liu-jc opened 4 years ago

liu-jc commented 4 years ago

In your paper, you claim ConvE uses fewer parameters than DistMult. But as far as I can tell from your code, DistMult only uses O(num_entities * embedding_dim + num_rels * embedding_dim) parameters, while ConvE uses more. I am a bit confused about this claim; I am afraid I missed something. Can you point out how to verify it? Thanks!

TimDettmers commented 4 years ago

It is a bit surprising that DistMult is so large even though it scales linearly, but the issue is that knowledge graphs can be large. DistMult's parameter count grows with the size of the knowledge graph, while the convolution filters and the projection matrix in ConvE scale independently of it.

If you run the models, the parameter size is printed, but let me recalculate it by hand for some numbers in the paper to convince you of this claim. In the paper, I claim an embedding size of 128 for DistMult and 96 for ConvE is roughly equivalent in parameters for FB15k-237 (14541 entities and 237 relationships):

DistMult: (14541+237)*128 = 1891584 ~ 1.89M
ConvE: (14541+237)*96 + 3*3*8 + 4224*96 = 1824264 ~ 1.82M

For ConvE I did not include the bias terms, and I used a 2D embedding of size 12x8 which is stacked to 12x16 via [e1; rel]. The 3*3*8 term is the convolution filter parameters and 4224*96 is the output projection parameters. Note that the output matrix is just the transpose of the entity embedding matrix and does not add any new parameters.
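For reference, here is a minimal back-of-the-envelope sketch of the same calculation in Python. It is just the arithmetic above written out; the 8 filters and the 4224-dimensional flattened feature map are taken as given from the numbers I quoted, not read from the code, and biases and batch-norm parameters are ignored:

```python
# Rough parameter-count sketch for the numbers above (biases excluded).
num_entities, num_rels = 14541, 237   # FB15k-237

# DistMult: one embedding vector per entity and per relation.
distmult_dim = 128
distmult_params = (num_entities + num_rels) * distmult_dim

# ConvE: embeddings + 3x3 convolution filters + projection back to embedding size.
conve_dim = 96
conv_filter_params = 3 * 3 * 8        # 8 filters of size 3x3, as stated above
flattened_features = 4224             # size of the flattened conv output, as stated above
projection_params = flattened_features * conve_dim
conve_params = (num_entities + num_rels) * conve_dim + conv_filter_params + projection_params

print(distmult_params)  # 1891584 ~ 1.89M
print(conve_params)     # 1824264 ~ 1.82M
```

If you want to check against the actual models, summing `p.numel()` over `model.parameters()` in PyTorch should give numbers in the same ballpark, plus the bias and batch-norm terms left out here.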

liu-jc commented 4 years ago

Really appreciate your reply! I got it. So the point is that ConvE can use fewer parameters to achieve similar performance compared with DistMult, right? I still have a question about the 2D convolution. When a 3x3 filter is applied to the left part of the 12x16 matrix, like [:, :3], it cannot model the interaction between e_s and r_r, since the data there only contains information about e_s. So the 3x3 filter only models interactions around the concatenation boundary. Is that correct? Can you provide some insights about this? Thanks!

TimDettmers commented 4 years ago

Yes, that is correct! I also tried an alternating, checkerboard-like pattern between both embeddings, which takes this idea to the extreme, but it did not help more than just concatenating. My intuition is that sometimes you just want to model an entity or relationship on its own, that is, information that is relationship/entity independent. Having a separate region for each embedding could help with modeling this kind of information.
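To make the two layouts concrete, here is a small NumPy sketch. The 12x8 shapes follow the example above; the row-interleaved "checkerboard" variant is just one way such an alternating pattern could be built, not necessarily the exact layout that was tried:

```python
import numpy as np

h, w = 12, 8                      # 2D shape of each embedding, as in the example above
e1 = np.zeros((h, w))             # mark entity cells with 0
rel = np.ones((h, w))             # mark relation cells with 1

# Simple concatenation: [e1; rel] -> 12x16.
# Only 3x3 windows that straddle the boundary between columns 7 and 8
# see cells from both embeddings.
stacked = np.concatenate([e1, rel], axis=1)

# One possible alternating layout: interleave rows of e1 and rel -> 24x8.
# Here every 3x3 window covers rows from both embeddings.
checker = np.empty((2 * h, w))
checker[0::2] = e1
checker[1::2] = rel

def mixed_windows(img, k=3):
    """Count k x k windows that contain cells from both embeddings."""
    H, W = img.shape
    count = 0
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = img[i:i + k, j:j + k]
            if patch.min() == 0 and patch.max() == 1:
                count += 1
    return count

print(mixed_windows(stacked))   # 20 of 140 windows mix entity and relation cells
print(mixed_windows(checker))   # 132 of 132 windows mix entity and relation cells
```

With plain concatenation most filter positions see only one embedding, which matches the intuition above that such regions model relationship/entity-independent information, while interaction is captured only near the boundary.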