1049451037 / GCN-Align

Code of the paper: Cross-lingual Knowledge Graph Alignment via Graph Convolutional Networks.

Pytorch implementation of GCN-Align #4

Closed DexterZeng closed 5 years ago

DexterZeng commented 5 years ago

Hi, thank you for providing the source code! It has been of great help! Due to the requirements of my project, I would like to implement GCN-Align in PyTorch. Since the original GCN paper has both TensorFlow and PyTorch implementations, I tried to convert GCN-Align to PyTorch by following the differences between the GCN TensorFlow and GCN PyTorch versions. Nevertheless, the performance of my PyTorch implementation (cf. https://github.com/DexterZeng/GCN-Align-pytorch) is surprisingly low (3% Hits@1 on zh-en). I guess there are some points that I have missed, but I cannot identify them. I wonder whether you have already implemented a PyTorch version, or would be willing to take a look at the code (it is quite easy to read, since it is basically a combination of GCN PyTorch and this repo). I would really appreciate it!

Some notes if you are willing to take a look:

  1. I only implemented GCN-SE and neglected the attribute information for now.
  2. I did not change layers.py, and I modified models.py a little (compared with GCN PyTorch) by removing the log_softmax on the output of the last layer (see the sketch after this list). Nevertheless, I suspect the GCN model code is where the problem lies. Have you changed layers.py and models.py (compared with GCN TensorFlow)?
  3. For the rest of the code, I basically copied the code in this repo and adapted it to PyTorch style.
  4. I basically followed the parameters in this repo except for the learning rate: using 25 led to loss explosion, so I changed it to 0.5.
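
For reference, this is roughly how the model looks in my port after removing log_softmax. The layer class is the unchanged one from the GCN PyTorch repo's layers.py; the class name and dimensions here are illustrative, not the actual file:

```python
import torch.nn as nn
import torch.nn.functional as F

from layers import GraphConvolution  # unchanged layer from the GCN PyTorch repo


class GCNAlign(nn.Module):
    """Rough sketch of my models.py change; names and dims are illustrative."""

    def __init__(self, in_dim, hid_dim, out_dim, dropout=0.0):
        super().__init__()
        self.gc1 = GraphConvolution(in_dim, hid_dim)
        self.gc2 = GraphConvolution(hid_dim, out_dim)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        # Return raw structure embeddings; no log_softmax here, since the
        # alignment loss works on embedding distances, not class labels.
        return self.gc2(x, adj)
```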

Thank you very much for your time!

1049451037 commented 5 years ago

Hi Weixin. Thank you for your kind words. The layers are modified a little compared with the original GCN. Concretely, the change is in the initialization of the weight matrices (Ws). For the input layer, we initialize W from a normalized normal distribution, the same as in JAPE. For the hidden layers, we simply treat W as a constant identity matrix to avoid overfitting.
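In PyTorch terms (I mainly work in TensorFlow, so treat this as a rough sketch rather than exact code; the exact normalization is an assumption), the layer would look something like:

```python
import math
import torch
import torch.nn as nn


class GraphConvolution(nn.Module):
    """Sketch: trainable, normalized-normal W for the input layer;
    a fixed identity W for hidden layers."""

    def __init__(self, in_features, out_features, is_input_layer=False):
        super().__init__()
        if is_input_layer:
            # Normal init scaled by 1/sqrt(dim), roughly the JAPE-style
            # "normalized" initialization (exact scaling is an assumption).
            w = torch.randn(in_features, out_features) / math.sqrt(out_features)
            self.weight = nn.Parameter(w)
        else:
            assert in_features == out_features
            # Constant identity matrix registered as a buffer, so it is
            # never updated by the optimizer.
            self.register_buffer('weight', torch.eye(in_features))

    def forward(self, x, adj):
        support = torch.mm(x, self.weight)
        return torch.spmm(adj, support)  # adj is the normalized sparse adjacency
```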

Hope it helps. If there is any problem, feel free to comment!

DexterZeng commented 5 years ago

Hi, thank you very much for your detailed explanation! Indeed, the performance rises to 18% Hits@1 on zh-en after setting W in the hidden layers to a constant identity matrix. In that case, it seems that all a GCN layer does is multiply the adjacency matrix with the input matrix (the multiplication by W can be omitted, since W is a constant identity matrix that is never updated and has no influence on the result). Is it right to think of it this way?
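If that reading is right, the hidden layer's forward effectively collapses to a single sparse multiplication, something like:

```python
import torch


def hidden_layer_forward(x, adj):
    # With W fixed to the identity, x @ W == x, so the hidden layer just
    # propagates features over the normalized sparse adjacency matrix.
    return torch.spmm(adj, x)
```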

As for the input layer, I tried initializing W with a normal distribution and with a normalized normal distribution, and the results seem to be about the same (initializing with a uniform distribution leads to a huge performance drop).

That being said, the performance is still far from the reported results. I followed the parameter settings in this repo (except for the learning rate; using 25 leads to much worse results). Do you have any idea which point(s) I am still missing?

P.S. I print the learned structural embedding matrix after each epoch, and most of its elements become 0 after several epochs. The matrix also changes very little after several rounds. I guess this might be a bit weird?

Again, many thanks for your help!

1049451037 commented 5 years ago

I am not familiar with PyTorch. I wonder whether you need to make sure the sparse adjacency matrix receives no gradient, because the sparse matrix in the TensorFlow implementation is constant.

DexterZeng commented 5 years ago

Hi, thanks for the advice. Actually, the sparse matrix in the PyTorch implementation seems to be constant, too. I have tried tuning the parameters, but it still cannot reach the results produced by TensorFlow (a 10% gap, actually). I guess there might be some small trick I'm still missing.
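For reference, this is roughly how the adjacency is built in my port (the indices and values below are a toy illustration, not the actual preprocessing code); the resulting tensor is a leaf created without requires_grad, so it should already be constant:

```python
import torch

# Toy example of the sparse adjacency construction.
indices = torch.tensor([[0, 1, 1], [1, 0, 2]])
values = torch.tensor([1.0, 0.5, 0.5])
adj = torch.sparse_coo_tensor(indices, values, size=(3, 3)).coalesce()

assert adj.requires_grad is False  # no gradient flows into the adjacency
```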

Other than that, I have two questions which might help me better understand the codes:

  1. I wonder why W is treated as a constant identity matrix (to avoid overfitting). Specifically, why does the overfitting issue arise in the EA task but not in the tasks studied in the original GCN paper (is it because of the scale of the data)?

  2. Moreover, keeping W constant also seems to limit the extension to more advanced models, say, GAT. Suppose I want to use GAT; should I also keep some parameters in that model constant to prevent overfitting?

Thank you very much for your time.

1049451037 commented 5 years ago

For question 1, I think it is indeed because of the scale of the data. More specifically, it is because of the size of the training data: for the EA task, the amount of training data is far smaller than what the model needs. I believe that if sufficient seed alignments existed, no overfitting would happen.

For question 2, we are still working on it. We have tried GAT-Align, which simply replaces GCN with GAT. However, the result is not as good as GCN's; some other tricks may be needed. There must be other researchers (some of my friends) also working on it, but I can't share more, because it would not be right to share others' unpublished ideas.

DexterZeng commented 5 years ago

Hi, thanks for the detailed explanation! I see your point. I will close this issue for now, since I still cannot improve the performance of the PyTorch implementation.

Anyway, many thanks for your help!