facebookresearch / PyTorch-BigGraph

Generate embeddings from large-scale graph-structured data.
https://torchbiggraph.readthedocs.io/

Can we extract/output embeddings of relations during/after the training? (Knowledge Graph) #146

Open kunwuz opened 4 years ago

kunwuz commented 4 years ago

It's important for some knowledge-graph research to obtain the relation embeddings. For example, in some cases we need to start from pre-trained relation embeddings instead of learning them from scratch. However, I haven't found this in the documentation or code yet. Could you please give me some instructions on how to explicitly obtain the relation embeddings?

dany-nonstop commented 4 years ago

My understanding is that the model and embeddings are saved in the `model` folder by default and updated every epoch. In theory you can replace those files and coax the algorithm into picking up from there to continue training. I believe the relevant code is covered in the downstream-tasks part of the documentation.
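For reference, PBG checkpoints are plain HDF5 files, so they can be inspected (or overwritten) directly. A minimal sketch, assuming the fb15k demo's default layout, where embeddings live in `embeddings_<entity_type>_<partition>.v<version>.h5` and everything else in `model.v<version>.h5` (the entity type `all`, partition `0`, and version `50` here are assumptions):

```python
# A sketch of reading entity embeddings straight out of a PBG checkpoint.
import h5py

with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
    entity_embeddings = hf["embeddings"][...]  # shape: (num_entities, dim)

print(entity_embeddings.shape)
```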

kunwuz commented 4 years ago

> My understanding is that the model and embeddings are saved in the `model` folder by default and updated every epoch. In theory you can replace those files and coax the algorithm into picking up from there to continue training. I believe the relevant code is covered in the downstream-tasks part of the documentation.

Thanks for your timely reply! If my understanding of the documentation is correct, are the relation embeddings stored in the operator's state dict?

```python
# Load the operator's state dict (ComplEx stores real and imaginary parts).
import h5py
import torch

from torchbiggraph.model import ComplexDiagonalDynamicOperator, DotComparator

with h5py.File("model/fb15k/model.v50.h5", "r") as hf:
    operator_state_dict = {
        "real": torch.from_numpy(hf["model/relations/0/operator/rhs/real"][...]),
        "imag": torch.from_numpy(hf["model/relations/0/operator/rhs/imag"][...]),
    }
operator = ComplexDiagonalDynamicOperator(400, dynamic_rel_count)  # dynamic_rel_count = number of relation types
operator.load_state_dict(operator_state_dict)
comparator = DotComparator()
```

If so, is there any way to set initial embeddings for those relations, similar to what PBG does for 'featurized entities'?
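For later readers: once the operator and comparator above are loaded, the downstream-tasks documentation shows how to score a single (src, rel, dest) triple with them. A sketch along those lines, continuing from the snippet above (the offsets and relation index are hypothetical placeholders you would look up from the checkpoint's name maps):

```python
# A sketch of scoring one triple; reuses h5py, torch, operator and comparator
# from the snippet above. src_offset, dest_offset and rel_index are made up.
src_offset, dest_offset, rel_index = 0, 1, 0

with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
    src_embedding = torch.from_numpy(hf["embeddings"][src_offset, :])
    dest_embedding = torch.from_numpy(hf["embeddings"][dest_offset, :])

scores, _, _ = comparator(
    comparator.prepare(src_embedding.view(1, 1, 400)),
    comparator.prepare(
        operator(dest_embedding.view(1, 400), torch.tensor([rel_index])).view(1, 1, 400)
    ),
    torch.empty(1, 0, 400),  # no left-hand-side negatives
    torch.empty(1, 0, 400),  # no right-hand-side negatives
)
```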

adamlerer commented 4 years ago

Hi @kunwuz. @dany-nonstop is correct about where the relation embeddings "live" in checkpoints. We didn't design PBG to make it easy to set initial relation embeddings, and I'm not exactly sure what the use case for this is. If you just want the simplest (hacky) change to accomplish this, I would suggest adding a couple of lines here:

https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/train_cpu.py#L440

e.g.

```python
with torch.no_grad():  # in-place copy into trained parameters must bypass autograd
    for relation_idx, operator in enumerate(model.rhs_operators):  # maybe lhs_operators too, depending on what you're running
        relation_params = list(operator.parameters())  # one or two tensors, depending on the operator type
        relation_params[0].copy_(initial_relation_params[relation_idx])
```
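To make this concrete, `initial_relation_params` is something you would build yourself before training starts. A minimal sketch, assuming the pre-trained embeddings sit in a hypothetical `relations.npy` with one row per relation type (file name and shape are illustrative only):

```python
# A sketch of building initial_relation_params from a hypothetical .npy file
# of shape (num_relation_types, dim); all names here are made up.
import numpy as np
import torch

pretrained = np.load("relations.npy")
initial_relation_params = [torch.from_numpy(row) for row in pretrained]
```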

Details about the names of all the subfields inside the model checkpoint can be found here:

https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/model.py#L775
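If you'd rather discover those names empirically, you can walk the checkpoint itself. A small sketch using h5py (the path again assumes the fb15k demo):

```python
# Print every group/dataset stored in a model checkpoint, with its shape.
import h5py

with h5py.File("model/fb15k/model.v50.h5", "r") as hf:
    hf.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```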

P.S. This is not what "featurized entities" do. A featurized entity is represented as the average of the embeddings of a list of "features" describing it: e.g., you could have an embedding for each word, and then represent a "wikipedia page" entity as the average of the embeddings of all the words on that page.
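In other words, a toy illustration of the averaging idea (all names and sizes below are made up):

```python
# A toy sketch of the featurized-entity idea: the entity embedding is the
# mean of the embeddings of its features. Everything here is hypothetical.
import torch

word_embeddings = torch.randn(10_000, 400)  # one embedding per word (feature)
page_word_ids = torch.tensor([3, 17, 42])   # words appearing on one wikipedia page
page_embedding = word_embeddings[page_word_ids].mean(dim=0)  # the page's embedding
print(page_embedding.shape)  # torch.Size([400])
```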