Can't seem to get embeddings from ConvKB model

ggerogiokas commented 4 years ago

Description

Getting errors when I try to get embeddings from ConvKB models

Actual Behavior

----> 1 model.get_embeddings( unique_nodes[4] )

/data/anaconda/envs/py36/lib/python3.6/site-packages/ampligraph/latent_features/models/EmbeddingModel.py in get_embeddings(self, entities, embedding_type)
    412 
    413         if embedding_type == 'entity':
--> 414             emb_list = self.trained_model_params[0]
    415             lookup_dict = self.ent_to_idx
    416         elif embedding_type == 'relation':

KeyError: 0

Expected Behavior

The embedding is returned.

Steps to Reproduce

Create ConvKB and try to get an embedding.

NicholasMcCarthy commented 4 years ago

Hi @ggerogiokas,

Is your unique_nodes[4] an entity that's already been converted to internal IDs (guessing from the KeyError: 0 message)?

The get_embeddings() function expects the original string literals.
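A minimal sketch of why the distinction matters, assuming the lookup is a plain dictionary from original labels to row indices. The names `ent_to_idx` and `lookup` and the example labels are illustrative stand-ins, not AmpliGraph internals:

```python
# Hypothetical stand-in for a model's label-to-index mapping.
ent_to_idx = {"alice": 0, "bob": 1, "carol": 2}


def lookup(entities):
    """Map original string labels to internal row indices."""
    return [ent_to_idx.get(e) for e in entities]


# Original labels are keys of the mapping, so they resolve to indices.
print(lookup(["alice", "carol"]))

# Internal IDs are not keys of the mapping, so dict.get returns None for each.
print(lookup([0, 2]))
```

Passing IDs that have already been converted therefore fails at the lookup step rather than returning the matching rows.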

ggerogiokas commented 4 years ago

Hi @NicholasMcCarthy, I am passing the unique entities used in the original triples for model.train(). I have been able to retrieve embeddings for every other type of model with the same numpy array as training input, so I am a bit confused.

NicholasMcCarthy commented 4 years ago

Can you copy a snippet of your code here? ConvKB uses get_embeddings() inherited from the abstract EmbeddingModel class, so there shouldn't be anything different about it.

ggerogiokas commented 4 years ago

It also cannot get the embedding for the relation type, even though there is only one relation type.

It might have something to do with the way I copy the model_class:

models = [ ComplEx, DistMult, TransE, HolE ] # ComplEx, DistMult, TransE, HolE, ConvKB
model_names = [ 'ComplEx', 'DistMult', 'TransE', 'HolE' ] # 'ComplEx', 'DistMult', 'TransE', 'HolE', 'ConvKB' 

for i, model in enumerate(models):
    model_name = name_stem + '_%s' % model_names[i]
    main(X, model_name, params, model)

def main(X, name_stem, params, model_class):
    ''' X should be a numpy array of shape (n,3) '''

    model = model_class(batches_count=params['batches_count'], seed=params['seed'], epochs=params['epochs'], k=params['k'], eta=params['eta'],
                    # Use adam optimizer with learning rate 1e-3
                    optimizer=params['optimizer'], optimizer_params=params['optimizer_params'],
                    # Use pairwise loss with margin 0.5
                    loss=params['loss'], #loss_params={'margin':0.5},
                    # Use L2 regularizer with regularizer weight 1e-5
                    regularizer=params['regularizer'], regularizer_params=params['regularizer_params'],
                    # Enable stdout messages (set to false if you don't want to display)
                    verbose=params['verbose'])

    # For evaluation, we can use a filter which would be used to filter out
    # positives statements created by the corruption procedure.
    # Here we define the filter set by concatenating all the positives

    num_test_valid = int(.0001*len(X))

    X_train_valid, X_test = train_test_split_no_unseen(X, test_size=num_test_valid)
    X_train, X_valid = train_test_split_no_unseen(X_train_valid, test_size=num_test_valid)

    filter = np.concatenate((X_train, X_test))

    # Fit the model on training and validation set
    model.fit(X_train,
              early_stopping = True,
              early_stopping_params = \
                      {
                          'x_valid': X_valid,       # validation set
                          'criteria':'hits10',         # Uses hits10 criteria for early stopping
                          'burn_in': 1900,              # early stopping kicks in after 1900 epochs
                          'check_interval':50,         # validates every 50th epoch
                          'stop_interval':5,           # stops if 5 successive validation checks are bad.
                          'x_filter': filter,          # Use filter for filtering out positives
                          'corruption_entities':'all', # corrupt using all entities
                          'corrupt_side':'s+o'         # corrupt subject and object (but not at once)
                      }
              )

    ranks = evaluate_performance(X_test,
                                model=model,
                                filter_triples=filter,
                                use_default_protocol=True, # corrupt subj and obj separately while evaluating
                                verbose=True)

    # compute and print metrics:
    mrr = mrr_score(ranks)
    hits_50 = hits_at_n_score(ranks, n=50)
    hits_20 = hits_at_n_score(ranks, n=20)
    hits_10 = hits_at_n_score(ranks, n=10)

    print("MRR: %f, Hits@10: %f, Hits@50: %f" % (mrr, hits_10, hits_50))

    with open('score.log', 'a') as log:   
      log.write( "model_name: %s, MRR: %f, Hits@10: %f, Hits@50: %f\n" % (name_stem, mrr, hits_10, hits_50) )

    # Save the model
    example_name = "ampligraph_%s.pkl" % name_stem
    save_model(model, model_name_path = example_name)

    # save the embeddings
    unique_nodes = np.unique( X[:, [0,2]].flatten() )
    embeddings = model.get_embeddings( unique_nodes, embedding_type='entity')

    embedding_df = pd.DataFrame( embeddings, index=[unique_nodes])
    embedding_df.to_csv( '%s_embedding.csv' % name_stem )
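On the question of whether passing the model class around could be the cause: a small sketch, using hypothetical stand-in classes (`ModelA`, `ModelB`) and a hypothetical `build` helper rather than the real AmpliGraph models. Calling a class object creates a fresh instance each time, and `__name__` can supply the label without a parallel list of names:

```python
class ModelA:
    """Hypothetical stand-in for one model class."""

    def __init__(self, k):
        self.k = k


class ModelB:
    """Hypothetical stand-in for another model class."""

    def __init__(self, k):
        self.k = k


def build(model_class, params):
    # Calling the class object creates a new, independent instance.
    return model_class(**params)


params = {"k": 100}
for model_class in [ModelA, ModelB]:
    model = build(model_class, params)
    # The class name doubles as the label, so no second list is needed.
    model_name = "run_%s" % model_class.__name__
    print(model_name, type(model).__name__)
```

Under these assumptions, passing the class itself does not copy or share any state between runs.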

NicholasMcCarthy commented 4 years ago

Hi @ggerogiokas,

OK figured it out (it was my bug, sorry!)

ConvKB get_embeddings() does have to override the inherited function, as the internal save function is slightly different due to the extra parameters in the ConvKB model.

If you copy the following snippet into ConvKB.py it should fix the issue (it just changes the keys used to index trained_model_params).

    def get_embeddings(self, entities, embedding_type='entity'):
        """Get the embeddings of entities or relations.

        .. Note ::
            Use :meth:`ampligraph.utils.create_tensorboard_visualizations` to visualize the embeddings with TensorBoard.

        Parameters
        ----------
        entities : array-like, dtype=int, shape=[n]
            The entities (or relations) of interest. Elements of the vector must be the original string literals, and
            not internal IDs.
        embedding_type : string
            If 'entity', ``entities`` argument will be considered as a list of knowledge graph entities (i.e. nodes).
            If set to 'relation', they will be treated as relation types instead (i.e. predicates).

        Returns
        -------
        embeddings : ndarray, shape [n, k]
            An array of k-dimensional embeddings.

        """
        if not self.is_fitted:
            msg = 'Model has not been fitted.'
            logger.error(msg)
            raise RuntimeError(msg)

        if embedding_type == 'entity':
            emb_list = self.trained_model_params['ent_emb']
            lookup_dict = self.ent_to_idx
        elif embedding_type == 'relation':
            emb_list = self.trained_model_params['rel_emb']
            lookup_dict = self.rel_to_idx
        else:
            msg = 'Invalid entity type: {}'.format(embedding_type)
            logger.error(msg)
            raise ValueError(msg)

        idxs = np.vectorize(lookup_dict.get)(entities)
        return emb_list[idxs]

I'll open a separate ticket to fix the bug and it'll get folded into the next release. Thanks for bringing this to our attention!
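For readers who hit the same `KeyError: 0`, here is a simplified sketch of the failure mode. It assumes the base class reads its trained parameters by position while the subclass stores them under string keys. All class names and values below are hypothetical stand-ins, not the AmpliGraph classes:

```python
class BaseModel:
    """Stand-in for a base class that keeps trained parameters in a list."""

    def __init__(self):
        # [entity rows, relation rows]
        self.trained_model_params = [["e0", "e1"], ["r0"]]

    def get_entity_rows(self):
        # Positional access: fine while the parameters are a list.
        return self.trained_model_params[0]


class DictParamsModel(BaseModel):
    """Stand-in for a subclass that keeps parameters under string keys."""

    def __init__(self):
        super().__init__()
        self.trained_model_params = {"ent_emb": ["e0", "e1"], "rel_emb": ["r0"]}


class FixedModel(DictParamsModel):
    """Same storage, with the accessor overridden to match it."""

    def get_entity_rows(self):
        # Use the string key that this subclass actually stores under.
        return self.trained_model_params["ent_emb"]


try:
    DictParamsModel().get_entity_rows()
except KeyError as err:
    # The inherited accessor asks a dict for the key 0, which does not exist.
    print("KeyError:", err)

print(FixedModel().get_entity_rows())
```

The override changes only how the stored parameters are indexed, which is the same shape of fix as the snippet above.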

ggerogiokas commented 4 years ago

Thanks, @NicholasMcCarthy, I will try the above out!

ggerogiokas commented 4 years ago

All good getting embeddings! Tell me when you push, so I can install the development version.
