alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

save embedding of unsupervised graphsage #276

Closed · Song-xx closed this 1 year ago

Song-xx commented 1 year ago

I added a method called train_and_save in trainer.py, which can train the unsupervised GraphSage model and save all node embeddings after training.

  1. Changes:
     (1) Added the save_node_embedding_bigdata method in trainer.py, which saves embeddings in several blocks; the maximum number of lines per block is limited by the parameter 'block_max_lines' (default 10,000,000). When batch_size >= block_max_lines, it falls back to the original save_node_embedding function.
     (2) Saving embeddings with save_node_embedding raised 'TypeError: TextIOWrapper.write takes no keyword arguments'. I traced the error to using .write on a list of tuples, so I switched to .writelines with a list of strings instead (see the sketch after the usage example below).
     (3) Enriched the train_unsupervised.py file with the usage of save_node_embedding_bigdata (the code is commented out to avoid affecting the original example, but I believe it is very useful for beginners).

  2. Usage of the save_node_embedding_bigdata method is as follows:

    
```python
## Assumes the usual train_unsupervised.py imports are in place,
## e.g. `import graphlearn.python.nn.tf as tfg` (an assumption about your setup).

## Add the required functions in your `train.py` file.
def meta_path_sample(ego, node_type, edge_type, ego_name, nbrs_num, sampler):
  """Creates the meta-path sampler of the input ego.

  Args:
    ego: A query object, the input centric nodes/edges.
    node_type: A string, the type of `ego`, such as 'paper' or 'user'.
    edge_type: A string, the type of edge to traverse at each hop.
    ego_name: A string, the name of `ego`.
    nbrs_num: A list, the number of neighbors for each hop.
    sampler: A string, the strategy of neighbor sampling.
  """
  hops = range(len(nbrs_num))
  meta_path = ['outV' for i in hops]
  alias_list = [ego_name + '_hop_' + str(i + 1) for i in hops]
  meta_path_string = ""
  for path, nbr_count, alias in zip(meta_path, nbrs_num, alias_list):
    meta_path_string += path + '(' + edge_type + ').'
    ego = getattr(ego, path)(edge_type).sample(nbr_count).by(sampler).alias(alias)
  print("Sampling meta path for {} is {}.".format(node_type, meta_path_string))
  return ego

def node_embedding(graph, model, node_type, edge_type, **kwargs):
  """Builds the query for saving node embeddings.

  Args:
    node_type: A string, such as 'paper' or 'user'.
    edge_type: A string, the edge type used for neighbor sampling.
  Returns:
    iterator, ids, embedding.
  """
  tfg.conf.training = False
  ego_name = 'save_node_' + node_type
  seed = graph.V(node_type).batch(kwargs.get('batch_size', 64)).alias(ego_name)
  query_save = meta_path_sample(
      seed, node_type, edge_type, ego_name,
      kwargs.get('nbrs_num', [10, 5]),
      kwargs.get('sampler', 'random_without_replacement')).values()
  dataset = tfg.Dataset(query_save, window=kwargs.get('window', 10))
  ego_graph = dataset.get_egograph(ego_name)
  emb = model.forward(ego_graph)
  return dataset.iterator, ego_graph.src.ids, emb

## Define hyperparameters.
node_type = '...'
edge_type = '...'
nbr_num = '...'

## Define what to save before starting the training.
save_iter, save_ids, save_emb = node_embedding(
    graph=lg, model=model, node_type=node_type, edge_type=edge_type,
    nbrs_num=nbr_num)

## Start the training and saving process.
trainer = LocalTrainer(save_checkpoint_steps=5, progress_steps=10)
save_file = './saved_emb.txt'
trainer.save_node_embedding_bigdata(
    save_iter=save_iter, save_ids=save_ids, save_emb=save_emb,
    save_file=save_file, block_max_lines=100000, batch_size=batch_size)
```
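For context on fix (2) above, here is a minimal, self-contained sketch of the `.writelines` change; the tab-and-comma line format is my own assumption for illustration, not necessarily what trainer.py emits:

```python
# Hypothetical ids and embeddings for three nodes.
ids = [1, 2, 3]
embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

with open('emb.txt', 'w') as f:
    # f.write() accepts a single str, so handing it a list of (id, emb)
    # tuples fails with a TypeError. Build one string per node and pass
    # the whole list to writelines() instead.
    lines = ['{}\t{}\n'.format(i, ','.join(str(x) for x in e))
             for i, e in zip(ids, embs)]
    f.writelines(lines)
```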


Notes:
(1) If batch_size >= block_max_lines, the original save_node_embedding function is used and the embeddings are written to a single file with the original name. If batch_size < block_max_lines, a suffix is automatically appended to the original name: for example, with an original file name of emb.txt, the resulting files will be named emb_0.txt, emb_1.txt, and so on. A rough sketch of this behavior follows.
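The sketch below only illustrates the block-splitting and naming rule described in note (1), under the assumption that embeddings arrive batch by batch from the iterator; `save_in_blocks` and `emb_batches` are hypothetical names, not the actual trainer.py code:

```python
import os

def save_in_blocks(emb_batches, save_file, block_max_lines=10000000):
    """Writes one 'id<TAB>embedding' line per node, with at most
    block_max_lines lines per file: emb.txt -> emb_0.txt, emb_1.txt, ...
    """
    prefix, ext = os.path.splitext(save_file)
    block_idx, lines_in_block = 0, 0
    f = open('{}_{}{}'.format(prefix, block_idx, ext), 'w')
    for ids, embs in emb_batches:  # one (ids, embeddings) pair per batch
        for i, e in zip(ids, embs):
            if lines_in_block >= block_max_lines:
                # Current block is full: close it and open the next one.
                f.close()
                block_idx += 1
                lines_in_block = 0
                f = open('{}_{}{}'.format(prefix, block_idx, ext), 'w')
            f.write('{}\t{}\n'.format(i, ','.join(str(x) for x in e)))
            lines_in_block += 1
    f.close()
```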