THUDM / GATNE

Source code and dataset for KDD 2019 paper "Representation Learning for Attributed Multiplex Heterogeneous Network"
MIT License

Question on meta-path random walking #34

Closed Refrainlhy closed 4 years ago

Refrainlhy commented 4 years ago

hi,

I see that you first use the meta-path to sample nodes and get base_walks, and then use base_walks to generate the vocabulary. After that, you generate all_walks for each edge type and keep only those nodes in all_walks that exist in the vocabulary.

I have two questions about this. (1) Why is it split into two steps? That is, why not use only the meta-path sampling, without generating all_walks? (2) If the vocabulary is generated from base_walks, some nodes may be missed, and we will then not get embeddings for those nodes. Why use base_walks instead of all_walks to generate the vocabulary?

Could you give me some hints? Thank you very much!

cenyk1230 commented 4 years ago

Hi @Refrainlhy,

The base walks are generated from the whole graph, which contains all the nodes and edges (see the function load_training_data in utils.py). Intuitively, the vocabulary computed from the base walks should contain all the nodes of the whole graph, so I am not sure why the vocabulary would miss some nodes.

Refrainlhy commented 4 years ago

Thank you for your reply!

I ask because the vocab is generated from base_walks, and base_walks are generated from network_data['Base'] (base_walks, all_walks = generate_walks(network_data)). In the generate_walks function, base_walks are produced by base_walker.simulate_walks(args.num_walks, args.walk_length, schema=args.schema), and simulate_walks generates the walks by random walking. Since the walks are random, I think base_walks may miss some nodes, and then the vocab may miss them too. Is that right?

def train_model(network_data, feature_dic, log_name):
    base_walks, all_walks = generate_walks(network_data)
    vocab, index2word = generate_vocab([base_walks])

def generate_walks(network_data):
    base_network = network_data['Base']
    if args.schema is not None:
        node_type = load_node_type(file_name + '/node_type.txt')
    else:
        node_type = None

    base_walker = RWGraph(get_G_from_edges(base_network), node_type=node_type)
    base_walks = base_walker.simulate_walks(args.num_walks, args.walk_length, schema=args.schema)

    all_walks = []
    for layer_id in network_data:
        if layer_id == 'Base':
            continue

        tmp_data = network_data[layer_id]
        # start to do the random walk on a layer

        layer_walker = RWGraph(get_G_from_edges(tmp_data))
        layer_walks = layer_walker.simulate_walks(args.num_walks, args.walk_length)

        all_walks.append(layer_walks)

    print('Finish generating the walks')

    return base_walks, all_walks

Refrainlhy commented 4 years ago

hi~

cenyk1230 commented 4 years ago

Hi @Refrainlhy,

Sorry for the late reply. In simulate_walks, we generate random walks starting from each node if possible.
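
To illustrate the point, here is a minimal sketch of a walk generator that starts walks from every node. It is not the repository's actual simulate_walks: the helper names (build_adjacency, simulate_walks_sketch, vocab_from_walks) and the toy edge list are hypothetical, and the schema argument is ignored. Under these assumptions, every node that appears in at least one edge starts a walk, so it always ends up in the vocabulary built from those walks.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs (hypothetical helper)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def simulate_walks_sketch(adj, num_walks, walk_length, seed=0):
    """Start num_walks uniform random walks from every node in the graph.

    This is an illustrative stand-in, not GATNE's simulate_walks; it does
    not handle meta-path schemas.
    """
    rng = random.Random(seed)
    nodes = list(adj.keys())
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


def vocab_from_walks(walks):
    """Collect the set of nodes that occur in any walk."""
    return {node for walk in walks for node in walk}


# Toy graph with two disconnected components.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
adj = build_adjacency(edges)
walks = simulate_walks_sketch(adj, num_walks=2, walk_length=4)

# Nodes of the graph that never appear in any walk.
missing = set(adj) - vocab_from_walks(walks)
print(missing)  # set()
```

Because each walk begins with its start node, coverage does not depend on where the random steps happen to go. A node could still be absent if it cannot start a walk at all, for example if it has no edges in the base network, which this sketch does not model.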