Closed · Refrainlhy closed this 4 years ago
Hi @Refrainlhy,
The base walks are generated from the whole graph, which contains all the nodes and edges (see the function `load_training_data` in `utils.py`). Intuitively, since the walks start from every node of the whole graph, the vocabulary computed from the base walks will contain all of its nodes. I am not sure why the vocabulary would miss any nodes.
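To illustrate the intuition, here is a minimal sketch of how a vocabulary can be built from the base walks (a hypothetical simplification — the actual `generate_vocab` in this repo may differ): every node that appears in any walk ends up in the vocabulary, so if the walks start from every node, no node is missing.

```python
from collections import Counter

def generate_vocab(all_walks):
    """Sketch: index every node that appears in the walks, most frequent first."""
    counts = Counter()
    for walks in all_walks:
        for walk in walks:
            counts.update(walk)
    index2word = [node for node, _ in counts.most_common()]
    vocab = {node: index for index, node in enumerate(index2word)}
    return vocab, index2word

# Walks that collectively start from every node of a 3-node graph
base_walks = [['a', 'b', 'c'], ['b', 'a'], ['c', 'b']]
vocab, index2word = generate_vocab([base_walks])
assert set(vocab) == {'a', 'b', 'c'}  # no node is missing
```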
Thank you for your reply!
I have this question because I found that the `vocab` is generated from `base_walks`, and `base_walks` are generated from `network_data['Base']` (`base_walks, all_walks = generate_walks(network_data)`). In the `generate_walks` function, we can see that `base_walks` are generated by `base_walker.simulate_walks(args.num_walks, args.walk_length, schema=args.schema)`, and I see that `simulate_walks` generates the walks with random walks. Hence, I think `base_walks` may miss some nodes, and then the `vocab` may miss some nodes?
```python
def train_model(network_data, feature_dic, log_name):
    base_walks, all_walks = generate_walks(network_data)
    vocab, index2word = generate_vocab([base_walks])
```
```python
def generate_walks(network_data):
    base_network = network_data['Base']
    if args.schema is not None:
        node_type = load_node_type(file_name + '/node_type.txt')
    else:
        node_type = None
    base_walker = RWGraph(get_G_from_edges(base_network), node_type=node_type)
    base_walks = base_walker.simulate_walks(args.num_walks, args.walk_length, schema=args.schema)

    all_walks = []
    for layer_id in network_data:
        if layer_id == 'Base':
            continue
        tmp_data = network_data[layer_id]
        # start to do the random walk on a layer
        layer_walker = RWGraph(get_G_from_edges(tmp_data))
        layer_walks = layer_walker.simulate_walks(args.num_walks, args.walk_length)
        all_walks.append(layer_walks)

    print('Finish generating the walks')
    return base_walks, all_walks
```
hi~
Hi @Refrainlhy,
Sorry for the late reply. In `simulate_walks`, we generate random walks starting from each node if possible.
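For intuition, a walk driver in the style described above could look like the following (a hypothetical simplification, not the repo's actual `RWGraph` code). Because every node is used as a walk start, even a node whose walk terminates immediately still appears in the output:

```python
import random

def simulate_walks(graph, num_walks, walk_length):
    """Simplified sketch: start `num_walks` walks from EVERY node."""
    walks = []
    nodes = list(graph)
    for _ in range(num_walks):
        random.shuffle(nodes)  # randomize the order, but still visit every node
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:  # dead end: stop early, but `start` is already kept
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks

# 'd' has no neighbors, yet it still shows up as a (length-1) walk
graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b'], 'd': []}
walks = simulate_walks(graph, num_walks=2, walk_length=5)
assert {w[0] for w in walks} == {'a', 'b', 'c', 'd'}
```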
hi,
I see that you first use the meta-path to sample nodes and get the `base_walks`. Then you use the `base_walks` to generate the vocabulary. After that, you generate `all_walks` according to the edge type, and only keep the nodes of `all_walks` that exist in the vocabulary.
I have two questions about this. (1) Why do you separate it into two steps? I mean, why not use the meta-path sampling alone, without obtaining `all_walks`? (2) If we use `base_walks` to generate the vocabulary, we may miss some nodes and then never get embeddings for those missed nodes, so why do we need to use `base_walks` instead of `all_walks` to generate the vocabulary?
Could you give me some hints? Thank you very much!