cszhangzhen / ANRL

ANRL: Attributed Network Representation Learning via Deep Neural Networks (IJCAI 2018)

Can ANRL run the model with customized continuous attributes? #4

Closed Ailing-Zou closed 5 years ago

Ailing-Zou commented 5 years ago

Hi, I have a few questions about ANRL:

  1. Can ANRL run with customized continuous attributes? (I did not find an obvious statement about this.)
  2. Can ANRL run without labels? If so, what modifications do I need to make to the code?
  3. Can this method handle weighted edges? If so, do I need to make further modifications to your code? (I did find your answer to another person, but the relevant code was annotated with `pass`.)

continuous attributes example:

```
0 1100255.0 9.1 29.5261 248515.0 0.2351 0 0 0 0 1 0 0 0 0 0 0 0
1 228151.0 5.64 41.8182 92073.0 0.4239 0 0 0 0 0 1 0 0 0 0 0 0
2 131061.9 0.5345 34.2327 32476.5 0.2621 0 1 0 0 0 0 0 0 0 0 0 0
3 222647.0 7.21 -6.3848 43273.0 0.1299 0 0 0 0 0 1 0 0 0 0 0 0
```

Look forward to your reply!

cszhangzhen commented 5 years ago

Hi,

Sorry for the late reply.

  1. The code does not need any modification to make it work with customized continuous attributes.

  2. ANRL is an unsupervised model, so it can definitely run without labels. The labels are only used to evaluate the quality of the learned node embeddings. Actually, it depends on your task: for link prediction, labels are not necessary, but for node classification you will need node labels for evaluation.

  3. I have updated the code so that it can handle weighted edges. Please re-download the code. Just set the `weighted` flag to True, and make sure the edgelist file contains the edge weight in the third column. In the code, I only give a simple example of how to utilize the edge weight; depending on the scale of your edge weights, you may want to normalize them among each node's neighbors or apply some other normalization.
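For instance, normalizing each node's edge weights among its neighbors (so they sum to 1) could be sketched as below; `normalize_neighbor_weights` is a hypothetical helper for illustration, not part of the ANRL repository:

```python
from collections import defaultdict

def normalize_neighbor_weights(edges):
    """Rescale each source node's edge weights so they sum to 1.

    `edges` is a list of (src, dst, weight) triples, as read from an
    edgelist file whose third column holds the edge weight.  For an
    undirected graph this assumes both directions are listed.
    """
    totals = defaultdict(float)
    for src, _, w in edges:
        totals[src] += w
    return [(src, dst, w / totals[src]) for src, dst, w in edges]

edges = [(0, 1, 2.0), (0, 2, 6.0), (1, 2, 3.0)]
print(normalize_neighbor_weights(edges))
# [(0, 1, 0.25), (0, 2, 0.75), (1, 2, 1.0)]
```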

Ailing-Zou commented 5 years ago

Thank you so much for your reply! Here is another question: what about nodes without features? I mean, some nodes have features while others do not. Should I just set the feature vectors of nodes without features to zero?

```
0 1100255.0 9.1 29.5261 248515.0 0.2351 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 131061.9 0.5345 34.2327 32476.5 0.2621 0 1 0 0 0 0 0 0 0 0 0 0
3 222647.0 7.21 -6.3848 43273.0 0.1299 0 0 0 0 0 1 0 0 0 0 0 0
```

Look forward to your reply!

cszhangzhen commented 5 years ago

Constructing all-zero features is a possible solution.

Maybe you can also construct some structural features, such as node degrees or adjacent neighbors, or just use rows of the adjacency matrix as node features (though the dimension will be high if you have a very large graph).
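As a rough sketch of such hand-crafted structure features, one could use each node's adjacency row with its degree appended; `structure_features` is a hypothetical helper, not code from the ANRL repository:

```python
def structure_features(n_nodes, edges):
    """Build simple structural features for an undirected graph: each
    node's adjacency row, plus its degree as a final extra column.
    For large graphs the adjacency-row part becomes very
    high-dimensional, as noted above."""
    adj = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    return [row + [sum(row)] for row in adj]

feats = structure_features(3, [(0, 1), (1, 2)])
# node 1 is adjacent to nodes 0 and 2, so its last column (degree) is 2.0
```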

Ailing-Zou commented 5 years ago

Hi, thank you so much for your reply!

1. In the readme.txt, you mention that:

```
xxx.feature: this file has n+1 lines.
The first line: node_number feature_dimension
The next n lines (one node per line, ordered by node id):
(for node_1) feature_1 feature_2 ... feature_n
(for node_2) feature_1 feature_2 ... feature_n
```

But in citeseer.feature, the first column seems to be a feature value rather than the node number. (I know your brackets may mean "optional", but I'd like to confirm with you.)


2. After feeding my feature list (with some zero-imputed features) into ANRL, the embedding results are shown below.

(429 nodes with normal features, no zero imputation):

```
0,0.99993694,-0.99975246,-0.9995636,0.99997365,-0.99996305,-0.999851,0.9995404,-0.99994814,-0.9997003,0.9996059
1,0.9999473,-0.9996929,-0.9994613,0.999979,-0.9999706,-0.9998148,0.9994188,-0.9999569,-0.99962646,0.99950814
```

(1137 nodes with zero imputation, so there are actually many more missing values than observed ones):

```
1565,-0.96549124,0.99866956,0.99644196,-0.97162485,0.96705186,0.9989647,-0.9973055,0.9685862,0.9969617,-0.99787354
1566,-0.96549124,0.99866956,0.99644196,-0.97162485,0.96705186,0.9989647,-0.9973055,0.9685862,0.9969617,-0.99787354
```

I tried several methods, such as zero imputation, mean imputation, and median imputation, but the embedding results are not good.
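For reference, the imputation strategies mentioned here could be sketched as follows; `impute` is a hypothetical helper written for this illustration, not part of ANRL:

```python
def impute(features, has_features, strategy="mean"):
    """Fill in feature vectors for nodes without attributes.

    `features` is a list of feature rows (rows for nodes without
    attributes may hold anything); `has_features[i]` says whether row i
    is observed.  Missing rows are replaced column-wise by the chosen
    statistic computed over the observed rows.
    """
    observed = [f for f, ok in zip(features, has_features) if ok]
    dim = len(observed[0])
    if strategy == "zero":
        fill = [0.0] * dim
    elif strategy == "mean":
        fill = [sum(col) / len(observed) for col in zip(*observed)]
    elif strategy == "median":
        # for an even number of observed rows this takes the upper middle value
        fill = [sorted(col)[len(col) // 2] for col in zip(*observed)]
    else:
        raise ValueError(strategy)
    return [f if ok else fill for f, ok in zip(features, has_features)]

rows = impute([[1.0, 2.0], [], [3.0, 4.0]], [True, False, True], "mean")
# the missing middle row becomes the column means: [2.0, 3.0]
```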

My guess is that the results may be skewed because so many values are imputed. Could you please give me some advice on how to improve my input data?

Look forward to your reply.

cszhangzhen commented 5 years ago

Hi,

1, In citeseer.feature, the first line is `3312 3703`, which means the graph contains 3312 nodes and the node features have dimension 3703. In the next n lines, the first column is indeed a feature value. I used "(for node_1)" in the readme to indicate that the line holds node_1's features, not that "(for node_1)" is optional. Sorry for the misleading wording.
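Given that format, a minimal reader for a `.feature` file might look like this (a hypothetical sketch, not the repository's actual loader):

```python
def read_feature_file(path):
    """Read an ANRL-style .feature file: the first line holds
    `node_number feature_dimension`; each of the next n lines holds
    one node's features, ordered by node id."""
    with open(path) as f:
        n_nodes, dim = map(int, f.readline().split())
        feats = [list(map(float, f.readline().split())) for _ in range(n_nodes)]
    assert all(len(row) == dim for row in feats), "bad feature dimension"
    return feats
```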

2, The embedding results do look somewhat strange. For the 429 nodes with normal features, the embedding values are almost all +1 and -1, which are the upper and lower bounds of the tanh activation function. Maybe you need to perform some normalization on the node features (e.g., normalize them into [0, 1]). For the 1137 zero-imputed nodes, the two node embeddings shown are identical. This is also a bit strange, since the embedding depends on the graph structure, not only the node features. Are these two nodes isolated?
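A minimal sketch of such a [0, 1] min-max normalization (hypothetical helper, not ANRL code):

```python
def minmax_normalize(features):
    """Column-wise min-max scaling into [0, 1].
    Constant columns are mapped to 0 to avoid division by zero."""
    cols = list(zip(*features))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in features
    ]

print(minmax_normalize([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```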

Ailing-Zou commented 5 years ago

Thank you for your reply!

I'd like to describe my application scenario to make my statement clearer. There are 1566 nodes in total (429 + 1137) and 4170 node pairs. Only 429 nodes have features, while the other 1137 do not.

So this dataset is very unbalanced, I have to say.

cszhangzhen commented 5 years ago

Ah, I see.

If most of the nodes do not have features, constructing all-zero features for them is not a good choice. It will be difficult for most existing graph embedding methods to handle this situation.

I would suggest incorporating some hand-crafted graph-structure information for each node.

BTW, in what situation do some nodes in a graph have features while others do not?

Ailing-Zou commented 5 years ago

Hi, thank you for your reply! My classmate advised me to try a heterogeneous model, since some nodes have features while others do not. In short, this graph must be directed, weighted, and attributed.

cszhangzhen commented 5 years ago

Yes, in a heterogeneous information network, nodes and links are of different types; some of them may have features while others do not.

Here is a possible solution: SHNE: Representation Learning for Semantic-Associated Heterogeneous Networks, WSDM 2019. Hope it can help you.

Ailing-Zou commented 5 years ago

Thank you so much, you have a good heart! But it seems there is no open-source code for this paper. If you come across a paper that satisfies my needs and has open-source code, could you please let me know? Thank you.

cszhangzhen commented 5 years ago

Hi, actually they have released the code. You can find it at https://github.com/chuxuzhang/WSDM2019_SHNE

Ailing-Zou commented 5 years ago

Hi, thank you so much for your reply!

But my application has continuous numerical features (all numbers) rather than a semantic-associated network, because stocks have features such as market cap (a continuous value).

I am not very familiar with this field, so it has really taken me a long time to find a suitable model.

cszhangzhen commented 5 years ago

OK, I'll let you know if I encounter a paper that satisfies your scenario.