Dataset Description - GitHub Issues

Tianqi-py commented 1 year ago

Hi there,

I was analyzing the graph dataset SIDER used in this paper and had difficulty understanding how the adj matrix is used in the model.

For example, the train adj has 1141 rows, where each row corresponds to one training data point. But the rows have different lengths, and every entry is either zero or one. Could you explain how the adj matrix is saved here, or maybe add a dataset description file to the repo?

And also, how is the adj matrix split? In the classification task where the features from valid and test data are used to generate the representation of the training data, the adj_train should be asymmetrical and directed.

Thanks for your help in advance!

jacklanchantin commented 1 year ago

the adjacency matrix processing is done here: https://github.com/QData/LaMP/blob/master/utils/utils.py#L86

does that help?

Tianqi-py commented 1 year ago

Thanks for your quick reply :) I understand that the full adjacency matrix is symmetrical and generated by this function. Could you please explain what the lines in data["train"]["adj"] mean?

jacklanchantin commented 1 year ago

That's the train split adjacency matrix (should be either a full adjacency matrix or sparse representation).

Tianqi-py commented 1 year ago

Thanks again! They are not a full adj matrix, which should be (1141, 1141) for the training data... Is there any chance you could tell me in which sparse form they are saved? I have difficulty interpreting this matrix...

jacklanchantin commented 1 year ago

There are 1,141 samples (see Table 5 in the paper).

See the adj_insts variable in DataLoader. That's what SIDER uses.

Tianqi-py commented 1 year ago

Thanks for your help :) After checking the code, I figured out my confusion. Just for future reference, in case anybody else is confused about the adj matrix:

As mentioned in the paper, LaMP can make use of the original graph structure for message passing. SIDER is the dataset with a prior graph structure. Normally, the adj matrix of a graph summarizes the graph structure and has the shape of (n,n), with n being the number of nodes in a graph. If the adj matrix is too big, there are many sparse formats to save it.
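To make the dense versus sparse distinction concrete, here is a minimal sketch in plain Python. It is not taken from the LaMP code; the helper names dense_to_adj_list and adj_list_to_dense and the toy 4-node graph are hypothetical and only illustrate one common sparse layout, the adjacency list.

```python
def dense_to_adj_list(dense):
    """Convert a square 0/1 adjacency matrix into {node: [neighbors]}."""
    n = len(dense)
    if any(len(row) != n for row in dense):
        raise ValueError("adjacency matrix must be square (n, n)")
    return {i: [j for j, v in enumerate(row) if v] for i, row in enumerate(dense)}


def adj_list_to_dense(adj_list, n):
    """Rebuild the dense (n, n) matrix from an adjacency list."""
    dense = [[0] * n for _ in range(n)]
    for i, neighbors in adj_list.items():
        for j in neighbors:
            dense[i][j] = 1
    return dense


# Toy undirected graph (hypothetical data): edges 0-1 and 1-2, node 3 is isolated.
dense = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

adj_list = dense_to_adj_list(dense)
print(adj_list)
```

The adjacency list stores only the non-zero entries, so its per-node entries naturally have different lengths, while the dense form always has n * n entries.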

Specifically, in the LaMP implementation the adj matrix is saved as a list, where each element corresponds to the adj matrix of one node. This explains why each line in the adj matrix has a different length (nodes have different numbers of neighbors). The function "construct_adj_mat" in dataloader.py converts each 1D line into a 2D adj matrix. The final adj matrix used by the model is adj_insts, a list of 2D adj matrices with different shapes.
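As a rough illustration of the 1D-to-2D step, here is a minimal sketch. It is not the repository's construct_adj_mat, and I have not verified which layout that function expects. The sketch assumes, purely for illustration, that each line is a row-major flattening of a square k x k matrix; the helper unflatten_square and the toy data are hypothetical.

```python
import math


def unflatten_square(flat):
    """Reshape a flat list of length k*k into a k x k nested list.

    Assumption (not confirmed by the LaMP code): the flat list is a
    row-major flattening of a square 0/1 adjacency matrix.
    """
    k = math.isqrt(len(flat))
    if k * k != len(flat):
        raise ValueError(f"length {len(flat)} is not a perfect square")
    return [list(flat[r * k:(r + 1) * k]) for r in range(k)]


# Toy stand-in for data["train"]["adj"]: lines of different lengths (hypothetical data).
toy_adj = [
    [0, 1, 1, 0],                 # flattened 2 x 2 matrix
    [0, 1, 0, 1, 0, 1, 0, 1, 0],  # flattened 3 x 3 matrix
]

# Analogous in spirit to adj_insts: a list of 2D matrices with different shapes.
toy_adj_insts = [unflatten_square(line) for line in toy_adj]
shapes = [(len(m), len(m[0])) for m in toy_adj_insts]
print(shapes)
```

If the actual storage format differs (for example, index pairs instead of a flattened matrix), only the reshaping helper would need to change; the result would still be a list of 2D matrices, one per line.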

Please let me know if I have misinterpreted anything :) Thanks again for your help :)
