YuxiangRen / Heterogeneous-Deep-Graph-Infomax

HDGI code

DBLP dataset #3

Open shirinmous opened 4 years ago

shirinmous commented 4 years ago

Hi, this dataset is the same one used in ‘Graph-based consensus maximization among multiple supervised and unsupervised models’. Your paper says that the initial features of the target nodes are bag-of-words embeddings based on profiles. How do you get the authors' profiles?

YuxiangRen commented 4 years ago

The DBLP dataset used in this paper originally comes from ‘Graph-based consensus maximization among multiple supervised and unsupervised models’, but we use the extended version from the paper "Heterogeneous Graph Attention Network" (HAN). In this version, each author has some keywords, which constitute the profile. You can get this dataset from the HAN code. I can also provide you with the dataset. If you have any other questions, feel free to let me know.

YuxiangRen commented 4 years ago

Because of GitHub's size limit, I put the DBLP dataset at https://drive.google.com/open?id=1zdF3KGp0sk3ZatEvrF6_QHTuxK40-fz8

shirinmous commented 4 years ago

Thank you for replying. I downloaded the DBLP dataset from the HAN paper's repository, but I didn't see any file for the authors' features. I found the features for authors at the link above. Thanks again. Are the features obtained from the term nodes that are connected to the paper nodes in the DBLP graph? If I want to obtain features for the nodes of other datasets derived from DBLP, what should I do?

YuxiangRen commented 4 years ago

I think it depends on the raw data and how you decide to use the dataset. In my paper, I use the keywords of papers as the profile. If you can collect other information such as organization, gender, title, and so on, it can supplement the DBLP dataset. However, the DBLP platform itself doesn't provide much information about authors.

shirinmous commented 4 years ago

That's right. Also, for papers we only have titles, not even abstracts. The dimension of the features is [4075, 334]. Did you do any dimensionality reduction like PCA or LSA?

shirinmous commented 4 years ago

The feature matrix's dimension is [4075, 334]. Did you apply any dimensionality reduction, such as PCA or LSA, to the bag of words?

YuxiangRen commented 4 years ago

Sorry for the late response. Author features are the elements of a bag-of-words representation of keywords. The size of the keyword vocabulary is 334.
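For readers following along, here is a minimal sketch of what a bag-of-words author feature matrix could look like. It is not the HDGI or HAN preprocessing code: the toy `author_keywords` mapping, the `build_bow` helper, and the choice of raw counts (rather than binary indicators) are all assumptions made for illustration. With the real data, the result would have 4075 rows and 334 columns.

```python
from collections import Counter

# Hypothetical toy data: author id -> keywords gathered from that author's papers.
author_keywords = {
    "a1": ["graph", "mining", "graph"],
    "a2": ["database", "query"],
    "a3": ["mining", "query", "learning"],
}

def build_bow(author_keywords):
    """Return (author_ids, vocabulary, matrix) with one count row per author."""
    # The vocabulary is every distinct keyword, sorted so column order is stable.
    vocabulary = sorted({kw for kws in author_keywords.values() for kw in kws})
    column = {kw: j for j, kw in enumerate(vocabulary)}
    # Keeping the sorted id list lets row i be traced back to author_ids[i].
    author_ids = sorted(author_keywords)
    matrix = []
    for author in author_ids:
        row = [0] * len(vocabulary)
        for kw, count in Counter(author_keywords[author]).items():
            row[column[kw]] = count
        matrix.append(row)
    return author_ids, vocabulary, matrix

author_ids, vocabulary, features = build_bow(author_keywords)
```

Returning `author_ids` next to the matrix keeps the row-to-author mapping explicit, which is what you need if you want to look features up by author name or ID.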

shirinmous commented 4 years ago

Hi again, I want to obtain the authors' features myself, but I don't have enough information to do that. There are more than 8000 terms in the dataset, but only 334 keywords in the BOW features. What preprocessing did you apply to get 334 terms from 8000? I can't use your feature matrix as it is, because I want to look up the authors' features by their names or IDs. Thank you.
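The thread does not say how the 334 keywords were selected. One common way to shrink a term vocabulary is to drop terms that occur for too few authors; the sketch below shows that idea only. It is an assumption, not the confirmed HDGI or HAN preprocessing, and the `prune_vocabulary` helper, the `min_authors` threshold, and the toy data are hypothetical.

```python
from collections import Counter

def prune_vocabulary(author_terms, min_authors):
    """Keep terms that appear in the term lists of at least `min_authors` authors."""
    doc_freq = Counter()
    for terms in author_terms.values():
        doc_freq.update(set(terms))  # count each term at most once per author
    return sorted(t for t, n in doc_freq.items() if n >= min_authors)

# Hypothetical toy data: author id -> terms from the titles of that author's papers.
author_terms = {
    "a1": ["graph", "mining", "graph", "novel"],
    "a2": ["graph", "query"],
    "a3": ["mining", "query", "approach"],
}

kept = prune_vocabulary(author_terms, min_authors=2)
```

Because the input is keyed by author ID, any feature matrix built from the pruned vocabulary can still be indexed by author.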