cezhang01 / Adjacent-Encoder

Source code of the AAAI-2020 paper "Topic Modeling on Document Networks with Adjacent-Encoder"

Data Type #1

Closed by aficionadoai 4 years ago

aficionadoai commented 4 years ago

This is a great paper! I was wondering if this can be done simply using a .csv file, or does my data set have to be formatted like the one used in the paper?

cezhang01 commented 4 years ago

Hi,

Thank you for your interest in our paper. The current code only supports the format of the files in ./cora. To run the code on your own datasets, you can either reformat your datasets or manually change the code in data_preparation.py. Please let me know should you have further questions.

aficionadoai commented 4 years ago

@cezhang01 Thank you. Regarding my data set: I just have topic, abstract, and text columns. Would that affect the adjacency matrix (an NxN 0-1 symmetric matrix, A^T == A, whose diagonal elements are supposed to be 1)? Would I be able to run the model without it?

cezhang01 commented 4 years ago

@aficionadoai Actually, you can run our model without an adjacency matrix. However, in that case our model reduces to an autoencoder with 1-to-1 reconstruction. Since our model is designed for document networks (1-to-N reconstruction), a better way to use it is to first generate a document similarity network in the vocabulary space with KNN, then run our model.

You only need to input the network (adjacency matrix) and the texts (abstracts in your case) to run the model; you do not need the topic column. Topics are automatically learned during model training.

aficionadoai commented 4 years ago

@cezhang01 Thanks again! Regarding the text, does it need to be in the same format? In your case, are the texts hashed?

cezhang01 commented 4 years ago

@aficionadoai Yes, you need to prepare your texts as Bag-of-Words (BOW) representations. For example, suppose the vocabulary has five words, voc = ['topic', 'model', 'learning', 'singapore', 'university'], and you have three documents, A, B, and C. If document A has the text "this topic model is learning topic", then after removing the stop words ("this" and "is") its BOW is A = [2, 1, 1, 0, 0]. Documents B and C go through the same process to generate their BOW.
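
A minimal sketch of this counting step, assuming simple whitespace tokenization and the toy vocabulary and stop words above (the repo's actual preprocessing may differ):

from collections import Counter

voc = ['topic', 'model', 'learning', 'singapore', 'university']
stop_words = {'this', 'is'}

def to_bow(text):
    # count vocabulary words, ignoring stop words
    counts = Counter(w for w in text.lower().split() if w not in stop_words)
    return [counts.get(w, 0) for w in voc]

print(to_bow('this topic model is learning topic'))  # -> [2, 1, 1, 0, 0]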

Now that we have prepared the texts (BOW), let's move on to the network. Assume there is only one directed link, from document A to B, and C is alone. The adjacency matrix should look like

adjacency_matrix = 
[[1, 1, 0],
 [1, 1, 0],
 [0, 0, 1]]

Note that the adjacency matrix is supposed to be symmetric, with diagonal elements equal to 1.

Finally, if the label of documents A and B is "machine learning" while that of C is "society", the labels should be label = [1, 1, 2], where label_name = ['1 == machine learning', '2 == society'].
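
Putting the worked example together, here is a small NumPy sketch (the variable names are illustrative, not necessarily those used in data_preparation.py):

import numpy as np

# documents A, B, C -> indices 0, 1, 2
n_docs = 3
adjacency_matrix = np.eye(n_docs, dtype=int)  # diagonal elements are 1

# the single A-B link, stored in both directions to keep the matrix symmetric
edges = [(0, 1)]
for i, j in edges:
    adjacency_matrix[i, j] = adjacency_matrix[j, i] = 1

# A and B are "machine learning" (1), C is "society" (2)
label = np.array([1, 1, 2])
label_name = {1: 'machine learning', 2: 'society'}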

aficionadoai commented 4 years ago

@cezhang01 Makes perfect sense! Thank you; I greatly appreciate your help.

cezhang01 commented 4 years ago

@aficionadoai Thank you again for your interest in our paper! I am happy to answer your questions.

aficionadoai commented 4 years ago

@cezhang01 Would the document similarity network generated beforehand by KNN in the vocabulary space be the same as the document-term matrix? Or could I use a similarity matrix?

cezhang01 commented 4 years ago

@aficionadoai No, the document-term matrix is the documents' content (their Bag-of-Words representations). You can use the document-term matrix as input to a KNN search and output the K nearest neighbors of each document, which gives the similarity matrix (the document-document adjacency matrix). In this case, if document A is one of the K nearest neighbors of B, we deem there to be a link between A and B. (Note that each link is undirected: from A to B and from B to A.)
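
One way to sketch this construction with scikit-learn's kneighbors_graph (cosine distance and k = 1 are illustrative choices for the toy data below, not necessarily the paper's settings):

import numpy as np
from sklearn.neighbors import kneighbors_graph

# bow: (n_docs, vocab_size) document-term matrix of word counts
bow = np.array([[2, 1, 1, 0, 0],
                [0, 1, 0, 1, 1],
                [1, 0, 2, 0, 1]])

# each document links to its k nearest neighbors in the vocabulary space
k = 1
knn = kneighbors_graph(bow, n_neighbors=k, metric='cosine', include_self=False)
adjacency_matrix = knn.toarray().astype(int)

# make links undirected and set diagonal elements to 1
adjacency_matrix = np.maximum(adjacency_matrix, adjacency_matrix.T)
np.fill_diagonal(adjacency_matrix, 1)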

There are also other ways to generate an adjacency matrix. i) If two movies on YouTube share a certain number of common genres, we draw a link between them; their respective movie descriptions are the textual content. ii) If two pieces of news share many common tags (as distinct from their categories), we draw a link between them, and the news articles are the corresponding texts. iii) If two proteins in our body are connected by an interaction, we consider them a protein-protein pair and draw a link between them; their DNA sequences are the textual content. iv) Users on Twitter form a follower-followee social network, where the users' posts are their textual content. Feel free to come up with more ways of generating the network.
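
For example, the shared-genres idea in i) could be sketched as below (min_shared is a hypothetical threshold, not something defined in the repo):

import numpy as np

def adjacency_from_tags(tag_sets, min_shared=2):
    # link two items if they share at least min_shared genres/tags
    n = len(tag_sets)
    adj = np.eye(n, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if len(tag_sets[i] & tag_sets[j]) >= min_shared:
                adj[i, j] = adj[j, i] = 1
    return adj

movies = [{'comedy', 'romance'}, {'comedy', 'romance', 'drama'}, {'horror'}]
print(adjacency_from_tags(movies))  # first two movies linked; the third alone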