codeKgu / Text-GCN

A PyTorch implementation of "Graph Convolutional Networks for Text Classification" (AAAI 2019).

Node label one hot encoded #2

Open matteomedioli opened 3 years ago

matteomedioli commented 3 years ago

Hi, I'm working with WordNet and graph neural networks. How is it possible to one-hot encode a complete vocabulary? For example, I have 250k different words. How many words do you use for your model? Thanks in advance!

codeKgu commented 3 years ago

A one-hot init on the complete vocabulary would be something like the identity matrix over all words in your vocabulary, so 250k nodes. The function that does this in the code is init_node_feats. The number of words for the datasets in the repo can be seen in data/corpus by looking at the {dataset}_vocab.txt files. For example, the r8_presplit dataset has 7688 nodes. This vocabulary is built from the sentences, keeping only words with frequency greater than 5; see the build_text_graph_dataset function.

However, note that 250k nodes with a one-hot initialization would not be feasible: each node would have an initial feature dimension of 250k, and the resulting graph could not be loaded into system memory.
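A minimal sketch of what this one-hot identity init could look like in PyTorch. The function name one_hot_node_feats and its sparse flag are illustrative, not the repo's actual init_node_feats; the sparse variant is one way to sidestep the memory problem described above, at the cost of the model having to support sparse inputs:

```python
import torch

def one_hot_node_feats(num_nodes: int, sparse: bool = False) -> torch.Tensor:
    """One-hot node features: row i is the one-hot vector for node i,
    so the full feature matrix is just the identity."""
    if sparse:
        # Sparse identity: stores only num_nodes nonzeros
        # instead of num_nodes**2 dense entries.
        idx = torch.arange(num_nodes)
        return torch.sparse_coo_tensor(
            torch.stack([idx, idx]), torch.ones(num_nodes), (num_nodes, num_nodes)
        )
    return torch.eye(num_nodes)

# r8_presplit-sized vocabulary: a 7688 x 7688 dense identity is manageable.
feats = one_hot_node_feats(7688)

# 250k words: a dense identity would need 250_000**2 * 4 bytes ~ 250 GB,
# which is why a dense one-hot init is infeasible at that scale.
big_feats = one_hot_node_feats(250_000, sparse=True)
```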

riyaj8888 commented 2 years ago

Let's assume I have 5000 documents with their 5000 integer labels, and the corpus contains 14000 unique words. According to the paper, the total number of nodes will be total documents + vocab size = 5000 + 14000 = 19000 nodes. For the documents we know the labels, but how are you creating the labels for the vocabulary word nodes? Can you clarify this?