Tiiiger / SGC

Official implementation for the paper "Simplifying Graph Convolutional Networks"
MIT License

fix cutoff index used to cut words from the vocabulary #3

Closed mro15 closed 5 years ago

mro15 commented 5 years ago

Hello, first of all I would like to say that this is amazing work! I have been testing it on the text (document) datasets.

I found that the vocabulary size your approach produces on the text datasets differs from that of TextGCN. For example, on the R8 dataset, `wc -l data/corpus/R8_vocab.txt` reports a vocabulary of 6791 words, but the TextGCN paper reports 7688 words.

I fixed this by changing the cutoff index from 5 to 4 in `remove_words.py`.
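
For context on why a one-off cutoff index changes the vocabulary size, here is a minimal sketch of this kind of frequency cutoff (the function name, signature, and threshold logic are assumptions for illustration, not the actual code in `remove_words.py`):

```python
from collections import Counter

# Hypothetical sketch of a frequency-based vocabulary cutoff; not the
# repo's remove_words.py.
def build_vocab(docs, min_count=5):
    """Keep only words appearing at least `min_count` times across `docs`."""
    counts = Counter(word for doc in docs for word in doc.split())
    # An off-by-one in this comparison (e.g. requiring > min_count instead
    # of >= min_count) silently drops the boundary-frequency words and
    # shrinks the vocabulary -- the kind of discrepancy that could produce
    # 6791 vs. 7688 words on R8.
    return sorted(w for w, c in counts.items() if c >= min_count)

docs = ["the cat sat", "the cat ran", "the dog ran", "the cat sat"]
print(build_vocab(docs, min_count=2))  # ['cat', 'ran', 'sat', 'the']
```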

Tiiiger commented 5 years ago

Hi, thanks for reaching out. I need to double-check whether this affects the performance; I'll get back to you later.

Tiiiger commented 5 years ago

Hi, sorry for the delay. I have double-checked this: there is no significant effect on model performance on any of the datasets.

However, since a cutoff of 5 is what we used in the experiments for the paper, I'd like to keep it in this repo so that people know exactly what we did and can replicate it. Note that we ran all the experiments for TextGCN and TextSGC together, so this difference between the TextGCN repo and ours does not affect the comparison between GCN and SGC in our paper.

Since I know you are interested in both TextGCN and TextSGC, I should also point out that `remove_words.py` and `utils.py` differ slightly from the TextGCN repo.

I will add a note on the differences between their repo and ours to the README when I have more time. I am closing this, but feel free to ask further questions.