Open bkj opened 4 years ago
Hi. It's expected.
The original paper evaluates with LIBLINEAR, while here the classifier is evaluated with PyTorch. LIBLINEAR uses second-order optimization methods, and its stopping criterion differs from our PyTorch implementation's.
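To illustrate the point (this is a hypothetical sketch, not GraphVite's or LIBLINEAR's actual code): the same logistic regression trained with a first-order loop (like a PyTorch SGD trainer) and with a second-order Newton step (like LIBLINEAR's trust-region solver) stops at slightly different weights, so downstream accuracies can differ by a small margin.

```python
import numpy as np

# Tiny synthetic binary classification problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.5 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# First-order: plain gradient descent, stopped after a fixed budget
# (stands in for a PyTorch training loop with a loose tolerance).
w_gd = np.zeros(5)
for _ in range(200):
    grad = X.T @ (sigmoid(X @ w_gd) - y) / len(y)
    w_gd -= 0.5 * grad

# Second-order: Newton's method (stands in for LIBLINEAR-style solvers);
# converges in a handful of steps but to a slightly different point
# depending on its own stopping rule.
w_nt = np.zeros(5)
for _ in range(10):
    p = sigmoid(X @ w_nt)
    grad = X.T @ (p - y) / len(y)
    H = (X * (p * (1 - p))[:, None]).T @ X / len(y) + 1e-6 * np.eye(5)
    w_nt -= np.linalg.solve(H, grad)

acc_gd = ((X @ w_gd > 0) == y).mean()
acc_nt = ((X @ w_nt > 0) == y).mean()
print(acc_gd, acc_nt)  # close, but not guaranteed identical
```

A few-percentage-point gap between the two pipelines on real data is consistent with this kind of optimizer/stopping-criterion mismatch.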
OK great, thanks. Are you able to give some more details about the experimental setup for those numbers? I have the following files for the youtube dataset:
~/.graphvite/dataset/youtube/
├── youtube_graph.txt
├── youtube-groupmemberships.txt
├── youtube-groupmemberships.txt.gz
├── youtube_label.txt
└── youtube-links.txt.gz
I'm guessing that Table 4 shows the results for predicting youtube_label.txt -- but that file only has 50767 entries for 31761 unique nodes, instead of the |V| = 1,138,499 entries I'd expect. Thoughts?
Thanks! ~ Ben
If I had to guess w/o digging through your code (yet :) ) -- I'm guessing that you convert youtube_label.txt to a (num_labeled_examples, num_labels) binary matrix, then use k% of the rows of the matrix to train the classifier and the remaining (100 - k)% to validate it. In that case, "1% Labeled Nodes" would indicate you used 31761 * 0.01 ≈ 317 nodes to train the classifier in the first column of Table 4.
Is that right? Or am I misunderstanding something?
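For concreteness, here's a sketch of the split I'm describing (my guess at the setup, not GraphVite's actual code; the (node, label) pairs stand in for parsed youtube_label.txt lines):

```python
import numpy as np

# Stand-in for parsed youtube_label.txt lines: (node_id, label_id) pairs.
pairs = [(0, 0), (0, 2), (1, 1), (2, 0), (3, 2), (4, 1), (5, 0)]

nodes = sorted({n for n, _ in pairs})
labels = sorted({l for _, l in pairs})
node_idx = {n: i for i, n in enumerate(nodes)}
label_idx = {l: j for j, l in enumerate(labels)}

# Build the (num_labeled_examples, num_labels) binary indicator matrix.
Y = np.zeros((len(nodes), len(labels)), dtype=np.int8)
for n, l in pairs:
    Y[node_idx[n], label_idx[l]] = 1

# "k% Labeled Nodes": train on k% of the rows, validate on the rest.
k = 50  # the first column of Table 4 would correspond to k = 1
rng = np.random.default_rng(0)
perm = rng.permutation(len(nodes))
n_train = int(len(nodes) * k / 100)
train_rows, valid_rows = perm[:n_train], perm[n_train:]
print(len(train_rows), len(valid_rows))
```

Under this reading, only the 31761 labeled nodes ever enter the classifier; the rest of the 1,138,499 nodes are used for embedding training but not for evaluation.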
Yes, you're totally right. This setting is inherited directly from DeepWalk and LINE.
youtube-groupmemberships.txt is the raw label file, which contains a huge number of communities. Since most communities are very small and noisy, only a few large ones are used. For Youtube, it's the top 47 communities, following the DeepWalk paper. You can find the generation code here.
Personally, I guess the original authors used such a small evaluation set simply because they were running LIBLINEAR on the CPU.
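My reading of that preprocessing step, as a hedged sketch (the file name and the top-47 cutoff come from the discussion above, not from code I've verified; the in-memory list stands in for parsed "node community" lines):

```python
from collections import Counter

# Stand-in for parsed lines of the raw membership file,
# youtube-groupmemberships.txt: (node_id, community_id) pairs.
memberships = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b"), (6, "c")]

TOP_K = 2  # the real Youtube setup keeps the top 47 communities

# Count members per community and keep only the largest ones.
sizes = Counter(comm for _, comm in memberships)
kept = {comm for comm, _ in sizes.most_common(TOP_K)}

# Drop memberships in small/noisy communities.
filtered = [(n, c) for n, c in memberships if c in kept]
print(filtered)  # only members of the two largest communities remain
```

Filtering this way explains why youtube_label.txt covers only ~31k of the ~1.1M nodes: most nodes belong to none of the retained communities.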
Hi --
Can you point me to the code needed to reproduce the results in Table 4 of the paper? I ran
which produced
Those results are similar, but they differ from the numbers in Table 4 by a few percentage points. Is the command above correct? Or is this kind of variation expected?
Thanks!