afshinrahimi / geographconv

Semi-supervised User Geolocation via Graph Convolutional Networks

Questions regarding the dataset and best hyper-parameters #1

Open felixgwu opened 5 years ago

felixgwu commented 5 years ago

Hi,

Thank you for sharing the code of this impressive work. I have two questions regarding how to reproduce the results in the paper.

  1. Based on the paper, it seems that the TWITTER-WORLD dataset is larger than TWITTER-US; however, when I downloaded the data from this link, I found that the files in the na folder are larger than the ones in the world folder, which confuses me. I wonder if there is a naming typo here.

  2. I tried the following command to get the GCN results with the default hyper-parameters on GEOTEXT:

python gcnmain.py -save -i cmu -d data/cmu -enc latin1

Unfortunately, I only get:

dev results: Mean: 565 Median: 103 Acc@161: 54
test results: Mean: 578 Median: 99 Acc@161: 53

This is a lot worse than the Mean: 546 Median: 45 Acc@161: 60 reported in the paper. Could you please share the commands you used to produce the amazing GCN results on all three datasets in Table 1 of the paper?

afshinrahimi commented 5 years ago

Hi Felix,

I should have added the hyperparameters to the readme file (which I will do very soon).

CMU: THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -highway

NA: THEANO_FLAGS='device=cpu,floatX=float32' python -u gcnmain.py -hid 600 600 600 -bucket 2400 -batch 500 -d ~/data/na/ -mindf 10 -reg 0.0 -dropout 0.5 -cel 15 -highway

WORLD: THEANO_FLAGS='device=cpu,floatX=float32' python -u gcnmain.py -hid 900 900 900 -bucket 2400 -batch 500 -d ~/data/world/ -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -highway

Why is Twitter-WORLD smaller in file size than Twitter-US? WORLD has a higher number of users but fewer tweets per user, so the NA dataset is actually larger in total size.

Also note that the random seeds have changed, so you might not get the exact same results (unfortunately); across several runs they may come out a little better or worse, but in general comparable.

Don't hesitate to contact me if there are more issues.

Afshin

felixgwu commented 5 years ago

Hi Afshin,

Thank you for your instant response.

I ran the first command:

THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -silent -highway

However, I only got:

dev: Mean: 536 Median: 100 Acc@161: 55
test: Mean: 561 Median: 96 Acc@161: 54

I ran 10 other runs with different seeds (from 1 to 10), but I still can't get an acc@161 higher than 55. Here is the log I got from these 10 runs: https://gist.github.com/felixgwu/8ae4c6e7a887092ae30c82fea6d6db40

I wonder if I made some mistakes.

Here is how I created the environment. I first created a new conda environment using the requirements.txt file:

conda create --name geo --file requirements.txt

However, I got an error when it tried to import lasagne (version 0.1). The error occurred at:

from theano.tensor.signal import downsample

It seems that theano no longer has downsample in theano.tensor.signal, so I upgraded both theano and lasagne to the newest versions with these commands:

pip install --upgrade https://github.com/Theano/Theano/archive/master.zip
pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip

After that, I could run the gcnmain.py script without errors. BTW, I use CUDA 8.0. I only have limited experience with Theano, so maybe there is something wrong here.

-Felix

afshinrahimi commented 5 years ago

Hi Felix,

Regarding the Lasagne and Theano update, you're right, they should be upgraded.

Regarding running with the default hyperparameters: we shouldn't do that, because the defaults are not suitable for all three datasets; they're just there so that the code runs (e.g. the default hidden layer size is only 100 and the bucket size is 300, which should be 300 300 300 and 50, respectively, for cmu).

Regarding the command you ran:

THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -silent -highway

I made the mistake of including -silent in the command I posted earlier. If you remove it, you'll see the correct results, like this: https://gist.github.com/felixgwu/8ae4c6e7a887092ae30c82fea6d6db40 That's why you only got the dev: and test: lines.

Don't hesitate to send me feedback if something is still wrong; I'd love to fix errors and help.

Thanks Felix.

Afshin

felixgwu commented 5 years ago

Hi Afshin,

Thank you so much! I can finally reproduce your results on GEOTEXT from the paper. At first I couldn't reproduce them even with the correct command, but after I removed data/cmu/dump.pkl and data/cmu/vocab.pkl and ran the script again, I got the correct results. I wonder if something is cached here, so that I have to delete these files whenever I use a different set of hyperparameters.
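In case it helps someone else, clearing the cached files is just something like:

rm data/cmu/dump.pkl data/cmu/vocab.pkl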

Here is the log in case someone else also wants to reproduce it. https://gist.github.com/felixgwu/8c74a28b040b95635fcf28c5c1e3e078

I'll try the other two larger datasets and hopefully I can reproduce those results too. BTW, there are two typos in the commands for the NA and World datasets in the README file: ~/data/na/ should be ./data/na/ and ~/data/world/ should be ./data/world/.

-Felix

afshinrahimi commented 5 years ago

Hi Felix,

Great news!

On the first run, the code saves the preprocessed dataset to dump.pkl in the dataset directory. On subsequent runs it loads that file by default. If that file was built with incorrect hyperparameters, it will still be loaded even if the new hyperparameters (e.g. bucket size) are correct. To stop it from doing that, we can use the -builddata option to force it to rebuild dump.pkl.
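For example, for the CMU dataset that would be something like the same command as above, with -builddata appended:

THEANO_FLAGS='device=cuda0,floatX=float32' nice -n 9 python -u gcnmain.py -hid 300 300 300 -bucket 50 -batch 500 -d ./data/cmu/ -enc latin1 -mindf 10 -reg 0.0 -dropout 0.5 -cel 5 -highway -builddata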

Thanks a lot, Felix, for the typo fixes and all the other help (I'll add them to the repo ASAP). It makes the results easier for everyone else to reproduce.

Afshin