crazydonkey200 / neural-symbolic-machines

Neural Symbolic Machines is a framework to integrate neural networks and symbolic representations using reinforcement learning, with applications in program synthesis and semantic parsing.
Apache License 2.0

Reproducing results from the paper #15

Open koenvanderveen opened 5 years ago

koenvanderveen commented 5 years ago

Hi! I was playing with your code, great work! I am trying to reproduce the results from your paper on WikiSQL. However, when using run.sh I get results around 70.3% (on the dev set) instead of the reported 72.2%. Are there any parameters I need to change to get the reported results?

Thanks in advance!

crazydonkey200 commented 5 years ago

Thanks for asking. The result in the paper was obtained using the default parameters in the repo on an AWS g3.xlarge machine.

There are three sources of difference between experiments (and the sensitivity of RL training tends to amplify them): (1) the stochasticity from the random seed, (2) the stochasticity from asynchronous training, and (3) differences in machine configuration. In my experience, even instances of the same type can sometimes behave slightly differently in the cloud.

But the difference you saw is larger than the standard deviation in my experiments, so I would also like to investigate it.

I am working on an update to fix (1) and (2) and make the experiments more deterministic. For (3), may I know the machine configuration you are using?
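
As a rough illustration of what controlling (1) involves, here is a minimal sketch only (assuming TensorFlow 1.x and NumPy); the actual change in the upcoming update may differ, and (2) additionally requires removing the asynchrony between workers:

# Illustrative only: pin the seeds of all random number generators involved.
import random
import numpy as np
import tensorflow as tf

SEED = 2018  # arbitrary example value

random.seed(SEED)         # Python-level randomness (sampling, shuffling)
np.random.seed(SEED)      # NumPy-based randomness
tf.set_random_seed(SEED)  # graph-level seed for TensorFlow 1.x ops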

In the README, I attached a picture of the learning curve of one run that reached 72.35% dev accuracy on WikiSQL. If it helps, I can also share the full TensorBoard log and the saved best model from a more recent experiment.

koenvanderveen commented 5 years ago

Thanks for your quick response! I used an AWS g3.xlarge. I tried multiple times but consistently get results around 70.3%.

crazydonkey200 commented 5 years ago

Thanks for the input. I will try starting some new AWS instances to see if I can replicate the issue. In the meantime, here's a link to the data of a recent run that reached 72.2% dev accuracy. The TensorBoard log is in the tb_log subfolder, and the best model is saved in the best_model subfolder.
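
If it is useful, the downloaded log can be inspected locally with TensorBoard (assuming TensorBoard is installed; ./downloaded_run below is just a placeholder for wherever you extract the archive):

# Point --logdir at the tb_log subfolder of the downloaded run.
tensorboard --logdir ./downloaded_run/tb_log --port 6006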

koenvanderveen commented 5 years ago

Thanks, I'd love to find out where the difference originates. I downloaded the repo again to make sure I had not made any changes and ran it again, but reached the same result. The only thing I had to change to make it work was replacing (line 70 of table/utils.py):

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError): 
  val = val.lower() 

with

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError, UnicodeEncodeError): 
  val = val.lower() 

This was due to errors like: UnicodeEncodeError: 'decimal' codec can't encode character u'\u2013' in position 1: invalid decimal Unicode string

Do you think that might be the reason? And if so, do you have any idea how to avoid those errors in the first place rather than catching them?
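
For reference, here is a self-contained version of the patched logic (a minimal sketch assuming Babel is installed; the example value mirrors the u'\u2013' character from the failing cells):

# Sketch of the change above: parse a cell as a decimal and fall back to
# lowercasing when Babel cannot handle it. On Python 2 some non-ASCII values
# surface as UnicodeEncodeError rather than NumberFormatError, hence catching both.
import babel.numbers

def normalize_cell(val):
  try:
    return babel.numbers.parse_decimal(val)
  except (babel.numbers.NumberFormatError, UnicodeEncodeError):
    return val.lower()

print(normalize_cell(u'3.14'))      # parsed as a Decimal
print(normalize_cell(u'1\u20132'))  # falls back to the lowercased string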

crazydonkey200 commented 5 years ago

Sorry for the late reply. I have added your change to the codebase and rerun the experiments on two new AWS instances. The mean and std from 3 experiments (each averaging 5 runs) are 71.92+-0.21%, 71.97+-0.17%, and 71.93+-0.38%. You can also download all the data for these 3 experiments here 1 2 3.

I am also curious about the reason for the difference. I have added a new branch named fix_randomization that makes the results more reproducible by controlling the random seeds. Would you like to try running the experiments again using the new branch on an AWS instance and let me know if anything changes?

Thanks.

koenvanderveen commented 5 years ago

Hi! I ran the experiments again on the fix_randomization branch, but the results did not change (still around 70%). Did you re-download the data before running the experiments? I cannot think of any other source of randomness at this point, but the difference is quite consistent.

koenvanderveen commented 5 years ago

OK, I finally found the source of the difference. I had been using a newer version of the Deep Learning AMI in AWS; I ran the experiments with v10 now and got matching results (around 71.7%). It would be interesting to know which operations changed.

crazydonkey200 commented 5 years ago

Thanks for reporting this and for running the experiments to confirm it!

That's interesting. I would also like to look into this. What is the newer version of the Deep Learning AMI you used? Is it Deep Learning AMI (Ubuntu) Version 21.0 - ami-0b294f219d14e6a82? And how do you launch instances with previous versions, for example v10? Thanks!

dungtn commented 5 years ago

Hi there :-)

I'm trying to replicate the results on WikiTableQuestions. I tried TensorFlow v1.12.0 (Deep Learning AMI 21.0) and v1.8.0 (Deep Learning AMI 10.0). The corresponding accuracies are 41.12% for v1.12.0 and 43.27% for v1.8.0. It looks like the difference comes from the TensorFlow version.

Also, are the current settings in run.sh the ones used to produce the learning curve in the image? The number of steps is set to 25,000, while in the picture the number of steps is around 30,000. Also, max_n_mem is set to 60, which caused warnings like Not enough memory slots for example.... I changed it to 100, but I'm not sure if that is the right thing to do? Thanks!

crazydonkey200 commented 5 years ago

Hi, thanks for the information :) I will run some experiments to compare TF v1.12.0 vs v1.8.0.

The current settings in run.sh are the ones used to produce the result in the paper. The image was produced from an old setting that trains for 30k steps. Thanks for pointing it out; I will replace the image with a run under the current settings.

max_n_mem was set to 60 for the sake of speed. When the table is large and requires more than 60 memory slots, some columns are dropped (the reason for the Not enough memory slots for example... warning). Changing it to 100 would probably achieve a similar or better result because no columns would be dropped, but the training will be slower.
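
To illustrate the trade-off (a simplified, hypothetical sketch, not the actual preprocessing code in the repo):

# Hypothetical illustration of the max_n_mem trade-off: every column needs a
# memory slot, so a wide table loses columns when the slot budget is too small.
def select_columns(columns, max_n_mem, n_reserved_slots):
  # Slots left for table columns after reserving some for other entities.
  n_available = max_n_mem - n_reserved_slots
  if len(columns) > n_available:
    print('Not enough memory slots for example, dropping %d columns.' %
          (len(columns) - n_available))
    return columns[:n_available]
  return columns

# With max_n_mem=60 a 70-column table would be truncated; with max_n_mem=100
# it would fit, at the cost of slower training.
wide_table = ['col_%d' % i for i in range(70)]
print(len(select_columns(wide_table, 60, 20)))   # 40 columns kept
print(len(select_columns(wide_table, 100, 20)))  # all 70 columns kept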

crazydonkey200 commented 5 years ago

As an update, I have created a branch named reproducible that can run training deterministically. Because it is hard to make TensorFlow deterministic when using a GPU (see here for more info) and when running with multiprocessing, this branch uses only 1 trainer and 1 actor for training, so training is very slow (about 44 hours to finish one training run, versus 2-3 hours on the master branch). This branch uses tensorflow-gpu==1.12.0.

This setting gets slightly lower results on WikiTable (41.51+-0.19% dev accuracy, 42.78+-0.77% test accuracy). Below are the commands to reproduce the experiments (after pulling the latest version of the repo):

git checkout reproducible
cd ~/projects/neural-symbolic-machines/table/wtq/
./run_experiments.sh run_rpd.sh mapo mapo_rpd

dungtn commented 5 years ago

Can you add more details about dataset preprocessing? For example, how to generate the all_train_saved_programs.json file?

guotong1988 commented 5 years ago

Where did you get stop_words.json?

crazydonkey200 commented 3 years ago

@dungtn Here's a detailed summary created by another researcher on how to replicate the preprocessing and experiments starting from the raw WikiTableQuestions dataset, and how to adapt the code to other similar datasets. I have also added a link to this summary to the README.

@guotong1988 Unfortunately I don't remember exactly where I got the list in stop_words.json, but it seems to be a subset of the NLTK stop words, for example as found here.