ChengyueGongR / Frequency-Agnostic

Code for NIPS 2018 paper 'Frequency-Agnostic Word Representation'

Problems on MoS-AWD-LSTM-LM #2

Open · takase opened this issue 5 years ago

takase commented 5 years ago

Hi, I'm trying to reproduce your paper's results but I haven't been able to achieve them. I tried two settings for Penn Treebank. The first one is based on the descriptions in the paper:

python main.py --batch_size 12 --data penn --dropouti 0.55 --dropouth 0.2 --seed 141 --nonmono 15 --epoch 500 --dropoutl 0.3 --lr 20 --nhid 960 --nhidlast 620 --emsize 280 --n_experts 15 --single_gpu --moment --adv

In this setting, I achieved 59.47 in valid and 56.96 in test but the scores reported in the paper are 57.55 in valid and 55.23 in test.

The second one uses the MoS setting described on its GitHub page (https://github.com/zihangdai/mos) with FRAGE:

python main.py --batch_size 12 --data penn --dropouti 0.4 --dropouth 0.225 --seed 28 --nonmono 15 --epoch 500 --dropoutl 0.29 --lr 20 --nhid 960 --nhidlast 620 --emsize 280 --n_experts 15 --single_gpu --moment --adv

This setting achieved scores similar to the paper's, but the fine-tuning results were not as good: I achieved 56.11 in valid and 54.21 in test after fine-tuning, whereas your paper reports 55.52 in valid and 53.31 in test.

So, could you give me any advice for reproducing the results? I used a P100 GPU and CUDA 8.0.

ChengyueGongR commented 5 years ago

Hi takase, other users have also reported this problem to me for pytorch 0.4.0, and at the moment I can only achieve scores similar to the ones you reported (56.02/54.02). I'm now trying to fix this problem. Thanks for your report.

ChengyueGongR commented 5 years ago

See https://github.com/ChengyueGongR/Frequency-Agnostic/issues/2#issuecomment-466385895 .

takase commented 5 years ago

Thank you for sharing the hyper-parameters! But what about dynamic evaluation (dynamiceval)? For PTB, I tried several hyper-parameters but only achieved 47.99 in valid and 47.39 in test with --lamb 0.2, which is my best result on the validation set.

ChengyueGongR commented 5 years ago

I will give the hyper-parameters for dynamic evaluation. Note that dynamic evaluation is very sensitive to its hyper-parameters, so the best choice is to do a grid search as in the official code, https://github.com/benkrause/dynamic-evaluation.
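
For concreteness, a minimal sketch of such a grid search, assuming a dynamiceval.py script that accepts --lr and --lamb flags; the --model flag, checkpoint name, and value grids below are only illustrative:

import itertools
import subprocess

# illustrative grids; in practice, center the ranges around the MoS defaults
lrs = [1e-5, 2e-5, 5e-5, 1e-4]
lambs = [0.001, 0.002, 0.005, 0.01]

for lr, lamb in itertools.product(lrs, lambs):
    print('dynamic evaluation with lr=%g, lamb=%g' % (lr, lamb))
    subprocess.run([
        'python', 'dynamiceval.py',
        '--model', 'finetune_model.pt',  # hypothetical checkpoint name
        '--lr', str(lr),
        '--lamb', str(lamb),
    ], check=True)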

takase commented 5 years ago

Thank you very much! I will try the grid search.

takase commented 5 years ago

I tried the hyper-parameters you recommended for the PTB dataset (https://github.com/ChengyueGongR/Frequency-Agnostic/issues/2#issuecomment-462675646). However, my scores after fine-tuning twice are 57.48 on valid and 54.96 on test. I think these scores are significantly worse than your results.

Do you have any idea about these results? I used a P100 GPU, CUDA 8.0, and pytorch 0.4.1.

In addition, could you tell me suitable hyper-parameters for WikiText-2?

ChengyueGongR commented 5 years ago

I have fixed the bugs and reproduced the result.

I used a P100 GPU, CUDA 9.0, and pytorch 0.4.1.

The command should be:

takase commented 5 years ago

Thank you for telling me your configuration. I re-ran the code based on https://github.com/ChengyueGongR/Frequency-Agnostic/issues/2#issuecomment-466385895. However, I achieved 56.57 in valid and 54.65 in test. As you know, this score is not better than the AWD-LSTM-MoS baseline.

If you have achieved better results, could you upload the trained models? Because I'm running this code on a cloud server, it is difficult to run it again and again due to budget.

takase commented 5 years ago

Thank you for your kindness! If you could also tell me the hyper-parameters for dynamic evaluation, I would be very glad.

ChengyueGongR commented 5 years ago

I found the bug... a very foolish bug. I forgot to re-rank the dictionary according to word frequency in the pytorch 0.4 version of the code, so my technique never took effect.
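
Roughly, the re-ranking means ordering the vocabulary by corpus frequency (presumably because FRAGE's frequent/rare split relies on frequency-ordered indices). A minimal sketch, with hypothetical names rather than the repository's actual data structures:

from collections import Counter

def rank_vocab_by_frequency(corpus_tokens):
    counts = Counter(corpus_tokens)
    # most frequent word gets index 0; ties are broken alphabetically so the
    # ordering is deterministic and ranking twice gives the same result
    words = sorted(counts, key=lambda w: (-counts[w], w))
    word2idx = {w: i for i, w in enumerate(words)}
    return words, word2idx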

For the pytorch 0.2 code and the results on WT2, please wait a few days. I have some other work to do and do not have many GPUs. I will check these things as soon as possible.

Update 3.11: I found another bug. If you rank the vocabulary twice, you get different results, which prevents you from using the pre-trained model or applying post-processing. I will fix this bug soon.

Update 3.13: I have updated the commands, and will release the pre-trained model for AWD-MoS and the command for dynamic evaluation in the coming days.

Update 3.16: 1) I have uploaded the pre-trained model for MoS-PTB, together with the MoS-PTB code for pytorch 0.4.1. The result is 56.00/53.82 (valid/test). 2) I will upload instructions for dynamic evaluation soon. 3) For the pre-trained model and command line for MoS-WT2, please wait a few days. 4) I will reproduce the result on pytorch 0.2 and open a new branch for the pytorch 0.2 code.

simtony commented 5 years ago

Is there any news on the LM model without MoS? I found the implementation buggy (and quite a mess). I tried to fix a few bugs and the indentation errors, but I'm not sure whether my fixes reflect the original intention. Also, the hard switch to ASGD at epoch 120 is very weird.

ChengyueGongR commented 5 years ago

> Is there any news on the LM model without MoS? I found the implementation buggy (and quite a mess). I tried to fix a few bugs and the indentation errors, but I'm not sure whether my fixes reflect the original intention. Also, the hard switch to ASGD at epoch 120 is very weird.

The bug is the same as https://github.com/ChengyueGongR/Frequency-Agnostic/issues/2#issuecomment-470443665. This bug makes reloading the model fail. I will fix it in the coming days; once I finish, I will let you know.

Update: @simtony I've uploaded the pre-trained model together with the code. The result is better than what we reported in the paper for the fine-tuned checkpoint. You can search the hyper-parameters for the post-processing, and you can also make the ASGD switch a hyper-parameter. I believe it's fine, but I do not have enough computation resources to tune it.
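
As a rough illustration of making the switch a hyper-parameter, here is a minimal sketch; the flag name, default, and dummy model are assumptions, not the repository's actual interface:

import argparse
import torch
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=20.0)
parser.add_argument('--asgd_switch', type=int, default=120,
                    help='epoch at which to switch from SGD to ASGD (hypothetical flag)')
args = parser.parse_args([])  # empty list so the sketch runs without CLI arguments

model = nn.Linear(10, 10)  # stand-in for the language model
optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

for epoch in range(1, 201):
    # ... one training epoch would run here ...
    if epoch == args.asgd_switch:
        # switch from SGD to ASGD at the configured epoch instead of a hard-coded 120
        optimizer = torch.optim.ASGD(model.parameters(), lr=args.lr, t0=0, lambd=0.0)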

takase commented 5 years ago

Thank you for uploading the pre-trained model. Did you find hyper-parameters for dynamic evaluation? I applied a grid search to the pre-trained model but could not achieve a good result. I tried the following hyper-parameters:

lrlist = [0.00002, 0.00003, 0.00004, 0.00005, 0.00006, 0.00007, 0.0001]
lamblist = [0.001, 0.002, 0.003, 0.004, 0.005]

In this search space, I achieved 48.84 in valid and 48.01 in test (47.38 and 46.54 in your paper).

ChengyueGongR commented 5 years ago

1) First, I did a grid search (over lr, lamb, and bptt) and only improved the test PPL from 47.9 (using the original MoS hyper-parameters) to 47.7.

2) Then, I guessed that the problem is the pytorch version. In pytorch 0.3 or earlier, we can call model.eval() and still compute gradients in dynamiceval.py, but in pytorch 0.4 we can only call model.train() to do this. Although I have made some changes, there may still be a big gap. Therefore, I rolled back to pytorch 0.2, ran dynamic evaluation with the original MoS hyper-parameters, and got 47.3 test PPL. I believe the main cause is the change in pytorch version, and you can do the grid search on pytorch 0.2.

3) To use pytorch 0.2, you can add a patch to the related code:

import torch

# pytorch 0.2 does not define torch._utils._rebuild_tensor_v2, which checkpoints
# saved under pytorch 0.4 refer to; add a backward-compatible shim if it is missing.
try:
    torch._utils._rebuild_tensor_v2
except AttributeError:
    def _rebuild_tensor_v2(storage, storage_offset, size, stride, requires_grad, backward_hooks):
        tensor = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
        tensor.requires_grad = requires_grad
        tensor._backward_hooks = backward_hooks
        return tensor
    torch._utils._rebuild_tensor_v2 = _rebuild_tensor_v2
# dump source diffs to patch files when loaded modules were saved with different code
torch.nn.Module.dump_patches = True
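
As a usage note (an assumption on my part, not something stated in the repository), the patch is applied before the checkpoint is loaded, roughly:

# run the compatibility shim above first, then load the 0.4-saved checkpoint
# under pytorch 0.2; the file name here is illustrative
model = torch.load('finetune_model.pt', map_location=lambda storage, loc: storage)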

Also, make some minor changes in other parts.

4) Do not use the original search space from dynamic-evaluation. Try a search space around the original MoS hyper-parameters. I have also uploaded a pre-trained model trained with pytorch 0.2; it achieves 47.0 test PPL with the original MoS hyper-parameters.

5) I will think about how to eliminate the performance gap between pytorch 0.4 and earlier versions. If you have any advice, please feel free to contact me.