anirudh9119 / RIMs

Code for "Recurrent Independent Mechanisms"

question about rim performance on sequential mnist #2

Open ziyuwwang opened 4 years ago

ziyuwwang commented 4 years ago

Hello, thank you for releasing the implementation of RIMs. I am trying to reproduce your results with the released code, but I cannot reach the performance reported in the RIMs paper on the sequential MNIST experiment using the following training command: "bash experiment_mnist_1layered.sh 600 6 4". I get a much lower accuracy of "0.78, 0.55, 0.33", compared to "0.90, 0.73, 0.38" in the RIMs paper. Could you provide the exact training command or experimental settings? Thanks.

anirudh9119 commented 4 years ago

Thanks @ziyuwwang for letting me know. It's possible there's some discrepancy somewhere. I'll take a look in the next few days. (I have my PhD qualifier talk in the coming week.)

If you want something quick, there's another replication of our work that does get similar results: https://github.com/dido1998/Recurrent-Independent-Mechanisms

EDIT 1: I ran the code with 510 hidden units, 5 RIMs, and 3 active, with learning rate 0.001. It gets around "0.88, 0.72, 0.39" in about 20 epochs.

ziyuwwang commented 4 years ago

Thank you very much for your help. I will try it according to your instructions. Would you mind providing specific settings for other experiments at your convenience?

anirudh9119 commented 4 years ago

Hello @ziyuwwang,

This is the configuration (copied from my code, used in the paper) that gave approximately the same result.

Learning rate: 0.001, 600 hidden units, 6 RIMs, top_k = 4, dropout = 0.2.

At https://github.com/anirudh9119/RIMs/blob/master/event_based/blocks_core.py#L47, change d_k = 32 and d_v = 32. At https://github.com/anirudh9119/RIMs/blob/master/event_based/blocks_core.py#L49, change self.att_out = 400.
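To collect these suggested settings in one place, here is a minimal sketch; the dictionary below is purely illustrative (it is not the repo's code or argument parser), and the last three entries correspond to the two in-file edits above.

```python
# Illustrative summary (not the repo's code) of the configuration suggested here.
suggested_config = {
    "lr": 0.001,        # learning rate
    "drop": 0.2,        # encoder dropout
    "nhid": 600,        # total hidden units
    "num_blocks": 6,    # number of RIMs
    "topk": 4,          # active RIMs per step
    # edits inside event_based/blocks_core.py:
    "d_k": 32,          # attention key dimension (line 47)
    "d_v": 32,          # attention value dimension (line 47)
    "att_out": 400,     # self.att_out (line 49)
}

if __name__ == "__main__":
    for name, value in suggested_config.items():
        print(f"{name} = {value}")
```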

I ran it, and it did give the same results.

0.90234375, 0.7203525641025641, 0.3892227564102564.

I've not systematically studied what made the difference.

anirudh9119 commented 4 years ago

Also, note that we used the above configuration for most of our experiments. The only thing to be careful of:

  1. Usage of dropout. For bouncing balls, using dropout in the encoder seems to hurt the results, and that makes sense: there's already dropout in the attention parameters, and there's also the bottleneck of attention on top of that.

That said, I admit that for the paper we did not use any dropout for bouncing balls, and only recently did we figure out that using dropout in the encoder hurts the results.

Hope that helps!

You may also be interested in https://arxiv.org/abs/2006.16225, which fixes a big issue in the current work (the exchangeability of different modules).

ziyuwwang commented 4 years ago

It's very nice of you to give so many useful suggestions, @anirudh9119! But I still cannot get a result close to that of the original paper after running the released code with exactly the settings you provided above. I am wondering:

  1. Is there some discrepancy between your code and the released code?
  2. Are the expected validation accuracies for the different resolutions achieved with the same checkpoint or with different checkpoints?

Thanks for your prompt reply!

anirudh9119 commented 4 years ago

It's very nice of you to give so many useful suggestions, @anirudh9119! But I still cannot get a result close to that of the original paper after running the released code with exactly the settings you provided above.

I apologize for wasting your time. It should have been something that works on the first attempt.

Is there some discrepancy between your code and the released code?

I can check again, but it seems there's no discrepancy as of now.

Are the expected validation accuracies for the different resolutions achieved with the same checkpoint or with different checkpoints?

You should look at "Test Optim".

@ziyuwwang It's exactly the same as in the training script. The expected validation accuracies for the different resolutions come from different checkpoints (the checkpoint depends on the validation data for that resolution). We do this both for the proposed method and for all the baselines (LSTM/RMC etc.). The reason: it's not obvious that the best i.i.d. accuracy on the training distribution would also give the best out-of-distribution accuracy, hence we do model selection per resolution. Indeed, we observed that this affected LSTMs the most adversely. Let me know if this resolves your question; if not, I can look into what the discrepancy is.
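To make the selection procedure concrete, here is a minimal sketch of this "Test Optim"-style model selection per resolution; the function name and record structure are illustrative assumptions, not the repo's API.

```python
# Sketch of per-resolution checkpoint selection ("Test Optim"): for each test
# resolution, pick the epoch with the best validation accuracy on that
# resolution and report the test accuracy of that same checkpoint.
def test_optim(history):
    """history: list of per-epoch records, e.g.
       {"epoch": 3,
        "valid": {"16x16": 0.91, "19x19": 0.70, "24x24": 0.35},
        "test":  {"16x16": 0.90, "19x19": 0.72, "24x24": 0.39}}"""
    results = {}
    for res in history[0]["valid"]:
        best = max(history, key=lambda record: record["valid"][res])
        results[res] = {"epoch": best["epoch"], "test_acc": best["test"][res]}
    return results
```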

An independent replication also obtains results similar to those reported in the paper: https://github.com/dido1998/Recurrent-Independent-Mechanisms

ziyuwwang commented 4 years ago

@anirudh9119 I reached a result of about "0.84, 0.70, 0.38" for "Test Optim" with the setting of 510 hidden units, 5 RIMs, and 3 active, with learning rate 0.001. I set d_k = 32, d_v = 32, dropout = 0.2, and self.att_out = 400 in blocks_core.py. These changes do improve the performance, but I suggest you check the code and post the correct settings. For example, this line https://github.com/anirudh9119/RIMs/blob/610aa6c80bf72e1bd6228ccfea05026f337b02ed/event_based/blocks_core.py#L78 seems to be unused and redundant. I guess there must be some discrepancy.

anirudh9119 commented 4 years ago

@ziyuwwang I think I suggested above to use:

600 hidden units, 6 RIMs, top_k = 4.

alexmlamb commented 4 years ago

null_score = iatt.mean((0,1))[1]

Yes, this line was used for logging (to see the activation scores). It's not used in making the activation mask.
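For context, here is a minimal sketch of how a top-k activation mask over RIMs can be built from per-RIM input-attention scores; the function, tensor shapes, and names are illustrative assumptions, not the repo's exact implementation. A logged quantity like null_score above plays no role in this computation.

```python
import torch

def topk_activation_mask(input_scores: torch.Tensor, topk: int) -> torch.Tensor:
    """input_scores: (batch, num_blocks) attention paid by each RIM to the
    real input (as opposed to the null input). Returns a 0/1 mask marking
    the top-k most activated RIMs per example."""
    indices = torch.topk(input_scores, k=topk, dim=1).indices
    mask = torch.zeros_like(input_scores)
    mask.scatter_(1, indices, 1.0)
    return mask

# Example: 2 examples, 6 RIMs, 4 active per step.
scores = torch.rand(2, 6)
print(topk_activation_mask(scores, topk=4))
```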

antoniogois commented 4 years ago

Hi everyone! I tried 100 epochs using both the original configuration from the repo for sequential MNIST and @anirudh9119's suggestion in this thread.

Just to recap, the parameters of original_config are: --cuda --cudnn --algo blocks --name Blocks_MNIST/original_conf --lr .0007 --drop 0.5 --nhid 600 --num_blocks 6 --topk 4 --nlayers 1 --emsize 600 --log-interval 100

The parameters of Anirudh_config are: --cuda --cudnn --algo blocks --name Blocks_MNIST/anirudh --lr .001 --drop 0.2 --nhid 600 --num_blocks 6 --topk 4 --nlayers 1 --emsize 600 --log-interval 100

In Anirudh_config I'm applying both code changes suggested above:

At https://github.com/anirudh9119/RIMs/blob/master/event_based/blocks_core.py#L47, change d_k = 32 and d_v = 32. At https://github.com/anirudh9119/RIMs/blob/master/event_based/blocks_core.py#L49, change self.att_out = 400.

[I also ran Anirudh_config, but with the original dropout = 0.5, by accident.] The parameters of Anirudh_drop05 are: --cuda --cudnn --algo blocks --name Blocks_MNIST/anirudh_drop05 --lr .001 --drop 0.5 --nhid 600 --num_blocks 6 --topk 4 --nlayers 1 --emsize 600 --log-interval 100

These were the results I got:

original_config: 83.7, 69.3, 45.3 (test values at epochs 16, 12, and 12 respectively [when the best validation values occurred])
Anirudh_config: 83.1, 55.6, 35.2 (test values at epochs 16, 4, and 12 respectively [when the best validation values occurred])
Anirudh_drop05: 82.2, 54.9, 29.5 (test values at epochs 24, 4, and 4 respectively [when the best validation values occurred])

For the third setting (24x24 resolution) I got even better results than in the paper, but for the first two (16x16 and 19x19) I couldn't reach the values in the paper, like @ziyuwwang.

One final note: it seems like the training is deterministic on my machine (kudos for that!), but maybe something changes from one machine to another?

anirudh9119 commented 4 years ago

Thanks @antoniogois. I'll see what's making the difference.

anirudh9119 commented 4 years ago

Maybe a moot point, but still: what PyTorch version are you using, @antoniogois?

antoniogois commented 4 years ago

using 1.6.0

anirudh9119 commented 4 years ago

@antoniogois @ziyuwwang I'm investigating. I'm not sure what's going wrong as of now.

I ran again with https://drive.google.com/file/d/1KKz3YEyZJ4-2XY40d7akrrrBynG9elVT/view?usp=sharing and got the same results. (It's the official code I used, and it's the same as what's here in the repo.) I've uploaded the details of my conda env. I'm investigating what happens when changing the PyTorch version. Will keep you updated.

antoniogois commented 4 years ago

I think you need to provide access to that Google Drive file :) Let me know if I can help with anything regarding this issue; maybe I can try re-running with the same PyTorch version as you [after I have access to the Google Drive file].

anirudh9119 commented 4 years ago

Sorry, changed it. The code in the zip and the code here are the same. One of my colleagues was also not able to reproduce the results, so I'm looking into it. Thanks for your patience.

anirudh9119 commented 4 years ago

@antoniogois @ziyuwwang With everything installed from scratch, I was also able to "reproduce" the issue (i.e., the results don't match).

Training logs for the setting where I was able to reproduce the results in the paper are here: https://gist.github.com/anirudh9119/f7d36c9eac054c3d712ed961382750c1

I'm investigating it. Sorry for the problem.

rederoth commented 3 years ago

@anirudh9119 I'm sorry, but I'm very confused. Do I understand correctly that for the performance in the paper you report results obtained from different training epochs? So you are basically testing with differently trained networks? Isn't the whole point that the (same) model should work irrespective of the input length?

luigiquara commented 4 months ago

Hi everyone! I'm also experiencing problems in reproducing the results reported in the paper. Are there any updates on this question?

Thank you!