anirudh9119 / RIMs

Code for "Recurrent Independent Mechanisms"

Difficulty reproducing copying task results #8

Open bmistry4 opened 2 years ago

bmistry4 commented 2 years ago

Hi,

I've been struggling to reproduce your paper results for the copying task.

In particular, even after a few epochs, an LSTM with 300 hidden dims seems to easily get both a train AND TEST loss much lower than the reported results. The test loss is also much lower than the train loss (e.g. 0.09 vs 0.30 at 50 epochs). I ran the following to get this: train_copying.py --cuda --cudnn --algo lstm --lr 0.001 --drop 0.5 --nhid 300 --nlayers 1 --emsize 300 --train_len 50 --test_len 200

Furthermore, RIMs (with 600 hidden dims) show the same trend of the test loss being lower than the train loss. E.g., after 20 epochs the test loss is 0.07 while the train loss is still 1.26. I ran the following to get this: train_copying.py --cuda --cudnn --algo blocks --lr 0.001 --drop 0.5 --nhid 600 --num_blocks 6 --topk 4 --nlayers 1 --emsize 600 --log-interval 100 --train_len 50 --test_len 200
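For reference, this is roughly how I understand the copying task to be set up (my own sketch, not necessarily identical to what train_copying.py actually does), with train_len/test_len controlling the length of the blank gap the model has to remember across:

```python
import torch

def copying_batch(batch_size, seq_len, n_symbols=8, copy_len=10):
    """Rough sketch of a copying-task batch (my understanding only,
    not the repo's exact data pipeline).
    Input:  [copy_len symbols] [seq_len blanks] [delimiter] [copy_len blanks]
    Target: blanks everywhere except the last copy_len positions,
            which must reproduce the initial symbols."""
    blank, delim = n_symbols, n_symbols + 1          # two extra special tokens
    symbols = torch.randint(0, n_symbols, (batch_size, copy_len))

    x = torch.full((batch_size, copy_len + seq_len + 1 + copy_len), blank)
    x[:, :copy_len] = symbols                        # the pattern to remember
    x[:, copy_len + seq_len] = delim                 # "start copying" marker

    y = torch.full_like(x, blank)
    y[:, -copy_len:] = symbols                       # reproduce the pattern
    return x, y

# --train_len 50 / --test_len 200 would then mean training on a 50-step gap
# but evaluating on a 200-step gap, i.e. testing length generalisation.
x_train, y_train = copying_batch(64, seq_len=50)
x_test,  y_test  = copying_batch(64, seq_len=200)
```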

Are there any differences between the hyperparameters given in the paper/code and those used for the reported results? For example, the code hardcodes the communication attention's key and value sizes to 16, whereas Table 3 in the paper's appendix sets them to 32.

(I have already made the changes mentioned in https://github.com/anirudh9119/RIMs/issues/5 to get the LSTM algorithm running.) Thanks.

alexmlamb commented 2 years ago

With RIMs and a sufficiently small topk, the test loss should eventually reach zero, while the LSTM loss should remain non-zero. Are you still observing this?

It's interesting if you can find hyperparameters that make the test loss lower for the LSTM baseline, but in my view the most salient thing is that the LSTM's test loss is non-zero. You could also try increasing test_len and seeing whether the LSTM's loss goes up as you increase it.
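Something along these lines (just a sketch, reusing the flags from the command you quoted above with train_len fixed at 50) would sweep the test length for the LSTM baseline:

```python
import subprocess

# Rough sketch of a test_len sweep for the LSTM baseline; flags copied
# from the command quoted earlier in this thread.
for test_len in [50, 100, 200, 400]:
    subprocess.run([
        "python", "train_copying.py", "--cuda", "--cudnn",
        "--algo", "lstm", "--lr", "0.001", "--drop", "0.5",
        "--nhid", "300", "--nlayers", "1", "--emsize", "300",
        "--train_len", "50", "--test_len", str(test_len),
    ], check=True)
```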

bmistry4 commented 2 years ago

Hi, thanks for the quick reply. I managed to reproduce the expected trends after running for more epochs.

I was just wondering, though: could you explain the purpose of adding the residual connection in the communication attention (see https://github.com/anirudh9119/RIMs/blob/master/event_based/attention.py#L204) when using RIMs? (I couldn't find an explanation in the paper.)
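To make sure I'm asking about the right thing, here is the pattern reduced to a minimal sketch (using a generic nn.MultiheadAttention rather than the repo's actual attention.py code): the linked line appears to add the communication attention's output back onto the per-RIM hidden states.

```python
import torch
import torch.nn as nn

# Minimal sketch of the pattern in question (generic multi-head attention,
# not the repo's BlocksCore/attention.py implementation).
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

h = torch.randn(32, 6, 64)            # (batch, num_blocks, hidden per block)
comm, _ = attn(h, h, h)               # RIMs attend over each other

h_no_residual = comm                  # plain attention output
h_residual    = h + comm              # what the linked line appears to do
```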

Thanks again.