sgdgp opened this issue 4 years ago
Hi @sgdgp, some users have reported that they had to train the model themselves instead of using the pretrained models. I still haven't figured out the source of this issue, but it seems like only certain users are affected by this. I'll mention it in the FAQ.
You might have better luck with the Docker setup.
Thanks @MohitShridhar. Also, is the training done with decoder teacher forcing enabled?
Ah no, leave it at the default False. You can use the settings specified in the training example.
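For context, here is a minimal generic sketch (illustrative only, not the ALFRED implementation) of what a decoder teacher-forcing flag typically toggles: whether the decoder is fed the gold token or its own previous prediction at each step.

```python
import torch
import torch.nn as nn

# Generic sketch of what a "decoder teacher forcing" flag usually controls
# (illustrative only; not the ALFRED seq2seq code).
class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=10, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, targets, h, teacher_forcing=False):
        # targets: (batch, seq_len) gold token/action ids
        tok = targets[:, 0]
        logits_seq = []
        for t in range(1, targets.size(1)):
            h = self.cell(self.embed(tok), h)
            logits = self.out(h)
            logits_seq.append(logits)
            if teacher_forcing:
                tok = targets[:, t]          # feed the gold token back in
            else:
                tok = logits.argmax(dim=-1)  # feed the model's own prediction
        return torch.stack(logits_seq, dim=1), h

decoder = TinyDecoder()
targets = torch.randint(0, 10, (4, 6))
h0 = torch.zeros(4, 32)
logits, _ = decoder(targets, h0, teacher_forcing=False)  # default discussed above
print(logits.shape)  # (4, 5, 10)
```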
Oh I see. Thanks!
Not sure if this is causing the issue, but check that your versions of torch and torchvision are consistent with requirements.txt
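For a quick check, a minimal sketch for printing the installed versions so they can be compared against the pins in requirements.txt:

```python
import torch
import torchvision

# Compare the printed versions with the ones pinned in requirements.txt.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)
```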
I am trying with the Dockerfile. I will update on the status soon.
@sgdgp Have you managed to reproduce the paper's results after all?
@MohitShridhar How did you choose the model that produces the results in your paper? I tried both best_seen and best_unseen, and both perform worse.
@IgorDroz we picked the best_seen model.
You can try training the model yourself (from scratch) if the problem persists.
@MohitShridhar
I trained from scratch and got (valid_seen): SR: 12/820 = 0.015, GC: 172/2109 = 0.082, PLW SR: 0.011, PLW GC: 0.072
while in the paper you achieved: SR: 0.032, GC: 0.1, PLW SR: 0.021, PLW GC: 0.07
The only difference between my setup and yours is the initialization, but you still got 2x better results in SR.
Additionally, I wanted to ask about testing. Is it only done via submission? Or, since the challenge has finished, will you be able to release code with the actual GT of the test set?
Thanks, Igor
> The only difference between my setup and yours is the initialization, but you still got 2x better results in SR.
Sorry, what's the initialization difference? And also, is this inside a Docker container?
> Or, since the challenge has finished, will you be able to release code with the actual GT of the test set?
No. The leaderboard is a perpetual benchmark for ALFRED. As with any benchmark in the community, the test set will remain a secret to prevent cheating/overfitting. To evaluate on the test set, use the leaderboard submission.
@MohitShridhar The initialization of the neural net, i.e. the initial weights. And no, it is not inside a Docker container.
@IgorDroz can you report your torch and torchvision versions along with your CUDA and GPU specs? Also, which resnet checkpoint are you using from torchvision?
@MohitShridhar torch==1.1.0, torchvision==0.3.0, CUDA Version: 11.1, GPU: Tesla K80, NVIDIA Driver Version: 455.23.05
How can I check the resnet checkpoint?
@IgorDroz, it's usually inside $HOME/.cache/torch/checkpoints/. I am using resnet34-333f7ec4.pth.
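For reference, a minimal sketch for listing the cached resnet checkpoints (assuming the default model-zoo cache location above; newer torch versions may use ~/.cache/torch/hub/checkpoints instead):

```python
import glob
import os

# List resnet checkpoints cached by torchvision's model zoo.
cache_dir = os.path.expanduser("~/.cache/torch/checkpoints")
for ckpt in sorted(glob.glob(os.path.join(cache_dir, "resnet*"))):
    print(os.path.basename(ckpt))
```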
@MohitShridhar Sorry for the late answer. So that is probably the difference: I use resnet18-5c106cde.pth. Now it makes sense, thanks!
Oops, sorry. I just checked again. I am also using resnet18-5c106cde.pth, so it's probably not the issue.
The next thing to try would be running this inside Docker to make sure the setup is exactly the same.
@MohitShridhar Hi again,
Just saw your answer. I am still not able to reproduce your results. Docker shouldn't really matter, since the environment is the same and I should be able to get results similar to yours...
A recap of what I tried and what I got:
I used your pre-trained model (https://github.com/askforalfred/alfred/tree/master/models#pre-trained-model) and ran evaluation. The results are: SR: 8/820 = 0.01, GC: 143/2109 = 0.068, PLW SR: 0.003, PLW GC: 0.038
Which results did you achieve with this model? They are pretty far from what you reported in the paper: SR: 0.032, GC: 0.1, PLW SR: 0.021, PLW GC: 0.07
I also trained from scratch and got: SR: 8/820 = 0.01, GC: 143/2109 = 0.068, PLW SR: 0.007, PLW GC: 0.049 (which is quite similar to the results I got using your pretrained model)
This time I used a P100 GPU like you, yet the results are different. How can that be? I will attach my packages:
ai2thor==2.1.0 cached-property==1.5.2 certifi==2020.12.5 chardet==4.0.0 click==7.1.2 cycler==0.10.0 decorator==4.4.2 Flask==1.1.2 h5py==3.1.0 idna==2.10 itsdangerous==1.1.0 Jinja2==2.11.2 kiwisolver==1.3.1 MarkupSafe==1.1.1 matplotlib==3.3.3 networkx==2.5 numpy==1.19.5 opencv-python==4.5.1.48 pandas==1.2.0 Pillow==8.1.0 progressbar2==3.53.1 protobuf==3.14.0 pyparsing==2.4.7 python-dateutil==2.8.1 python-utils==2.4.0 pytz==2020.5 PyYAML==5.3.1 requests==2.25.1 revtok==0.0.3 six==1.15.0 tensorboardX==1.8 torch==1.1.0 torchvision==0.3.0 tqdm==4.56.0 urllib3==1.26.2 vocab==0.0.5 Werkzeug==1.0.1
@IgorDroz Docker is a way to ensure that the setup is completely identical (CUDA, torch, torchvision, etc.).
Check out this work, and their reproduced results. Their models are also substantially better than the baselines reported in the ALFRED paper.
I am not sure what else could be causing this issue. Sorry.
@MohitShridhar I will definitely check their work out, thanks! I noticed that there is another work with even better results on the leaderboard; do you have their paper by any chance?
@IgorDroz I don't think the leaderboard topper has made their paper/code publicly available. It's probably a recent submission (or to be submitted), so you'd have to wait for the anonymity period to end.
@MohitShridhar okay, thanks a lot!
Cannot reproduce the results either using the pre-trained best_seen model (and resnet18-5c106cde.pth). I'm on torch==1.9.0 (py3.7_cuda10.2_cudnn7.6.5_0), and the results look similar to the ones posted above by other users.
SR: 8/820 = 0.010 GC: 142/2109 = 0.067 PLW SR: 0.003 PLW GC: 0.038
Was anyone able to reproduce the results at all? Just asking.
Hi, thanks for the amazing dataset and for sharing your code. I am unable to reproduce the results for the seen validation set. I downloaded the checkpoints you provided and I am using best_seen.pth. I am getting SR 0.0097 and GC 0.0659, whereas the result on val seen in the paper is SR 0.037 and GC 0.1.
Could you point out anything I might have missed?
For starting the X server I used:
sudo nvidia-xconfig -a --use-display-device=None --virtual=1024x786
sudo /usr/bin/X :0 &
I get two warnings:
UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details. warnings.warn("Default upsampling behavior when mode={} is changed "
UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead. warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
The second warning won't affect the results, but I wanted to confirm whether upsampling with align_corners was intended, or whether the warning appeared earlier too and I should just ignore it.
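For reference, a minimal sketch (independent of the ALFRED code) showing the two interpolation behaviors the first warning refers to; passing align_corners explicitly silences the warning:

```python
import torch
import torch.nn.functional as F

# Since torch 0.4.0 the default for bilinear upsampling is align_corners=False.
x = torch.arange(4.0).view(1, 1, 2, 2)

up_new = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)  # current default
up_old = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)   # pre-0.4.0 behavior

print(up_new)
print(up_old)
```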