Hi @IsaacKam--thanks for raising this issue!
Good catch. It looks like I accidentally set the wrong default output size for the segmentation decoders: it should be 64 channels, not 128. I just pushed a fix, which you can pick up on your end by running the following:
pip uninstall visualpriors
pip install https://github.com/alexsax/midlevel-reps/archive/visualpriors-v0.3.1.zip
Aside from the above shape issues, I also want to note that I imagine the decodings are primarily useful for debugging. Visualizing those outputs will give you confidence that everything is working correctly.
For learning, though, I've found the encodings to be generally more useful than the decodings. This is because the encodings all have a homogeneous shape (8 x 16 x 16), while the decodings can take various forms: for example, segment_unsup2d produces a 64-channel image, while class_object is a 1000-dimensional vector. And using the encodings doesn't really sacrifice anything: I've anecdotally found that downstream performance with the encodings is usually at least as good as, if not better than, with the decodings.
I'm closing this issue for now, but if the above doesn't solve your problem then please feel free to reopen.
This is really useful :), thank you for the prompt reply. For learning from the encodings, what would you recommend as the best way to utilise them? I.e. would you flatten them at this point and apply linear layers, or is there a benefit to applying some conv layers here?
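For concreteness, the two options I have in mind would look roughly like this (just a sketch; num_actions is a placeholder for my output size):

import torch.nn as nn

num_actions = 4  # placeholder for whatever my downstream output size is

# Option A: flatten the 8 x 16 x 16 encoding and go straight to linear layers
flat_head = nn.Sequential(
    nn.Flatten(),                    # [B, 8, 16, 16] -> [B, 2048]
    nn.Linear(8 * 16 * 16, 256),
    nn.ReLU(),
    nn.Linear(256, num_actions),
)

# Option B: a couple of conv layers on the spatial encoding first, then flatten
conv_head = nn.Sequential(
    nn.Conv2d(8, 32, kernel_size=3, stride=2, padding=1),   # [B, 32, 8, 8]
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # [B, 64, 4, 4]
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, num_actions),
)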
Hi Alex, it now seems to output a torch.Size([1, 64, 256, 256]) tensor when I use 'segment_unsup2d'. Is that correct? If so, what do the channels represent (different segments?)
Thanks for this great piece of work! When I change the template code from 'normal' to 'segment_unsup2d' like this:
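(Roughly this, adapted from the README template; my actual image path and preprocessing differ slightly:)

from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors

# Load an image and rescale/resize to [-1, 1] and 3 x 256 x 256, as in the template
image = Image.open('test.png').convert('RGB')
x = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
x = x.unsqueeze(0)

# Changed the task from 'normal' to 'segment_unsup2d' here
pred = visualpriors.feature_readout(x, 'segment_unsup2d', device='cpu')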
I get the following error, which is being caused by the feature_readout function:
Let me know if I'm doing something wrong.