Provided checkpoint model is inconsistent with the latest code

lym29 commented 9 months ago

Hi, when I try to run the eval code using the checkpoints you provided, some parameters have mismatched sizes. I tried to figure it out myself, but none of my attempts worked. Could you please provide a little help? Thanks a lot!

Here is the output for the ipdf model:

RuntimeError: Error(s) in loading state_dict for IPDFModel:
size mismatch for net.implicit_model.feat_mlp.weight: copying a param with shape torch.Size([256, 64]) from checkpoint,
the shape in current model is torch.Size([256, 128]).
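
One way to list every mismatched parameter at once, rather than stopping at the first error, is to compare the shapes stored in the checkpoint with the shapes the current model expects. The sketch below is a minimal illustration on plain dictionaries of name-to-shape tuples; the helper name `diff_shapes` is hypothetical and not part of UniDexGrasp.

```python
def diff_shapes(ckpt_shapes, model_shapes):
    """Return {name: (ckpt_shape, model_shape)} for parameters whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        # Only report parameters present on both sides with different shapes.
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes taken from the error message above.
ckpt_shapes = {"net.implicit_model.feat_mlp.weight": (256, 64)}
model_shapes = {"net.implicit_model.feat_mlp.weight": (256, 128)}
print(diff_shapes(ckpt_shapes, model_shapes))
```

With PyTorch, the two dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`, once from the loaded checkpoint and once from `model.state_dict()`.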

lym29 commented 9 months ago

self.mlp_layer_sizes = cfg["model"]["mlp_layer_sizes"]  # [256, 256, 256] 
if cfg["model"]["network"]["type"] == "epn_net": 
    self.input_feat_dim = cfg["model"]["network"]["equi_feat_mlps"][-1]  # 64 
elif cfg["model"]["network"]["type"] in ["pn_rotation_net", "kpconv_rot_net"]: 
    self.input_feat_dim = 128

In ipdf_network.py, the input feature dimension is 128 when the type of ImplicitModel is 'pn_rotation_net'. But after loading the checkpoint in RotationNet/Dv3Rot, input_feat_dim comes out as 64. So I checked the config file ipdf.yaml, which reads:

network:
  type: pn_rotation_net

number_fourier_components: 1
mlp_layer_sizes: [256, 256, 256]
num_train_queries: 4096

loss_weight:
  log_prob: 1.0

So I replaced 'pn_rotation_net' with 'epn_net'. But a new error appears when running the eval code, indicating that no value is given for equi_feat_mlps.

omegaconf.errors.ConfigKeyError: Key 'equi_feat_mlps' is not in struct full_key: model.network.equi_feat_mlps object_type=dict
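
The lookup behind this error can be reproduced with a plain dictionary. The following is a rough sketch of the branch quoted earlier, assuming the config is a nested dict; `resolve_input_feat_dim` is a hypothetical helper, and the real code reads an OmegaConf object instead.

```python
def resolve_input_feat_dim(cfg):
    """Mirror the branch in ipdf_network.py that picks the backbone feature size."""
    network = cfg["model"]["network"]
    net_type = network["type"]
    if net_type == "epn_net":
        # epn_net takes its feature size from the last entry of equi_feat_mlps,
        # so the key must be present in the config.
        if "equi_feat_mlps" not in network:
            raise KeyError("model.network.equi_feat_mlps is required for epn_net")
        return network["equi_feat_mlps"][-1]
    if net_type in ("pn_rotation_net", "kpconv_rot_net"):
        return 128
    raise ValueError(f"unknown network type: {net_type}")


cfg = {"model": {"network": {"type": "epn_net", "equi_feat_mlps": [64, 64, 64]}}}
print(resolve_input_feat_dim(cfg))  # 64
```

Switching the type to 'epn_net' without also supplying equi_feat_mlps takes the first branch with nothing to read, which is the situation the error reports.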

I am wondering if you could provide a new config file for ipdf.

BTW, I would greatly appreciate it if you could clarify which networks these different types refer to. Thanks!

lym29 commented 9 months ago

I have made several attempts to modify the code. One of the changes I made was replacing 'pn_rotation_net' with 'epn_net' in the rotation net configuration. Additionally, I added an 'equi_feat_mlps' parameter in the ipdf.yaml file, which now appears as follows:

network:
  type: epn_net
  equi_feat_mlps: [64, 64, 64]
  # type: pn_rotation_net

number_fourier_components: 1
mlp_layer_sizes: [256, 256, 256]
num_train_queries: 4096

loss_weight:
  log_prob: 1.0

To address the incompatible dimensions of the geometric features extracted by the backbone, I also modified the PointNet++ encoder: I changed the output dimension of the third linear layer from 128 to 64.

These are all the modifications I have made. I am afraid they do not match your original network architecture.

Please let me know if these modifications align with the desired changes or if there are any further adjustments needed.

Thank you!

XYZ-99 commented 9 months ago

I suspect the problem is with the checkpoint, not the config. We probably uploaded the ckpt of the wrong network version. We're looking into that and will notify you when the correct one is ready.

lym29 commented 9 months ago

Thanks! Looking forward to your update.

XYZ-99 commented 9 months ago

How about this ckpt for IPDF? Remember to roll back to the original config.

lym29 commented 9 months ago

Yes, this ckpt for IPDF loads correctly. Thank you so much. But I ran into another error in eval:

Traceback (most recent call last):
  File "network/eval.py", line 189, in <module>
    main(cfg)
  File "network/eval.py", line 83, in main
    pred_dict, _ = trainer.test(data)
  File "/home/liuym/Project/UniDexGrasp/dexgrasp_generation/network/trainer.py", line 201, in test
    self.model.test()
  File "/home/liuym/Project/UniDexGrasp/dexgrasp_generation/network/data/../../datasets/../network/models/model.py", line 139, in test
    self.pred_dict.update(self.net.sample(self.feed_dict))
  File "/home/liuym/Project/UniDexGrasp/dexgrasp_generation/network/data/../../datasets/../network/models/../../network/models/graspglow/glow_network.py", line 117, in sample
    samples, log_prob = self.flow.sample_and_log_prob(self.sample_num, feat)
  File "/home/liuym/Project/UniDexGrasp/dexgrasp_generation/network/data/../../datasets/../network/models/../../network/models/graspglow/glow_network.py", line 36, in sample_and_log_prob
    raise NotImplementedError()

I think it happens because the flow of DexGlow is not initialized before sampling (see https://github.com/PKU-EPIC/UniDexGrasp/blob/e724a94f8260dee888477a2e9d048272dfe4fd7c/dexgrasp_generation/network/models/graspglow/glow_network.py#L116). I replaced this line with the code below, but I am not sure about the input. Is feat supposed to be the context for initialization?

gt = (dic['canon_translation'], dic['hand_qpos']) 
self.flow.initialize(gt, feat)
samples, log_prob = self.flow.sample_and_log_prob(self.sample_num, feat)

lhrrhl0419 commented 9 months ago

The issue occurs because the checkpoint used by glow was generated using an older version of the code and isn't being loaded correctly. We've resolved this problem by implementing the following code changes in dexgrasp_generation/network/trainer.py:

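        # requires: from collections import OrderedDict
        try:
            self.model.load_state_dict(ckpt)
        except RuntimeError:
            # load_state_dict raises RuntimeError on key or shape mismatch;
            # fall back to the old glow checkpoint layout by dropping 'backbone.'
            new_ckpt = OrderedDict()
            for name in ckpt.keys():
                new_ckpt[name.replace('backbone.', '')] = ckpt[name]
            self.model.load_state_dict(new_ckpt, strict=False)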

Additionally, the check in glow_network.py is no longer necessary, so we've removed it.
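
The key renaming in that fallback can be tried in isolation. This is a small sketch on a plain mapping; `strip_key_fragment` is a hypothetical name, and the example keys are made up for illustration.

```python
from collections import OrderedDict


def strip_key_fragment(ckpt, fragment="backbone."):
    """Return a copy of ckpt with `fragment` removed from every key, order preserved."""
    new_ckpt = OrderedDict()
    for name, value in ckpt.items():
        new_ckpt[name.replace(fragment, "")] = value
    return new_ckpt


# Made-up keys; a real checkpoint maps parameter names to tensors.
old = OrderedDict([("net.backbone.conv.weight", 1), ("net.flow.scale", 2)])
print(list(strip_key_fragment(old)))  # ['net.conv.weight', 'net.flow.scale']
```

Because `load_state_dict(..., strict=False)` does not raise on keys that still fail to match, inspecting the `missing_keys` and `unexpected_keys` fields of its return value is a reasonable sanity check after loading.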

lym29 commented 9 months ago

I tried the updated code and it works perfectly. Thanks!
