deepmodeling / Uni-Mol

Official Repository for the Uni-Mol Series Methods
MIT License

Difference of embedding layer dimension between pretrained model and dict.txt #67

Closed · zhoubay closed this 1 year ago

zhoubay commented 1 year ago

Hi there,

I'm trying to load the pretrained molecular weights (https://github.com/dptech-corp/Uni-Mol/releases/download/v0.1/mol_pre_no_h_220816.pt), but using `example_data/molecule/dict.txt` leads to the exception below.

RuntimeError: Error(s) in loading state_dict for MoleculeEmbeddingModel:
        size mismatch for embed_tokens.weight: copying a param with shape torch.Size([31, 512]) from checkpoint, the shape in current model is torch.Size([30, 512]).
        size mismatch for gbf.mul.weight: copying a param with shape torch.Size([961, 1]) from checkpoint, the shape in current model is torch.Size([900, 1]).
        size mismatch for gbf.bias.weight: copying a param with shape torch.Size([961, 1]) from checkpoint, the shape in current model is torch.Size([900, 1]).

I found the cause of this exception: adding the line `self.mask_idx = dictionary.add_symbol("[MASK]", is_special=True)` in unimol/infer.py solves the problem (see https://github.com/dptech-corp/Uni-Mol/blob/27ad2a0dbfafc9795b36efb279d7ed7c6d87a34a/unimol/tasks/unimol.py#L122).
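
For reference, the mismatched shapes line up exactly with the dictionary size: 31 vs 30 rows for embed_tokens, and 961 = 31² vs 900 = 30² for the gbf layers, which appear to be indexed by atom-type pairs. A minimal check along these lines (a sketch; it assumes Uni-Core's Dictionary reports its size via len(), and the checkpoint path is a placeholder):

```python
# Sketch: verify that the checkpoint shapes match a 31-symbol dictionary
# (30 symbols from dict.txt plus the [MASK] special token).
import torch
from unicore.data import Dictionary  # Uni-Core dictionary class

dictionary = Dictionary.load("example_data/molecule/dict.txt")  # 30 symbols
dictionary.add_symbol("[MASK]", is_special=True)                # -> 31 symbols

state = torch.load("mol_pre_no_h_220816.pt", map_location="cpu")["model"]
assert state["embed_tokens.weight"].shape[0] == len(dictionary)      # 31
assert state["gbf.mul.weight"].shape[0] == len(dictionary) ** 2      # 961 = 31 * 31
```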

My question is: why not simply add a [MASK] line to dict.txt to solve this problem?

My point is that whenever we use the checkpoints you provide, this extra line has nothing to do with our own code, yet it is required just to get things running.

What's your concern about this?

ZhouGengmo commented 1 year ago

Could you provide the running script? The downstream task does not require masking, so there is no [MASK] item in the dictionary.

zhoubay commented 1 year ago

> Could you provide the running script? The downstream task does not require masking, so there is no [MASK] item in the dictionary.

The running script is like this:

import os
import torch
from unicore.data import Dictionary
from unimol.models import UniMolModel  # import paths assumed from the Uni-Mol repo

dictionary = Dictionary.load(os.path.join("Uni-Mol/notebooks/results", "dict.txt"))
mask_idx = dictionary.add_symbol("[MASK]", is_special=True)
model = UniMolModel(dictionary)
model_dict = torch.load("Uni-Mol/ckpt_model/mol_pre_no_h_220816.pt")
model.load_state_dict(model_dict["model"], strict=False)

If `mask_idx = dictionary.add_symbol("[MASK]", is_special=True)` is removed, the exception appears, because the checkpoint's `nn.Embedding` was built for 31 tokens instead of 30.
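
As a side note, continuing from the script above: strict=False only silences missing and unexpected keys, while size mismatches still raise a RuntimeError, which is why the exception appears anyway. Plain PyTorch returns the ignored keys, so they can be inspected:

```python
# Standard PyTorch: load_state_dict returns the keys that strict=False ignored.
# Size mismatches (like the one above) still raise a RuntimeError either way.
incompatible = model.load_state_dict(model_dict["model"], strict=False)
print("missing keys:   ", incompatible.missing_keys)
print("unexpected keys:", incompatible.unexpected_keys)
```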

ZhouGengmo commented 1 year ago

Got it. Which task is this issue associated with? [mol/pocket pretrain; mol/pocket property prediction; conf gen; binding pose prediction; binding pose demo; mol repr demo] Also, providing a running script like the one below would help us reproduce the problem and fix it:

data_path="./conformation_generation"  # replace to your data path
results_path="./infer_confgen"  # replace to your results path
weight_path="./save_confgen/checkpoint_best.pt"  # replace to your ckpt path
batch_size=128
task_name="qm9"  # or "drugs", conformation generation task name 
recycles=4

python ./unimol/infer.py --user-dir ./unimol $data_path --task-name $task_name --valid-subset test \
       --results-path $results_path \
       --num-workers 8 --ddp-backend=c10d --batch-size $batch_size \
       --task mol_confG --loss mol_confG --arch mol_confG \
       --num-recycles $recycles \
       --path $weight_path \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
       --log-interval 50 --log-format simple 
zhoubay commented 1 year ago

Well, actually I'm trying to use your pretrained weights for other tasks, so I haven't dug very deep into your Uni-Core framework, which I think is a remarkable piece of work.

Regarding this issue: I've added a [MASK] token to dict.txt and see no difference in behavior.
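
In case it helps, a quick way to check that the two approaches are equivalent (a sketch; the edited file name is hypothetical, and it assumes Uni-Core's Dictionary exposes index() like fairseq's):

```python
# Sketch: the dictionary built from an edited dict.txt (with [MASK] appended)
# should match the original dict.txt plus add_symbol() in code.
from unicore.data import Dictionary

d_file = Dictionary.load("dict_with_mask.txt")   # hypothetical: [MASK] appended in the file
d_code = Dictionary.load("dict.txt")
d_code.add_symbol("[MASK]", is_special=True)     # [MASK] appended in code

assert len(d_file) == len(d_code)                        # both 31 symbols
assert d_file.index("[MASK]") == d_code.index("[MASK]")  # same embedding row
```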

By the way, instead of reading the source code directly, are there any resources for learning your framework?

ZhouGengmo commented 1 year ago

> By the way, instead of reading the source code directly, are there any resources for learning your framework?

Hope this helps. https://github.com/dptech-corp/Uni-Core#acknowledgement