GuyTevet / motion-diffusion-model

The official PyTorch implementation of the paper "Human Motion Diffusion Model"
MIT License

The error in fp16, using bert as text encoder #140

Closed CrazyLearner98 closed 1 year ago

CrazyLearner98 commented 1 year ago

When I swap CLIP for a BERT text encoder, I only changed two places in mdm.py, because the code is quite complex:

```python
def load_and_freeze_clip(self, clip_version):
    # Load a frozen BERT in place of CLIP.
    bert_model = BertModel.from_pretrained('bert-base-uncased')
    bert_model.eval()
    for p in bert_model.parameters():
        p.requires_grad = False
    return bert_model

def encode_text(self, raw_text):
    encoded_text = tokenizer(raw_text, padding='max_length',
                             max_length=default_context_len,
                             truncation=True, return_tensors='pt').to(device)
    # Project BERT's 768-dim output down to the 512 dims MDM expects.
    self.projection_layer = nn.Linear(768, 512).to(device)
    bert_outputs = self.clip_model(**encoded_text).last_hidden_state.mean(dim=1)
    final_outputs = self.projection_layer(bert_outputs).half()
    print("bert", final_outputs.shape)
    print(final_outputs)
    return final_outputs
```

(The setup for `tokenizer`, `device`, and `default_context_len` is sketched at the end of this comment.) I know I haven't added any HumanML3D-specific handling yet, since I first want to make sure MDM trains successfully before tuning. Now it runs up to step 0 and prints something like:

```
Loading CLIP...
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
```

I am really confused about this error and don't know how to solve it. Can someone help me with this problem? I would really appreciate it!
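For completeness, `tokenizer`, `device`, and `default_context_len` used in `encode_text` above are defined near the top of my mdm.py; roughly like this (a sketch, and the exact names and values are my own choices):

```python
# Rough sketch of the setup encode_text relies on (my own names/values):
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
default_context_len = 77  # chosen to match CLIP's context length
```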

CrazyLearner98 commented 1 year ago

Now, while debugging to find where the error comes from, I found that `len(list(self.model.parameters()))` changed from 305 to 307 after line 233 in training_loop.py: `losses = compute_losses()`
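To sanity-check that number: a bare `nn.Linear` registers exactly two parameter tensors (weight and bias), which is the same size as the jump I'm seeing:

```python
from torch import nn

layer = nn.Linear(768, 512)
# Two parameter tensors: weight [512, 768] and bias [512]
print(len(list(layer.parameters())))  # -> 2
```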

I also don't understand what this functools.partial call is actually running in the code:

```python
compute_losses = functools.partial(
    self.diffusion.training_losses,
    self.ddp_model,
    micro,  # [bs, ch, image_size, image_size]
    t,      # bs sampled timesteps
    model_kwargs=micro_cond,
    dataset=self.data.dataset,
)
```

Could someone help me?
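If I read the Python docs right, `functools.partial` just pre-binds arguments and returns a new callable, so `compute_losses()` should end up calling `self.diffusion.training_losses(self.ddp_model, micro, t, model_kwargs=micro_cond, dataset=self.data.dataset)`. A toy example of the mechanism (stand-in names, not the real MDM code):

```python
import functools

def training_losses(model, batch, t, model_kwargs=None, dataset=None):
    # Stand-in for diffusion.training_losses: just report what it received.
    return {'model': model, 't': t, 'dataset': dataset}

# Bind all the arguments now...
compute_losses = functools.partial(
    training_losses, 'ddp_model', 'micro_batch', 'timesteps',
    model_kwargs={'text': 'a person walks'}, dataset='humanml')

# ...and call with no arguments later, like training_loop.py does.
losses = compute_losses()
print(losses)  # {'model': 'ddp_model', 't': 'timesteps', 'dataset': 'humanml'}
```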