Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Some weights of OtterForConditionalGeneration were not initialized from the model #270

Open xmc-andy opened 1 year ago

xmc-andy commented 1 year ago

Hello, I encountered this output when testing the weights I trained, and after spending a long time I still haven't found the cause. Can you help me? I previously used the official weights to train a baseline for classification on my own data. The results were not very good, but the message "Some weights of OtterForConditionalGeneration were not initialized, and are newly initialized" did not appear. This situation only occurred after I trained and tested another version of the model.

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:30<00:00, 7.62s/it] Some weights of OtterForConditionalGeneration were not initialized from the model checkpoint at /mnt/large_model/weights/BC4-partScale-negAug3 and are newly initialized: ['vision_encoder.vision_model.embeddings.position_ids'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Luodian commented 1 year ago

May I know your task type and which version of Otter model you are using for initialization?

xmc-andy commented 1 year ago

May I know your task type and which version of Otter model you are using for initialization?

I am doing a classification task, with multiple images and a single prompt as input, in SD dataset format, and the pre-training weights are "OTTER-Image-MPT7B".

xmc-andy commented 1 year ago

export PYTHONPATH=.

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
    pipeline/train/instruction_following.py \
    --pretrained_model_name_or_path /mnt/large_model/weights/OTTER-Image-MPT7B_git \
    --mimicit_vt_path /mnt/large_model/output/XX/SD_instruction.json \
    --images_vt_path /mnt/large_model/output/XX/SD.json \
    --external_save_dir /mnt/large_model/output/XX/OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --batch_size 1 \
    --num_epochs 15 \
    --run_name OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --workers 24 \
    --lr_scheduler cosine \
    --learning_rate 1e-5 \
    --max-src-length 256 \
    --warmup_steps_ratio 0.01 \
    --save_ckpt_each_epoch \
    --delete_previous_checkpoint \
    --report_to_wandb

Luodian commented 1 year ago

Does the missing-weights log appear when you load the model directly? You could set a breakpoint right after the loading process finishes.
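
For reference, a minimal way to check this programmatically (a sketch, assuming the class exposes the standard Hugging Face from_pretrained interface; the otter_ai import is the one suggested later in this thread, and the path is taken from the log above):

    from otter_ai import OtterForConditionalGeneration  # or the class from your local modeling_otter.py

    # output_loading_info returns the list of missing / unexpected keys alongside the model
    model, loading_info = OtterForConditionalGeneration.from_pretrained(
        "/mnt/large_model/weights/BC4-partScale-negAug3",
        output_loading_info=True,
    )
    print(loading_info["missing_keys"])  # the position_ids buffer should show up here if the warning reproduces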

xmc-andy commented 1 year ago

When loading the pre-trained weights you posted or the baseline weights I trained, there is no missing-weights log, but it does appear when loading the newly trained model weights. Sorry, I can't find where this log message comes from or why it appears.

xmc-andy commented 1 year ago

By the way, due to network problems I cannot download tokenizer_config.json for MPT from Hugging Face, so I downloaded everything except the bin files offline from https://huggingface.co/mosaicml/mpt-7b-instruct. The only code modified in modeling_otter.py is text_tokenizer = AutoTokenizer.from_pretrained("/mnt/train_pipeline-master/Otter/mpt-7b-instruct").

Luodian commented 1 year ago

Could you download the model's config from this path?

https://openxlab.org.cn/models/detail/YuanhanZhang/OTTER-Image-MPT7B

The config.json should be in the following format:

{
  "_commit_hash": null,
  "_name_or_path": "/mnt/petrelfs/zhangyuanhan/weights/flamingo-mpt-7B",
  "architectures": [
    "OtterForConditionalGeneration"
  ],
  "cross_attn_every_n_layers": 4,
  "model_type": "otter",
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "MPTForCausalLM"
    ],
    "attn_config": {
      "alibi": true,
      "alibi_bias_max": 8,
      "attn_impl": "torch",
      "attn_pdrop": 0,
      "attn_type": "multihead_attention",
      "attn_uses_sequence_id": false,
      "clip_qkv": null,
      "prefix_lm": false,
      "qk_ln": false,
      "softmax_scale": null
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "d_model": 4096,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "emb_pdrop": 0,
    "embedding_fraction": 1.0,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "expansion_ratio": 4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "init_config": {
      "emb_init_std": null,
      "emb_init_uniform_lim": null,
      "fan_mode": "fan_in",
      "init_div_is_residual": true,
      "init_gain": 0,
      "init_nonlinearity": "relu",
      "init_std": 0.02,
      "name": "kaiming_normal_",
      "verbose": 0
    },
    "init_device": "cpu",
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "learned_pos_emb": true,
    "length_penalty": 1.0,
    "logit_scale": null,
    "max_length": 20,
    "max_seq_len": 2048,
    "min_length": 0,
    "model_type": "mpt",
    "n_heads": 32,
    "n_layers": 32,
    "no_bias": true,
    "no_repeat_ngram_size": 0,
    "norm_type": "low_precision_layernorm",
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "resid_pdrop": 0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "tokenizer_name": "EleutherAI/gpt-neox-20b",
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.30.1",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": false,
    "verbose": 0,
    "vocab_size": 50432
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "use_media_placement_augmentation": true,
  "vision_config": {
    "_name_or_path": "openai/clip-vit-large-patch14",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 512,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.30.1",
    "typical_p": 1.0,
    "use_bfloat16": false
  }
}

And also, make sure you use the save_pretrained method to save checkpoints.

            unwrapped_model.save_pretrained(
                f"{args.external_save_dir}",
                is_main_process=accelerator.is_main_process,
                save_function=accelerator.save,
                state_dict=checkpoint_dict,
            )
Luodian commented 1 year ago

The missing position_ids usually comes from the LLM part. Also, make sure you are using the latest branch code when loading the Otter model at init.

And now you can try pip install -U otter_ai.

And then from otter_ai import OtterForConditionalGeneration.

That will automatically handle the loading of modeling_mpt.py.
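
A minimal usage sketch of that route (assuming the package exposes the standard Hugging Face from_pretrained interface; the checkpoint path is the one from the training command above):

    # pip install -U otter_ai
    from otter_ai import OtterForConditionalGeneration

    # importing from the package lets it resolve the custom modeling_mpt.py code automatically
    model = OtterForConditionalGeneration.from_pretrained(
        "/mnt/large_model/weights/OTTER-Image-MPT7B_git"
    )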

xmc-andy commented 1 year ago

I compared the config.json. Except for "_name_or_path" and "transformers_version", the rest is consistent with what you posted, so this should not be the problem. I previously converted the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then loaded the weights with from_pretrained. Could you tell me whether this is correct? I found that when converting the weights, using the config.json you posted and the config.json generated by training seems to give the same result. Is there a difference?

{ "_commit_hash": null, "_name_or_path": "/mnt/large_model/weights/OTTER-Image-MPT7B_git", "architectures": [ "OtterForConditionalGeneration" ], "cross_attn_every_n_layers": 4, "model_type": "otter", "text_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "MPTForCausalLM" ], "attn_config": { "alibi": true, "alibi_bias_max": 8, "attn_impl": "torch", "attn_pdrop": 0, "attn_type": "multihead_attention", "attn_uses_sequence_id": false, "clip_qkv": null, "prefix_lm": false, "qk_ln": false, "softmax_scale": null }, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "d_model": 4096, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "emb_pdrop": 0, "embedding_fraction": 1.0, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "expansion_ratio": 4, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "init_config": { "emb_init_std": null, "emb_init_uniform_lim": null, "fan_mode": "fan_in", "init_div_is_residual": true, "init_gain": 0, "init_nonlinearity": "relu", "init_std": 0.02, "name": "kaimingnormal", "verbose": 0 }, "init_device": "cpu", "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "learned_pos_emb": true, "length_penalty": 1.0, "logit_scale": null, "max_length": 20, "max_seq_len": 2048, "min_length": 0, "model_type": "mpt", "n_heads": 32, "n_layers": 32, "no_bias": true, "no_repeat_ngram_size": 0, "norm_type": "low_precision_layernorm", "num_beam_groups": 1, "num_beams": 1, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "resid_pdrop": 0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "tokenizer_name": "EleutherAI/gpt-neox-20b", "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.31.0", "typical_p": 1.0, "use_bfloat16": false, "use_cache": false, "verbose": 0, "vocab_size": 50432 }, "torch_dtype": "float32", "transformers_version": null, "use_media_placement_augmentation": true, "vision_config": { "_name_or_path": "openai/clip-vit-large-patch14", "add_cross_attention": false, "architectures": null, "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "quick_gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 224, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-05, "length_penalty": 
1.0, "max_length": 20, "min_length": 0, "model_type": "clip_vision_model", "no_repeat_ngram_size": 0, "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "projection_dim": 512, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": null, "torchscript": false, "transformers_version": "4.31.0", "typical_p": 1.0, "use_bfloat16": false } }

xmc-andy commented 1 year ago

I checked the save_pretrained part as you said. I'm using a version from about a month ago; the save code is as follows:

            unwrapped_model = accelerator.unwrap_model(model)
            checkpoint_dict = get_checkpoint(model=unwrapped_model)
            accelerator.save(
                checkpoint_dict,
                f"{args.external_save_dir}/final_weights.pt",
            )

            # save the config
            unwrapped_model.config.save_pretrained(args.external_save_dir)

I am not sure if this is part of the reason. I will try to use the latest branch to train new weights and see if the problem is solved. In addition, I will also try pip install -U otter_ai and from otter_ai import OtterForConditionalGeneration. Thank you very much!

Luodian commented 1 year ago

I would suggest using the save_pretrained method (it's a function from Hugging Face Transformers). It directly dumps everything in your currently trained model to a path.

You can also load it using OtterForConditionalGeneration.from_pretrained("path").

This process is safer and won't cause the missing-weights problem.
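
For example, a sketch of the suggested round trip inside the training script (it reuses the accelerator/args variables from the earlier snippet; the "final_hf" subdirectory name is a placeholder):

    # save weights and config together in Hugging Face format
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        f"{args.external_save_dir}/final_hf",
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )

    # later, at inference time, reload without any conversion step
    model = OtterForConditionalGeneration.from_pretrained(f"{args.external_save_dir}/final_hf")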

xmc-andy commented 1 year ago

I would suggest using the save_pretrained method (it's a function from Hugging Face Transformers). It directly dumps everything in your currently trained model to a path.

You can also load it using OtterForConditionalGeneration.from_pretrained("path").

This process is safer and won't cause the missing-weights problem.

Got it, I'll take your advice and try it.

xmc-andy commented 1 year ago

Hey, I previously converted the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then loaded the weights with from_pretrained. Could you tell me whether this is correct? I found that when converting the weights, using the config.json you posted and the config.json generated by training seems to give the same result. Is there a difference?

Luodian commented 1 year ago

It could be correct, if you confirm that the config.json is the same.

xmc-andy commented 1 year ago

"transformers_version"

Got it. The generated config.json differs from what you posted only in "_name_or_path" and "transformers_version".

xmc-andy commented 1 year ago

It could be correct, if you confirm that the config.json is the same.

Thank you very much for your careful answer. I have solved this bug. The cause is that the "_name_or_path" in the generated config.json is derived from the "pretrained_model_name_or_path" argument used during training, but at inference time "_name_or_path" seems to require the "flamingo" field, so using the config.json you posted instead of the generated one works.

xmc-andy commented 1 year ago

Hey, I would like to ask: I am doing a binary classification task with a single prompt and multiple images as input, but the results do not seem very good. Do you have any ideas for possible improvements? Currently I plan to try unfreezing the vision encoder. I hope you can share your suggestions.

Luodian commented 1 year ago

If you are working with multiple images as input, you could first try arranging them into the F dimension of vision_x.

For this model's training, I suggest adding max_num_frames=N, where N is your maximum number of input images.

You can still initialize from the Image model; with the max_num_frames variable set, the model turns into a Video model. You can see this from the training log when the model is initialized.

This is like treating your input images as a video sequence.
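
A rough sketch of packing N images into the F dimension (image_processor and images are placeholders for your own CLIP preprocessing and image list; only the tensor shapes matter here):

    import torch

    # preprocess each image to (1, C, H, W), then stack along the frame axis
    pixels = [image_processor(images=img, return_tensors="pt")["pixel_values"] for img in images]
    frames = torch.cat(pixels, dim=0)            # (F, C, H, W), F = number of images
    vision_x = frames.unsqueeze(0).unsqueeze(0)  # (B=1, T=1, F, C, H, W)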

Luodian commented 1 year ago

Also, another way is to put the images in the in_context dim, which is the dimension right before F.

vision_x has the dimensions B, T, F, C, H, W.

If you do so, you won't need to add the above-mentioned max_num_frames=N.
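
Reusing the placeholder preprocessing from the previous sketch, the in-context arrangement would instead be:

    # frames: (num_images, C, H, W), built as in the previous sketch
    vision_x = frames.unsqueeze(1).unsqueeze(0)  # (B=1, T=num_images, F=1, C, H, W)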

Luodian commented 1 year ago

Or you could use the --customized_config arg in instruction_following.py to dynamically load a new config.json file (this operation will overwrite the model's config.json).

Inside this customized config, you can choose whether to set max_num_frames=N.

Luodian commented 1 year ago
[image attachment]
xmc-andy commented 1 year ago

Thanks for sharing. I am now training Otter to treat multiple images as a video. Since the number of images varies, I am currently using batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size and see if it improves the results.

iz2late commented 8 months ago

Thanks for sharing. I am now training Otter to treat multiple images as a video. Since the number of images varies, I am currently using batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size and see if it improves the results.

Any results from your multi-image input experiments? I'm planning to do something similar and am wondering if you have any insights on which approach is better.