xmc-andy opened this issue 1 year ago
May I know your task type and which version of the Otter model you are using for initialization?
I am doing a classification task with multiple images and a single prompt as input, in the SD dataset format. The pre-trained weights are "OTTER-Image-MPT7B".
export PYTHONPATH=.
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
    pipeline/train/instruction_following.py \
    --pretrained_model_name_or_path /mnt/large_model/weights/OTTER-Image-MPT7B_git \
    --mimicit_vt_path /mnt/large_model/output/XX/SD_instruction.json \
    --images_vt_path /mnt/large_model/output/XX/SD.json \
    --external_save_dir /mnt/large_model/output/XX/OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --batch_size 1 \
    --num_epochs 15 \
    --run_name OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --workers 24 \
    --lr_scheduler cosine \
    --learning_rate 1e-5 \
    --max-src-length 256 \
    --warmup_steps_ratio 0.01 \
    --save_ckpt_each_epoch \
    --delete_previous_checkpoint \
    --report_to_wandb
Does the missing-weights log appear when you directly load the model? You could set a breakpoint right after the loading process finishes.
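That warning is emitted by Transformers' modeling_utils during from_pretrained. A minimal sketch for making the loading messages easier to spot (this is purely standard transformers logging, nothing Otter-specific):

import transformers

# Raise the Transformers log level so all weight-loading messages
# (including "Some weights ... were not initialized ...") are printed,
# which makes it easier to see where the warning comes from.
transformers.logging.set_verbosity_info()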
When I load the pre-trained weights you posted, or the baseline weights I trained earlier, there is no missing-weights log, but it does appear when I load the newly trained model weights. Sorry, I can't find the location or the reason for that log message.
By the way, due to network problems I cannot download tokenizer_config.json from Hugging Face's MPT, so I downloaded https://huggingface.co/mosaicml/mpt-7b-instruct offline (everything except the bin files), and in modeling_otter.py I changed the code to text_tokenizer = AutoTokenizer.from_pretrained("/mnt/train_pipeline-master/Otter/mpt-7b-instruct").
Could you download the model's config from this path?
https://openxlab.org.cn/models/detail/YuanhanZhang/OTTER-Image-MPT7B
The config.json should be in the following format:
{
"_commit_hash": null,
"_name_or_path": "/mnt/petrelfs/zhangyuanhan/weights/flamingo-mpt-7B",
"architectures": [
"OtterForConditionalGeneration"
],
"cross_attn_every_n_layers": 4,
"model_type": "otter",
"text_config": {
"_name_or_path": "",
"add_cross_attention": false,
"architectures": [
"MPTForCausalLM"
],
"attn_config": {
"alibi": true,
"alibi_bias_max": 8,
"attn_impl": "torch",
"attn_pdrop": 0,
"attn_type": "multihead_attention",
"attn_uses_sequence_id": false,
"clip_qkv": null,
"prefix_lm": false,
"qk_ln": false,
"softmax_scale": null
},
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"d_model": 4096,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"emb_pdrop": 0,
"embedding_fraction": 1.0,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"expansion_ratio": 4,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_size": 4096,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"init_config": {
"emb_init_std": null,
"emb_init_uniform_lim": null,
"fan_mode": "fan_in",
"init_div_is_residual": true,
"init_gain": 0,
"init_nonlinearity": "relu",
"init_std": 0.02,
"name": "kaiming_normal_",
"verbose": 0
},
"init_device": "cpu",
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"learned_pos_emb": true,
"length_penalty": 1.0,
"logit_scale": null,
"max_length": 20,
"max_seq_len": 2048,
"min_length": 0,
"model_type": "mpt",
"n_heads": 32,
"n_layers": 32,
"no_bias": true,
"no_repeat_ngram_size": 0,
"norm_type": "low_precision_layernorm",
"num_beam_groups": 1,
"num_beams": 1,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"resid_pdrop": 0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"tokenizer_name": "EleutherAI/gpt-neox-20b",
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "bfloat16",
"torchscript": false,
"transformers_version": "4.30.1",
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": false,
"verbose": 0,
"vocab_size": 50432
},
"torch_dtype": "float32",
"transformers_version": null,
"use_media_placement_augmentation": true,
"vision_config": {
"_name_or_path": "openai/clip-vit-large-patch14",
"add_cross_attention": false,
"architectures": null,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "quick_gelu",
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"image_size": 224,
"initializer_factor": 1.0,
"initializer_range": 0.02,
"intermediate_size": 4096,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-05,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "clip_vision_model",
"no_repeat_ngram_size": 0,
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_channels": 3,
"num_hidden_layers": 24,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"problem_type": null,
"projection_dim": 512,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"transformers_version": "4.30.1",
"typical_p": 1.0,
"use_bfloat16": false
}
}
Also, make sure you use the save_pretrained method to save checkpoints:
unwrapped_model.save_pretrained(
    f"{args.external_save_dir}",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=checkpoint_dict,
)
The missing position_ids usually comes from the LLM part. Also make sure you are using the latest branch code when initializing the Otter model.
You can now also try pip install -U otter_ai and then from otter_ai import OtterForConditionalGeneration. That will automatically handle the loading of modeling_mpt.py.
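For example, a minimal loading sketch (the checkpoint path is a placeholder; output_loading_info is standard Hugging Face from_pretrained behavior and reports exactly which keys were missing or unexpected):

# Sketch: load an Otter checkpoint and print which weights were not found in it.
# Replace the path with your own checkpoint directory.
from otter_ai import OtterForConditionalGeneration

model, loading_info = OtterForConditionalGeneration.from_pretrained(
    "/path/to/your/otter/checkpoint",  # placeholder path
    output_loading_info=True,
)
print("missing keys:", loading_info["missing_keys"])
print("unexpected keys:", loading_info["unexpected_keys"])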
I compared the config.json. Except for "_name_or_path" and "transformers_version", everything is consistent with what you posted, so this should not be the problem. I converted the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then loaded the weights with from_pretrained. Could you tell me if this is correct? When converting the weights, using the config.json you posted and the config.json generated by training seem to give the same result. Is there a difference?
{ "_commit_hash": null, "_name_or_path": "/mnt/large_model/weights/OTTER-Image-MPT7B_git", "architectures": [ "OtterForConditionalGeneration" ], "cross_attn_every_n_layers": 4, "model_type": "otter", "text_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "MPTForCausalLM" ], "attn_config": { "alibi": true, "alibi_bias_max": 8, "attn_impl": "torch", "attn_pdrop": 0, "attn_type": "multihead_attention", "attn_uses_sequence_id": false, "clip_qkv": null, "prefix_lm": false, "qk_ln": false, "softmax_scale": null }, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "d_model": 4096, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "emb_pdrop": 0, "embedding_fraction": 1.0, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "expansion_ratio": 4, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "init_config": { "emb_init_std": null, "emb_init_uniform_lim": null, "fan_mode": "fan_in", "init_div_is_residual": true, "init_gain": 0, "init_nonlinearity": "relu", "init_std": 0.02, "name": "kaimingnormal", "verbose": 0 }, "init_device": "cpu", "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "learned_pos_emb": true, "length_penalty": 1.0, "logit_scale": null, "max_length": 20, "max_seq_len": 2048, "min_length": 0, "model_type": "mpt", "n_heads": 32, "n_layers": 32, "no_bias": true, "no_repeat_ngram_size": 0, "norm_type": "low_precision_layernorm", "num_beam_groups": 1, "num_beams": 1, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "resid_pdrop": 0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "tokenizer_name": "EleutherAI/gpt-neox-20b", "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.31.0", "typical_p": 1.0, "use_bfloat16": false, "use_cache": false, "verbose": 0, "vocab_size": 50432 }, "torch_dtype": "float32", "transformers_version": null, "use_media_placement_augmentation": true, "vision_config": { "_name_or_path": "openai/clip-vit-large-patch14", "add_cross_attention": false, "architectures": null, "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "quick_gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 224, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-05, "length_penalty": 
1.0, "max_length": 20, "min_length": 0, "model_type": "clip_vision_model", "no_repeat_ngram_size": 0, "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "projection_dim": 512, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": null, "torchscript": false, "transformers_version": "4.31.0", "typical_p": 1.0, "use_bfloat16": false } }
I checked the save_pretrained part as you said. I'm using a version from about a month ago; the save code is as follows:
unwrapped_model = accelerator.unwrap_model(model)
checkpoint_dict = get_checkpoint(model=unwrapped_model)
accelerator.save(
    checkpoint_dict,
    f"{args.external_save_dir}/final_weights.pt",
)
unwrapped_model.config.save_pretrained(args.external_save_dir)
I am not sure if this is part of the reason. I will try training new weights on the latest branch to see if the problem is solved. In addition, I will also try pip install -U otter_ai and from otter_ai import OtterForConditionalGeneration. Thank you very much!
I would suggest you use the save_pretrained method (it's a function of Hugging Face Transformers). This method directly dumps everything from your current trained model to a path. You can then load it with OtterForConditionalGeneration.from_pretrained("path"). This process is safer and won't cause the missing-weights problem.
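A minimal end-to-end sketch of that flow, reusing the variable names from the training snippet quoted above (the save directory is whatever you pass as external_save_dir):

# Sketch: save with save_pretrained during training, reload with from_pretrained
# for inference. Variable names (accelerator, model, args) follow the training
# snippet above; nothing here is Otter-specific beyond the class name.
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    args.external_save_dir,  # writes config.json plus the weight files
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)

# Later, for evaluation/inference:
from otter_ai import OtterForConditionalGeneration
model = OtterForConditionalGeneration.from_pretrained(args.external_save_dir)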
Got it, I'll take your advice and try it.
Hey, I converted the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then loaded the weights with from_pretrained. Could you tell me if this is correct? When converting the weights, using the config.json you posted and the config.json generated by training seem to give the same result. Is there a difference?
It should be right, as long as you confirm that the config.json is the same.
"transformers_version"
Got it. The generated config.json only differs from what you posted in "_name_or_path" and "transformers_version".
Thank you very much for your careful answer. I have solved this bug. The cause was that the "_name_or_path" in the generated config.json is derived from the pretrained_model_name_or_path argument used during training, but at inference time "_name_or_path" apparently needs to contain the "flamingo" field, so using the config.json you posted instead of the generated one works.
Hey, I would like to ask you something else: I am doing a binary classification task with a single prompt and multiple images as input, but the results do not seem to be very good. Do you have any ideas for possible improvements? Currently I plan to try unfreezing the visual encoder. I would appreciate your suggestions.
If you are working with multiple images as input, you could first try to arrange them into the F dimension of vision_x.
For this training setup, I suggest you add max_num_frames=N, where N is your maximum number of input images.
You can still initialize from the Image model; with the max_num_frames variable set, the model turns into a Video model. You can see that in the training log when the model is initialized.
This is like treating your input images as a video sequence.
Alternatively, you can put the images in the in_context dimension, which is the dimension right before F: vision_x has shape B, T, F, C, H, W.
If you do that, you won't need the max_num_frames=N mentioned above; see the sketch below for both arrangements.
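A minimal sketch of both arrangements, assuming the images are already preprocessed into (C, H, W) tensors (the tensor names here are illustrative, not from the training code):

import torch

# N preprocessed images, each of shape (C, H, W) -- illustrative placeholders.
images = [torch.randn(3, 224, 224) for _ in range(4)]
stacked = torch.stack(images)                      # (N, C, H, W)

# Option 1: frames dimension F, i.e. treat the images as one video clip.
# vision_x shape (B, T, F, C, H, W) = (1, 1, N, C, H, W)
vision_x_frames = stacked.unsqueeze(0).unsqueeze(0)

# Option 2: in-context dimension T (the dim right before F).
# vision_x shape (B, T, F, C, H, W) = (1, N, 1, C, H, W)
vision_x_in_context = stacked.unsqueeze(1).unsqueeze(0)

print(vision_x_frames.shape, vision_x_in_context.shape)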
Or you could use the --customized_config arg of instruction_following.py to dynamically load a new config.json file (this operation overwrites the model's config.json). Inside this customized config you can choose whether to set max_num_frames=N.
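For instance, a sketch of building such a customized config from the released config.json (paths are placeholders, and placing max_num_frames at the top level of the config is an assumption to verify against OtterConfig):

import json

# Sketch: start from the released config.json and add max_num_frames.
# Paths are placeholders; whether max_num_frames lives at the top level of
# the Otter config is an assumption -- double-check against the model code.
with open("/path/to/OTTER-Image-MPT7B/config.json") as f:
    config = json.load(f)

config["max_num_frames"] = 8  # N = your maximum number of input images

with open("/path/to/customized_config.json", "w") as f:
    json.dump(config, f, indent=2)

You would then pass the resulting file via --customized_config when launching instruction_following.py.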
Thanks for sharing. I am now training Otter to treat the multiple pictures as a video. Since the number of pictures is variable, I currently use batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size, and see whether it improves the results.
Any results from your multi-image input experiments? I'm planning to do similar things and was wondering if you have any insight into which approach is better.
Hello, I encountered the following output when testing the trained weights. I spent a long time trying to find the reason, but unfortunately I haven't found the cause of this problem yet. Can you help me? I once used the official weights to train a baseline on my own classification data; the results were not very good, but the message "Some weights of OtterForConditionalGeneration were not initialized, and are newly initialized" did not appear. It only appeared after I trained and tested another version of the model.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:30<00:00, 7.62s/it]
Some weights of OtterForConditionalGeneration were not initialized from the model checkpoint at /mnt/large_model/weights/BC4-partScale-negAug3 and are newly initialized: ['vision_encoder.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.