LilDevsy0117 opened this issue 3 months ago
Hi @LilDevsy0117, our model is based on TinyLlava, which uses Phi-2 as the LLM. Phi-2 uses a byte-level BPE tokenizer, so it can naturally be applied to other languages, including Korean, after fine-tuning on your data. However, we cannot guarantee its performance, which depends on how well Phi-2 supports Korean. If you want to use another LLM, try writing a class like llava_phi that lets your LLM receive image features from the ViT. Note that you may want to pre-train the projector and LLM on general image-text datasets to get better performance, just as TinyLlava does.
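For a rough idea of what such a class needs to do, here is a minimal hypothetical sketch: the ViT patch features are projected into the LLM's embedding space and spliced in front of the text token embeddings, LLaVA-style. The class name and the dimensions (1152 for a SigLIP-style ViT, 2560 for a Phi-2-sized LLM) are illustrative assumptions, not TinyChart's actual API:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an "llava_synatra"-style connector; names and
# dimensions are illustrative, not TinyChart's real classes.
class VisionToLLMConnector(nn.Module):
    """Projects ViT patch features into the LLM embedding space and
    prepends them to the text token embeddings (LLaVA-style)."""
    def __init__(self, vit_dim=1152, llm_dim=2560):
        super().__init__()
        # Two-layer MLP projector, as in LLaVA-1.5-style connectors
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features, text_embeds):
        # image_features: (batch, num_patches, vit_dim)
        # text_embeds:    (batch, seq_len, llm_dim)
        image_embeds = self.projector(image_features)
        # Concatenate image tokens before the text tokens
        return torch.cat([image_embeds, text_embeds], dim=1)

connector = VisionToLLMConnector()
image_features = torch.randn(1, 576, 1152)  # (batch, patches, vit_dim)
text_embeds = torch.randn(1, 32, 2560)      # (batch, seq_len, llm_dim)
fused = connector(image_features, text_embeds)
print(fused.shape)  # torch.Size([1, 608, 2560])
```

The pre-training stage mentioned above would then fit this projector (and optionally the LLM) on general image-text pairs before fine-tuning on chart data.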
Thanks @zhangliang-04. I am trying the following.
In `train.sh`:

```shell
LLM_PATH=tabtoyou/KoLLaVA-v1.5-Synatra-7b  # replaced
VIT_PATH=mPLUG/TinyChart-3B-768-siglip
```
I also wrote `llava_synatra.py` and modified `train.py`:
```python
config = LlavaConfig.from_pretrained(model_args.model_name_or_path)
model = LlavaLlamaForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    **bnb_model_from_pretrained_args,
    attn_implementation=None,
    torch_dtype=compute_dtype,
)
```
```python
Tokenizer, init_tokenizer = TokenizerSelect('synatra')()
tokenizer = Tokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side="right",
    use_fast=True,
)
tokenizer = init_tokenizer(tokenizer)
```
However, I encountered the following warnings:
```
You are using a model of type llava to instantiate a model of type tiny_chart_synatra. This is not supported for all configurations of models and can yield errors.
WARNING:tinychart.model.multimodal_encoder.siglip_encoder:You are using a model of type clip to instantiate a model of type siglip_vision_model. This is not supported for all configurations of models and can yield errors.
WARNING: tokenization mismatch: 203 vs. 210. (ignored)
number of rounds: 1
rounds: ["A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
```
I want to change the tokenizer so that the model can handle Korean. I would appreciate it if you could confirm the LLM_PATH change above and, additionally, let me know which parts of the code should be modified.