X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Apache License 2.0
1.34k stars 87 forks

How can I replace the tokenizer (TinyChart)? #94

Open LilDevsy0117 opened 3 months ago

LilDevsy0117 commented 3 months ago

I want to replace the tokenizer so that the model can be applied to Korean.

I plan to change LLM_PATH; I would appreciate it if you could let me know which parts of the code should be modified.

zhangliang-04 commented 3 months ago

Hi @LilDevsy0117, our model is based on TinyLlava, which uses Phi-2 as the LLM. Phi-2's tokenizer uses byte-level BPE, so it can naturally be applied to other languages, including Korean, after fine-tuning on your data. However, we cannot guarantee its performance, which depends on how well Phi-2 supports Korean. If you want to use another LLM, write a class like `llava_phi` that adapts your LLM to receive image features from the ViT. Note that you may need to pre-train the projector and the LLM on general image-text datasets to get better performance, just as TinyLlava does.
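To make the "receive image features from ViT" part concrete, here is a minimal pure-Python sketch of the two pieces such a `llava_phi`-style class provides (function names like `project` and `splice_image_embeds` are illustrative, not TinyChart's actual API): a projector maps ViT patch features into the LLM's embedding space, and the projected patch embeddings are spliced into the text embedding sequence where the image placeholder token sits.

```python
# Illustrative sketch only -- names are hypothetical, dimensions are toy-sized.

def project(features, weight):
    """Linear projector: (num_patches x vit_dim) @ (vit_dim x llm_dim)."""
    return [
        [sum(f * w for f, w in zip(patch, col)) for col in zip(*weight)]
        for patch in features
    ]

def splice_image_embeds(text_embeds, image_embeds, image_token_pos):
    """Replace the single image-placeholder embedding with all patch embeddings."""
    return (text_embeds[:image_token_pos]
            + image_embeds
            + text_embeds[image_token_pos + 1:])

# Toy dimensions: 2 ViT patches of dim 3 projected to LLM dim 2.
vit_feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # vit_dim x llm_dim
img = project(vit_feats, w)                   # -> [[3.0, 2.0], [1.0, 2.0]]
text = [[9.0, 9.0], [0.0, 0.0], [8.0, 8.0]]   # placeholder at index 1
seq = splice_image_embeds(text, img, 1)       # 3 text slots -> 4 total embeddings
```

In the real model these are tensor operations (an `nn.Linear`/MLP projector over the ViT output, then concatenation inside the forward pass), but the data flow is the same.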

LilDevsy0117 commented 3 months ago

Thanks @zhangliang-04, I am trying as below.

In `train.sh`:

```shell
LLM_PATH=tabtoyou/KoLLaVA-v1.5-Synatra-7b  # replaced
VIT_PATH=mPLUG/TinyChart-3B-768-siglip
```

I wrote `llava_synatra.py`,

and modified `train.py`:

```python
config = LlavaConfig.from_pretrained(model_args.model_name_or_path)
model = LlavaLlamaForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    **bnb_model_from_pretrained_args,
    attn_implementation=None,
    torch_dtype=compute_dtype
)
```

```python
Tokenizer, init_tokenizer = TokenizerSelect('synatra')()
tokenizer = Tokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side="right",
    use_fast=True,
)
tokenizer = init_tokenizer(tokenizer)
```
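One thing to check when swapping in a tokenizer with a different vocabulary: the LLM's input embedding table and LM head must match the new vocab size (in transformers this is `model.resize_token_embeddings(len(tokenizer))`). A toy pure-Python sketch of what that resize does, as a sanity check of the idea rather than the library's implementation:

```python
# Toy illustration of resizing an embedding table to a new vocab size:
# overlapping rows are kept, new rows are initialized (here to zeros;
# transformers initializes them from the model's init scheme instead).

def resize_embeddings(table, new_vocab, dim):
    resized = [row[:] for row in table[:new_vocab]]              # keep overlap
    resized += [[0.0] * dim for _ in range(new_vocab - len(resized))]
    return resized

old = [[0.1, 0.2], [0.3, 0.4]]         # vocab 2, dim 2
new = resize_embeddings(old, 4, 2)     # grow to vocab 4
```

If the checkpoint's embedding rows and the tokenizer's vocab size silently disagree, training tends to fail with index errors or garbage loss rather than a clear message, so this is worth verifying before launching a run.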

However, I encountered the following error.

```
You are using a model of type llava to instantiate a model of type tiny_chart_synatra. This is not supported for all configurations of models and can yield errors.
WARNING:tinychart.model.multimodal_encoder.siglip_encoder:You are using a model of type clip to instantiate a model of type siglip_vision_model. This is not supported for all configurations of models and can yield errors.
```
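The first warning is transformers reporting that the checkpoint's `config.json` still declares `"model_type": "llava"` while the class loading it registers itself as `tiny_chart_synatra`. It is often benign when the remaining config fields are compatible, but to silence it the saved config's `model_type` has to match the registered name. A sketch showing only the changed field (the rest of the original llava config stays as-is):

```json
{
  "model_type": "tiny_chart_synatra"
}
```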

```
WARNING: tokenization mismatch: 203 vs. 210. (ignored)
number of rounds: 1
rounds: ["A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nGenerate underlying data table of the chart. ASSISTANT: TITLE | 기술 및 기능부문에서 장비 규격 유지관리의 평가방법은 무엇인가 \n | 일반부문 및 안개요(20점) | 장비 기능 요구사항 \n 0 | 40 | 18 \n 1 | 31 | 2 \n 2 | 8 | 4 \n 3 | 20 | 9 \n 4 | 14 | 34 \n 5 | 20 | 41 \n 6 | 20 | 7"]
conversation: ["A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nGenerate underlying data table of the chart. ASSISTANT: TITLE | 기술 및 기능부문에 서 장비 규격 유지관리의 평가방법은 무엇인가 \n | 일반부문 및 안개요(20점) | 장비 기능 요구사항 \n 0 | 40 | 18 \n 1 | 31 | 2 \n 2 | 8 | 4 \n 3 | 20 | 9 \n 4 | 14 | 34 \n 5 | 20 | 41 \n 6 | 20 | 7<|endoftext|>"]
tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          320,  1153,  1180,   342, 28705, 29164,   239,   139,   163, 28705,
        31262, 28705, 29164, 30364, 29775, 29710, 29148, 29305, 28705, 29747,
        29859, 28705, 31982, 31110, 28705, 30127, 29161, 30224, 29288, 29187,
        28705, 31523, 29135, 30240, 30979, 29538, 28705, 30449,   239,   154,
          138, 29324, 29135, 28705,    13, 28705,   342, 28705, 29415, 30192,
        29775, 29710, 28705, 31262, 28705, 30325, 29893, 29517, 28732, 28750,
        28734, 30589, 28731,   342, 28705, 29747, 29859, 28705, 29164, 30364,
        28705, 29517, 29779, 29315, 30968, 28705,    13, 28705, 28734,   342,
        28705, 28781, 28734,   342, 28705, 28740, 28783, 28705,    13, 28705,
        28740,   342, 28705, 28770, 28740,   342, 28705, 28750, 28705,    13,
        28705, 28750,   342, 28705, 28783,   342, 28705, 28781, 28705,    13,
        28705, 28770,   342, 28705, 28750, 28734,   342, 28705, 28774, 28705,
           13, 28705, 28781,   342, 28705, 28740, 28781,   342, 28705, 28770,
        28781, 28705,    13, 28705, 28782,   342, 28705, 28750, 28734,   342,
        28705, 28781, 28740, 28705,    13, 28705, 28784,   342, 28705, 28750,
        28734,   342, 28705, 28787, 28789, 28766,   416,  1009,   772, 28766,
        28767])
tensor([[    1,   330, 10706,  1444,   264, 13903,  2188,   304,   396, 18278,
         10895, 13892, 28723,   415, 13892,  5212, 10865, 28725, 10537, 28725,
           304, 27057, 11194,   298,   272,  2188, 28742, 28713,  4224, 28723,
          2223,   725, 28747, 28705,  -200, 28705,    13, 23342, 14164,  1178,
          2401,   302,   272, 10968, 28723,  8602,  8048, 12738, 28747,   320,
          1153,  1180,   342, 28705, 29164,   239,   139,   163, 28705, 31262,
         28705, 29164, 30364, 29775, 29710, 29148, 29305, 28705, 29747, 29859,
         28705, 31982, 31110, 28705, 30127, 29161, 30224, 29288, 29187, 28705,
         31523, 29135, 30240, 30979, 29538, 28705, 30449,   239,   154,   138,
         29324, 29135, 28705,    13, 28705,   342, 28705, 29415, 30192, 29775,
         29710, 28705, 31262, 28705, 30325, 29893, 29517, 28732, 28750, 28734,
         30589, 28731,   342, 28705, 29747, 29859, 28705, 29164, 30364, 28705,
         29517, 29779, 29315, 30968, 28705,    13, 28705, 28734,   342, 28705,
         28781, 28734,   342, 28705, 28740, 28783, 28705,    13, 28705, 28740,
           342, 28705, 28770, 28740,   342, 28705, 28750, 28705,    13, 28705,
         28750,   342, 28705, 28783,   342, 28705, 28781, 28705,    13, 28705,
         28770,   342, 28705, 28750, 28734,   342, 28705, 28774, 28705,    13,
         28705, 28781,   342, 28705, 28740, 28781,   342, 28705, 28770, 28781,
         28705,    13, 28705, 28782,   342, 28705, 28750, 28734,   342, 28705,
         28781, 28740, 28705,    13, 28705, 28784,   342, 28705, 28750, 28734,
           342, 28705, 28787, 28789, 28766,   416,  1009,   772, 28766, 28767]])
```
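The "tokenization mismatch: 203 vs. 210" warning comes from the length bookkeeping in the preprocessing code: it tokenizes each conversation round separately to compute the target mask lengths, then compares the sum against the tokenization of the full conversation. With a SentencePiece/Llama-style tokenizer (as in the Synatra model), tokenizing segments separately does not generally add up to tokenizing the whole string — plausibly here because of BOS and prefix-space handling (the repeated token 28705 in the dump above is likely the SentencePiece `▁` piece), which the preprocess path written for Phi-2's byte-level BPE does not account for. A toy greedy tokenizer shows the generic mechanism:

```python
# Toy longest-match tokenizer showing why per-segment token counts need not
# add up to the count of the whole string: a merge can cross the segment
# boundary. Real SentencePiece tokenizers additionally insert BOS and
# prefix-space tokens, with the same net effect on the length check.

VOCAB = ["ab", "a", "b", "c"]   # longest-match-first order

def tokenize(text):
    out, i = [], 0
    while i < len(text):
        for tok in VOCAB:
            if text.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token at position {i}")
    return out

whole = tokenize("cab")                  # ['c', 'ab']       -> 2 tokens
parts = tokenize("ca") + tokenize("b")   # ['c', 'a', 'b']   -> 3 tokens
mismatch = len(parts) != len(whole)      # True: the length check would fire
```

In practice this means the round-length computation in the preprocess function has to match the replacement tokenizer's conventions; upstream LLaVA keeps separate preprocess variants (e.g. `preprocess_v1`, `preprocess_llama_2`) for exactly this reason, and a Synatra/Llama tokenizer likely needs one of those code paths rather than the Phi-2 one.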