neverstoplearn closed this issue 1 year ago
{
  "image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"],
  "text": "以下是一个好奇的人类和人工智能助手之间的对话。助理对用户的问题提供有用、详细且礼貌的回答。\nHuman: <image>\nHuman: 这张图片里有哪些违规事件类型?\nAI: 这张图片的违规事件类型有道路不洁",
  "task_type": "gpt4instruct_sft"
}
I also want to know: is this format correct?
You can use BloomTokenizerFast directly instead of MplugOwlTokenizer, and your data format is correct.
Hello! Have you solved the problem?
If I use BloomTokenizerFast directly, do I have to change any code other than line 141 of train.py:
tokenizer = BloomTokenizerFast.from_pretrained(args.pretrained_ckpt)
# tokenizer = MplugOwlTokenizer.from_pretrained(args.pretrained_ckpt)
However, after changing the code, I got the following error:
Traceback (most recent call last):
File "/workdir/./pipeline/train.py", line 217, in <module>
main()
File "/workdir/./pipeline/train.py", line 177, in main
train_data, valid_data = train_valid_test_datasets_provider(
File "/workdir/pipeline/data_utils/__init__.py", line 7, in train_valid_test_datasets_provider
train_ds, valid_ds = build_train_valid_test_datasets(
File "/workdir/pipeline/data_utils/__init__.py", line 21, in build_train_valid_test_datasets
train_ds = MultiModalDataset(input_file[0], tokenizer, train_processors, max_length)
File "/workdir/pipeline/data_utils/xgpt3_dataset.py", line 49, in __init__
self.dataset += load_jsonl(input_file)
File "/workdir/pipeline/data_utils/xgpt3_dataset.py", line 35, in load_jsonl
return [json.loads(l.strip("\n")) for l in f.readlines()]
File "/workdir/pipeline/data_utils/xgpt3_dataset.py", line 35, in <listcomp>
return [json.loads(l.strip("\n")) for l in f.readlines()]
File "/workdir/conda_envs/mplug_owl/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/workdir/conda_envs/mplug_owl/lib/python3.10/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 4 (char 3)
How can I fix this? Are there any other issues I should be concerned about? Thanks!
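For reference, `json.decoder.JSONDecodeError: Extra data` from `load_jsonl` usually means a physical line in the input file contains more (or less) than exactly one JSON value, e.g. because the sample was saved pretty-printed across several lines. A minimal sketch of writing one sample per line so that the per-line `json.loads` in `xgpt3_dataset.py` can parse it (the sample's field names are taken from the snippet above; the file name `train.jsonl` and the placeholder conversation text are illustrative):

```python
import json

# Hypothetical sample in the SFT format shown above; the conversation
# text here is a placeholder, not real training data.
sample = {
    "image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"],
    "text": "Human: <image>\nHuman: question\nAI: answer",
    "task_type": "gpt4instruct_sft",
}

# load_jsonl() calls json.loads() on every physical line, so each
# sample must be serialized onto exactly one line. json.dumps escapes
# the embedded "\n" characters, keeping the record on a single line;
# ensure_ascii=False preserves Chinese text as-is instead of \uXXXX.
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Round-trip check mirroring the repo's load_jsonl logic.
with open("train.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line.strip("\n")) for line in f]
```

A pretty-printed object (one key per line) or a top-level JSON array would fail this per-line parse, which matches the traceback above.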
Also, can this repo fine-tune the multilingual model? I got this error:
TypeError: not a string
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'BloomTokenizerFast'. The class this function is called from is 'MplugOwlTokenizer'.