X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

TypeError: not a string #141

Closed · neverstoplearn closed 1 year ago

neverstoplearn commented 1 year ago

Also, can this repo fine-tune the multilingual model? I got this error:

TypeError: not a string
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'BloomTokenizerFast'. The class this function is called from is 'MplugOwlTokenizer'.

neverstoplearn commented 1 year ago

{ "image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"], "text": "以下是一个好奇的人类和人工智能助手之间的对话。助理对用户的问题提供有用、详细且礼貌的回答。\nHuman: <image>\nHuman: 这张图片里有哪些违规事件类型?\nAI: 这张图片的违规事件类型有道路不洁", "task_type": "gpt4instruct_sft" }
I also want to know: is this data format right?
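(The Chinese text field above is a conversation asking which violation event types appear in the image, answered with "道路不洁", i.e. an unclean road.) For context, the repo's loader (load_jsonl in pipeline/data_utils/xgpt3_dataset.py, shown in the traceback further down) parses the training file line by line, so each sample must be a single JSON object on one line. A minimal sketch of writing records in that layout; the file name train.jsonl and the samples list are illustrative assumptions, not part of the repo:

import json

# One training sample per element; the "text" placeholder stands in for
# the full conversation string shown above.
samples = [
    {
        "image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"],
        "text": "...",  # placeholder for the conversation string
        "task_type": "gpt4instruct_sft",
    },
]

# ensure_ascii=False keeps the Chinese text readable in the file;
# one object per line is exactly what a line-by-line json.loads expects.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")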

MAGAer13 commented 1 year ago

You can use BloomTokenizerFast directly instead of MplugOwlTokenizer. And your data format is correct.
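A minimal sketch of that swap; the checkpoint path is a stand-in for whatever args.pretrained_ckpt points at. AutoTokenizer is an alternative that resolves the tokenizer class recorded in the checkpoint's tokenizer_config.json (here 'BloomTokenizerFast') instead of hard-coding it:

from transformers import AutoTokenizer, BloomTokenizerFast

pretrained_ckpt = "path/to/pretrained_ckpt"  # assumption: your args.pretrained_ckpt

# Explicitly load the tokenizer class named in the checkpoint...
tokenizer = BloomTokenizerFast.from_pretrained(pretrained_ckpt)

# ...or let transformers pick the class declared by the checkpoint itself.
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)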

TonyAlbertWan commented 1 year ago

Hello! Have you solved the problem? When using BloomTokenizerFast directly, do I have to change any code besides line 141 of train.py:

tokenizer = BloomTokenizerFast.from_pretrained(args.pretrained_ckpt)
# tokenizer = MplugOwlTokenizer.from_pretrained(args.pretrained_ckpt)

However, after changing the code, I got this error:

Traceback (most recent call last):
  File "/workdir/./pipeline/train.py", line 217, in <module>
    main()
  File "/workdir/./pipeline/train.py", line 177, in main
    train_data, valid_data = train_valid_test_datasets_provider(
  File "/workdir/pipeline/data_utils/__init__.py", line 7, in train_valid_test_datasets_provider
    train_ds, valid_ds = build_train_valid_test_datasets(
  File "/workdir/pipeline/data_utils/__init__.py", line 21, in build_train_valid_test_datasets
    train_ds = MultiModalDataset(input_file[0], tokenizer, train_processors, max_length)
  File "/workdir/pipeline/data_utils/xgpt3_dataset.py", line 49, in __init__
    self.dataset += load_jsonl(input_file)
  File "/workdir/pipeline/data_utils/xgpt3_dataset.py", line 35, in load_jsonl
    return [json.loads(l.strip("\n")) for l in f.readlines()]
  File "/workdir/pipeline/data_utils/xgpt3_dataset.py", line 35, in <listcomp>
    return [json.loads(l.strip("\n")) for l in f.readlines()]
  File "/workdir/conda_envs/mplug_owl/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/workdir/conda_envs/mplug_owl/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 4 (char 3)

How can I fix this? Are there any other issues I should be concerned about? Thanks!
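The tokenizer change is unrelated here: "Extra data" from json.loads means a line in your data file contained more than one JSON value, typically because the file is a pretty-printed JSON array rather than strict one-object-per-line JSONL. A quick way to locate the offending line, as a minimal sketch assuming your data file is named train.jsonl:

import json

# Report the first line that is not a single, complete JSON object.
with open("train.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # note: blank lines would also break the repo's loader
        try:
            json.loads(stripped)
        except json.JSONDecodeError as e:
            print(f"line {lineno}: {e}")
            print(repr(stripped[:80]))  # show the start of the bad line
            break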