Closed neverstoplearn closed 1 year ago
It seems that my dataset is wrong. I am using a custom dataset to fine-tune the model.
I got the dataset like this in my sft_train.jsonl
{
"image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"],
"text": "以下是一个好奇的人类和人工智能助手之间的对话。助理对用户的问题提供有用、详细且礼貌的回答。\nHuman:
You did not add the <image> token in your data.
I am sorry. Do you mean like this?
{
"image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"],
"text": "以下是一个好奇的人类和人工智能助手之间的对话。助理对用户的问题提供有用、详细且礼貌的回答。\nHuman: <image>\nHuman: 这张图片里有哪些违规事件类型?\nAI: 这张图片的违规事件类型有道路不洁",
"task_type": "gpt4instruct_sft"
}
Does the <image> token need to be changed to something else?
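For anyone hitting the same problem, here is a minimal sketch of a sanity check for the training file. It assumes one <image> token per entry in the "image" list, which is my reading of the reply above and not something confirmed by the repo docs. The helper names `find_token_mismatches` and `load_jsonl` are hypothetical.

```python
import json

IMAGE_TOKEN = "<image>"  # the token the reply above says was missing


def find_token_mismatches(records):
    """Return indices of records whose <image> count differs from the number of image paths.

    Assumption: each path in "image" needs exactly one <image> token in "text".
    """
    bad = []
    for i, rec in enumerate(records):
        n_images = len(rec.get("image", []))
        n_tokens = rec.get("text", "").count(IMAGE_TOKEN)
        if n_images != n_tokens:
            bad.append(i)
    return bad


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Usage would be something like `find_token_mismatches(load_jsonl("sft_train.jsonl"))`. Note that `load_jsonl` expects each record on a single line, so a record pretty-printed over several lines would fail to parse.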
Also, can this repo fine-tune the multilingual model? I got this error:
TypeError: not a string
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BloomTokenizerFast'.
The class this function is called from is 'MplugOwlTokenizer'.
CUDA SETUP: Loading binary /home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
install flash-attn first.
install flash-attn first.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/user/zx/mPLUG-Owl/./pipeline/train.py:102 in │
│ │
│ 99 │
│ 100 │
│ 101 │
│ ❱ 102 class CustomTrainer(Trainer): │
│ 103 │ def __init__(self, **kwargs): │
│ 104 │ │ super().__init__(**kwargs) │
│ 105 │
│ │
│ /home/user/zx/mPLUG-Owl/./pipeline/train.py:118 in CustomTrainer │
│ │
│ 115 │ │ │ collate_fn=batchify) │
│ 116 │ │
│ 117 │ │
│ ❱ 118 │ def get_eval_dataloader(self, eval_dataset: Dataset | None = None) -> DataLoader: │
│ 119 │ │ dataset = self.eval_dataset │
│ 120 │ │ sampler = DistributedSampler(dataset, shuffle=False) │
│ 121 │ │ return torch.utils.data.DataLoader( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/user/zx/mPLUG-Owl/./pipeline/train.py:102 in │
│ │
│ 99 │
│ 100 │
│ 101 │
│ ❱ 102 class CustomTrainer(Trainer): │
│ 103 │ def __init__(self, **kwargs): │
│ 104 │ │ super().__init__(**kwargs) │
│ 105 │
│ │
│ /home/user/zx/mPLUG-Owl/./pipeline/train.py:118 in CustomTrainer │
│ │
│ 115 │ │ │ collate_fn=batchify) │
│ 116 │ │
│ 117 │ │
│ ❱ 118 │ def get_eval_dataloader(self, eval_dataset: Dataset | None = None) -> DataLoader: │
│ 119 │ │ dataset = self.eval_dataset │
│ 120 │ │ sampler = DistributedSampler(dataset, shuffle=False) │
│ 121 │ │ return torch.utils.data.DataLoader( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 61511) of binary: /home/user/anaconda3/envs/pytorch1.13/bin/python
Traceback (most recent call last):
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./pipeline/train.py FAILED
Failures:
[1]:
time : 2023-08-21_18:46:36
host : user
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 61512)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-08-21_18:46:36
host : user
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 61511)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
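A note on the second TypeError in the log: the paths show a Python 3.8 environment, and the `Dataset | None` union syntax in an annotation is only evaluated successfully at runtime from Python 3.10 onward. Below is a minimal sketch of a 3.8-compatible spelling using `typing.Optional`. The `Dataset` class here is a hypothetical stand-in for `torch.utils.data.Dataset`, and the function body is illustrative only, not the repo's actual code.

```python
from typing import Optional


class Dataset:
    """Hypothetical stand-in for torch.utils.data.Dataset, so the sketch runs without torch."""


def get_eval_dataloader(eval_dataset: Optional[Dataset] = None) -> str:
    # Optional[Dataset] means the same as `Dataset | None`, but it also works on Python 3.8.
    # The return value is a placeholder; the real method returns a DataLoader.
    return "default" if eval_dataset is None else "custom"
```

Another option that may work without touching the annotation is adding `from __future__ import annotations` as the first statement of train.py, which defers annotation evaluation. Upgrading the environment to Python 3.10 or later should also avoid the error.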