neverstoplearn commented 1 year ago

CUDA SETUP: Loading binary /home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
install flash-attn first.
install flash-attn first.

Traceback (most recent call last):
/home/user/zx/mPLUG-Owl/./pipeline/train.py:102 in <module>
     99
    100
    101
  ❱ 102 class CustomTrainer(Trainer):
    103     def __init__(self, **kwargs):
    104         super().__init__(**kwargs)
    105
/home/user/zx/mPLUG-Owl/./pipeline/train.py:118 in CustomTrainer
    115             collate_fn=batchify)
    116
    117
  ❱ 118     def get_eval_dataloader(self, eval_dataset: Dataset | None = None) -> DataLoader:
    119         dataset = self.eval_dataset
    120         sampler = DistributedSampler(dataset, shuffle=False)
    121         return torch.utils.data.DataLoader(
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Traceback (most recent call last):
/home/user/zx/mPLUG-Owl/./pipeline/train.py:102 in <module>
     99
    100
    101
  ❱ 102 class CustomTrainer(Trainer):
    103     def __init__(self, **kwargs):
    104         super().__init__(**kwargs)
    105
/home/user/zx/mPLUG-Owl/./pipeline/train.py:118 in CustomTrainer
    115             collate_fn=batchify)
    116
    117
  ❱ 118     def get_eval_dataloader(self, eval_dataset: Dataset | None = None) -> DataLoader:
    119         dataset = self.eval_dataset
    120         sampler = DistributedSampler(dataset, shuffle=False)
    121         return torch.utils.data.DataLoader(
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 61511) of binary: /home/user/anaconda3/envs/pytorch1.13/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/pytorch1.13/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./pipeline/train.py FAILED

Failures:
[1]:
  time       : 2023-08-21_18:46:36
  host       : user
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 61512)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-08-21_18:46:36
  host       : user
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 61511)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
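
For context on the TypeError in the log: the paths show a Python 3.8 environment, and the `Dataset | None` union syntax in annotations only works at runtime from Python 3.10 on (PEP 604). On 3.8 the annotation is evaluated when the class body runs, which would explain why the failure happens at the `def` line before any data is loaded. A minimal sketch of one possible workaround, using `typing.Optional`; the `Dataset`, `DataLoader`, and `CustomTrainer` classes below are simplified stand-ins, not the actual classes from train.py:

```python
from typing import Optional


class Dataset:
    """Hypothetical stand-in for torch.utils.data.Dataset."""


class DataLoader:
    """Hypothetical stand-in for torch.utils.data.DataLoader."""

    def __init__(self, dataset):
        self.dataset = dataset


class CustomTrainer:
    """Simplified stand-in for the trainer subclass in train.py."""

    def __init__(self, eval_dataset=None):
        self.eval_dataset = eval_dataset

    # Optional[Dataset] means the same as `Dataset | None`,
    # but it is also valid on Python 3.8 and 3.9.
    def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
        # Assumption for this sketch: fall back to the trainer's own
        # dataset when no explicit one is passed.
        dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        return DataLoader(dataset)


trainer = CustomTrainer(eval_dataset=Dataset())
loader = trainer.get_eval_dataloader()
print(type(loader.dataset).__name__)
```

Other options that should also avoid the error are adding `from __future__ import annotations` as the first statement of the file, or running the script on Python 3.10 or newer.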

neverstoplearn commented 1 year ago

It seems that my dataset is causing the error. I am using a custom dataset to fine-tune the model. The entries in my sft_train.jsonl look like this:

{ "image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"], "text": "以下是一个好奇的人类和人工智能助手之间的对话。助理对用户的问题提供有用、详细且礼貌的回答。\nHuman: \nHuman: 这张图片里有哪些违规事件类型？\nAI: 这张图片的违规事件类型有道路不洁", "task_type": "gpt4instruct_sft" }

How can I fix it? Thanks.

MAGAer13 commented 1 year ago

You did not add the <image> token to your data.
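
A quick way to check a JSONL file for this before training is sketched below. It assumes that each record should contain one `<image>` token in "text" per entry in "image"; that one-to-one rule is an assumption based on the comment above, not something confirmed from the repo's code. The function name `find_records_missing_image_token` and the sample paths are made up for illustration.

```python
import json

IMAGE_TOKEN = "<image>"


def find_records_missing_image_token(jsonl_lines):
    """Return (line_number, n_images, n_tokens) for records whose counts differ."""
    problems = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        n_images = len(record.get("image", []))
        n_tokens = record.get("text", "").count(IMAGE_TOKEN)
        # Assumed rule: one <image> token per listed image.
        if n_images != n_tokens:
            problems.append((line_number, n_images, n_tokens))
    return problems


# Two illustrative records: the first lacks the token, the second has it.
sample = [
    json.dumps({"image": ["images/a.jpg"], "text": "Human: \nHuman: question\nAI: answer"}),
    json.dumps({"image": ["images/a.jpg"], "text": "Human: <image>\nHuman: question\nAI: answer"}),
]
print(find_records_missing_image_token(sample))  # [(1, 1, 0)]
```

To run it on a real file, pass an open file handle, for example `find_records_missing_image_token(open("sft_train.jsonl", encoding="utf-8"))`.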

neverstoplearn commented 1 year ago

You did not add the <image> token to your data.

I am sorry, do you mean like this?

{ "image": ["images/33010001-2022_10_20_09_59_56_HANG_CLOTHES_OUT.jpg"], "text": "以下是一个好奇的人类和人工智能助手之间的对话。助理对用户的问题提供有用、详细且礼貌的回答。\nHuman: <image>\nHuman: 这张图片里有哪些违规事件类型？\nAI: 这张图片的违规事件类型有道路不洁", "task_type": "gpt4instruct_sft" }

Does <image> need to be changed to or ?

neverstoplearn commented 1 year ago

Also, can this repo fine-tune a multilingual model? I got this error:

TypeError: not a string
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BloomTokenizerFast'.
The class this function is called from is 'MplugOwlTokenizer'.

X-PLUG / mPLUG-Owl

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType' #139
