max_steps is given, it will override any value given in num_train_epochs
Traceback (most recent call last):
  File "/kaggle/working/MiniCPM-V/finetune/finetune.py", line 328, in <module>
    train()
  File "/kaggle/working/MiniCPM-V/finetune/finetune.py", line 318, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1291, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1758, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 240, in __init__
    self._do_sanity_check()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1032, in _do_sanity_check
    raise ValueError("Type fp16 is not supported.")
ValueError: Type fp16 is not supported.
[2024-06-19 00:17:44,932] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 244) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-19_00:17:44
  host      : d90c1cf96f39
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 244)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

### Expected Behavior

_No response_

### Steps To Reproduce

_No response_

### Environment

```Markdown
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
```

### Anything else?

_No response_

OpenBMB / MiniCPM-V

While trying to fine-tune the model on Kaggle, this error appears: ValueError: Type fp16 is not supported. #283

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

finetune.py FAILED
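
The traceback shows the error being raised from DeepSpeed's `_do_sanity_check`, which suggests that fp16 was requested but the device DeepSpeed detected does not report fp16 support. The report leaves the environment section empty, so the actual cause is not confirmed. One plausible cause, stated here as an assumption, is that the Kaggle session ran on CPU or on an older GPU. The sketch below is not part of MiniCPM-V or DeepSpeed: `pick_precision` and `probe_device` are hypothetical helpers, and the compute-capability thresholds (7 for fp16, 8 for bf16) are assumptions that may differ between DeepSpeed versions. It only prints which of the `fp16` / `bf16` training arguments looks safe to enable on the current machine.

```python
from typing import Dict, Optional, Tuple


def pick_precision(
    has_cuda: bool, capability_major: Optional[int], bf16_ok: bool
) -> Dict[str, bool]:
    """Suggest values for the `fp16` / `bf16` training arguments.

    Hypothetical helper, not an API of any library.
    """
    # No usable GPU: fall back to full precision.
    if not has_cuda or capability_major is None:
        return {"fp16": False, "bf16": False}
    # Prefer bf16 when the device supports it.
    if bf16_ok:
        return {"fp16": False, "bf16": True}
    # ASSUMPTION: fp16 is treated as supported from compute capability 7.x
    # upward; the exact check inside DeepSpeed may differ by version.
    return {"fp16": capability_major >= 7, "bf16": False}


def probe_device() -> Tuple[bool, Optional[int], bool]:
    """Return (has_cuda, capability_major, bf16_ok) for GPU 0, if any."""
    try:
        import torch  # imported lazily so the helper above works without it
    except ImportError:
        return False, None, False
    if not torch.cuda.is_available():
        return False, None, False
    major, _minor = torch.cuda.get_device_capability(0)
    # ASSUMPTION: only count bf16 as usable on compute capability 8.x or newer.
    bf16_ok = major >= 8 and torch.cuda.is_bf16_supported()
    return True, major, bf16_ok


if __name__ == "__main__":
    print(pick_precision(*probe_device()))
```

If this prints `{'fp16': False, 'bf16': False}`, the first thing to check would be whether a GPU accelerator is enabled for the notebook, and then whether both the training arguments and the DeepSpeed config avoid requesting fp16.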