OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

checkpoint shards not loading. Process always gets sent SIGTERM #312

Open xsMarc opened 6 days ago

xsMarc commented 6 days ago

Any help is appreciated (:

Loading checkpoint shards:  14%|███████ | 1/7 [00:13<01:22, 13.68s/it]
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29734 closing signal SIGTERM
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29735 closing signal SIGTERM
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29737 closing signal SIGTERM
E0629 00:06:24.298000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 2 (pid: 29736) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
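(Editor's note, not part of the original report: the key line is `failed (exitcode: -9)`. A negative exit code from torchrun means the worker was killed by a signal, and -9 is SIGKILL, which on Linux is most often the kernel OOM killer terminating a process that ran out of memory while loading the checkpoint shards. A minimal sketch of how to decode such an exit code; the helper name is hypothetical:)

```python
import signal

def describe_exitcode(exitcode: int) -> str:
    """Interpret an exit code as reported by torchrun / multiprocessing.

    A negative exit code means the child process was terminated by a
    signal; -9 therefore corresponds to SIGKILL, commonly sent by the
    Linux OOM killer when the machine runs out of RAM.
    """
    if exitcode < 0:
        return f"killed by signal {signal.Signals(-exitcode).name}"
    return f"exited normally with status {exitcode}"

print(describe_exitcode(-9))  # killed by signal SIGKILL
```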

LDLINGLINGLING commented 3 days ago

Please provide your code.