DLLXW / baby-llama2-chinese

A repository for pretraining from scratch + SFT of a small-parameter Chinese LLaMa2; a single 24GB GPU is enough to end up with a chat-llama2 capable of simple Chinese Q&A.

Could you tell me where my configuration is wrong for this error? #62

Open beginner-wj opened 4 months ago

beginner-wj commented 4 months ago

```
torchrun --standalone --nproc_per_node=4 pretrain.py
# OR
python -m torch.distributed.launch --nproc_per_node=1 pretrain.py
```

```
[2024-03-07 17:54:29,710] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING]
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING]
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING]
tokens per iteration will be: 32,768
breaks down as: 1 grad accum steps * 4 processes * 16 batch size * 512 max seq len
memmap:True train data.shape:(6936803, 512)
downloading finished.....
Initializing a new model from scratch
Traceback (most recent call last):
  File "pretrain.py", line 239, in <module>
    torch.cuda.set_device(device)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\cuda\__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

The same traceback is raised by each of the worker processes.
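For context, `invalid device ordinal` from `torch.cuda.set_device()` means the process asked for a GPU index that does not exist on the machine. `torchrun --nproc_per_node=4` spawns four workers with `LOCAL_RANK` 0–3, and `pretrain.py` line 239 calls `torch.cuda.set_device(device)`, where `device` presumably comes from that rank (the usual DDP-style mapping); any worker whose rank is not smaller than `torch.cuda.device_count()` then fails with exactly this error. A minimal sketch of the relationship, assuming that mapping (variable names below are illustrative, not taken from `pretrain.py`):

```python
import os

import torch

# torchrun exports LOCAL_RANK (0 .. nproc_per_node-1) for every worker it spawns.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
num_gpus = torch.cuda.device_count()

if local_rank >= num_gpus:
    # This is the situation behind "RuntimeError: CUDA error: invalid device ordinal":
    # the requested ordinal (local_rank) does not exist on this machine.
    raise RuntimeError(
        f"LOCAL_RANK={local_rank} but only {num_gpus} CUDA device(s) are visible; "
        f"relaunch with --nproc_per_node <= {num_gpus} or adjust CUDA_VISIBLE_DEVICES."
    )

torch.cuda.set_device(local_rank)  # the ordinal exists, so this succeeds
```

In other words, the number passed to `--nproc_per_node` has to stay within the number of GPUs PyTorch can actually see on the box.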

```
num decayed parameter tensors: 57, with 58,470,912 parameters
num non-decayed parameter tensors: 17, with 8,704 parameters
using fused AdamW: True
[2024-03-07 17:54:34,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal CTRL_C_EVENT
[2024-03-07 17:55:04,909] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGINT
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGTERM
Traceback (most recent call last):
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
    result = self._invoke_run(role)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 869, in _invoke_run
    run_result = self._monitor_workers(self._worker_group)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 329, in _monitor_workers
    result = self._pcontext.wait(0)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 277, in wait
    return self._poll()
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 661, in _poll
    self.close()  # terminate all running procs
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 706, in _close
    handler.proc.wait(time_to_wait)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1079, in wait
    return self._wait(timeout=timeout)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1357, in _wait
    result = _winapi.WaitForSingleObject(self._handle,
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1860 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 812, in main
    run(args)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 803, in run
    elastic_launch(
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
    result = agent.run()
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 734, in run
    self._shutdown(e.sigval)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 311, in _shutdown
    self._pcontext.close(death_sig)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 699, in _close
    handler.close(death_sig=death_sig)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 582, in close
    self.proc.send_signal(death_sig)
  File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1434, in send_signal
    raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
```
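The final `ValueError: Unsupported signal: 2` is a side effect of the shutdown rather than a separate configuration problem: once the workers die, the elastic agent forwards signal 2 (`SIGINT`) to the remaining process via `Popen.send_signal`, but on Windows `subprocess` only accepts `SIGTERM`, `CTRL_C_EVENT` and `CTRL_BREAK_EVENT`, so the forwarding call itself raises. A tiny stand-alone illustration of that Windows behaviour (not code from the repo):

```python
import signal
import subprocess
import sys

# Spawn a throwaway child process purely to demonstrate the limitation.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

try:
    # On Windows, Popen.send_signal only understands SIGTERM, CTRL_C_EVENT and
    # CTRL_BREAK_EVENT; signal.SIGINT (value 2) hits the "Unsupported signal"
    # branch, which is the last line of the traceback above.
    child.send_signal(signal.SIGINT)
except ValueError as exc:
    print(exc)  # Windows: "Unsupported signal: 2" (on Linux the signal is simply delivered)
finally:
    child.terminate()
```

So the underlying issue to fix is the device-ordinal mismatch; the signal error is just the launcher failing to shut down cleanly on Windows afterwards.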