AnswerDotAI / fsdp_qlora

Training LLMs with QLoRA + FSDP
Apache License 2.0

process 0 terminated with signal SIGKILL #47

Open hsb1995 opened 5 months ago

hsb1995 commented 5 months ago

I'm interested in your project and appreciate all the work that went into it, but I ran into the bug below. Please help! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73

```
World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000:   0%|          | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in <module>
    def main(
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
    mp.spawn(fsdp_main,
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

Process finished with exit code 1
```
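For what it's worth, a SIGKILL at this point usually comes from outside Python, most often the Linux out-of-memory killer reaping the training process while the ~70B-parameter model (see the "all params" count in the log) is being loaded and quantized on a host with limited CPU RAM. Below is a minimal sketch, assuming a Linux host where `dmesg` is readable without sudo (otherwise `journalctl -k` is an alternative), to check whether the kernel OOM killer was responsible; the helper name is just for illustration:

```python
import subprocess

def oom_killer_events():
    """Return kernel log lines that mention the OOM killer (Linux only).

    Assumes `dmesg` is available and readable by the current user.
    """
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
    needles = ("out of memory", "oom-killer", "killed process")
    return [line for line in log.splitlines()
            if any(n in line.lower() for n in needles)]

if __name__ == "__main__":
    for line in oom_killer_events():
        print(line)
```

If the spawned python processes show up in that output, the failure is host-RAM related rather than a CUDA or training-code error.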

hsb1995 commented 5 months ago

```
Package                    Version
accelerate                 0.29.1
aiohttp                    3.9.3
aiosignal                  1.3.1
appdirs                    1.4.4
asttokens                  2.4.1
async-timeout              4.0.3
attrs                      23.2.0
bitsandbytes               0.43.0
black                      24.3.0
Brotli                     1.1.0
certifi                    2022.12.7
charset-normalizer         2.1.1
click                      8.1.7
coloredlogs                15.0.1
datasets                   2.18.0
decorator                  5.1.1
dill                       0.3.8
docker-pycreds             0.4.0
exceptiongroup             1.2.0
executing                  2.0.1
fastcore                   1.5.29
filelock                   3.9.0
fire                       0.6.0
frozenlist                 1.4.1
fsspec                     2024.2.0
gitdb                      4.0.11
GitPython                  3.1.43
hqq                        0.1.6.post2
hqq-aten                   0.0.0
huggingface-hub            0.22.2
humanfriendly              10.0
idna                       3.4
inflate64                  1.0.0
ipython                    8.23.0
jedi                       0.19.1
Jinja2                     3.1.2
llama-recipes              0.0.1
loralib                    0.1.2
MarkupSafe                 2.1.3
matplotlib-inline          0.1.6
mpmath                     1.3.0
multidict                  6.0.5
multiprocess               0.70.16
multivolumefile            0.2.3
mypy-extensions            1.0.0
networkx                   3.2.1
numpy                      1.26.3
nvidia-cublas-cu11         11.11.3.6
nvidia-cuda-cupti-cu11     11.8.87
nvidia-cuda-nvrtc-cu11     11.8.89
nvidia-cuda-runtime-cu11   11.8.89
nvidia-cudnn-cu11          8.7.0.84
nvidia-cufft-cu11          10.9.0.58
nvidia-curand-cu11         10.3.0.86
nvidia-cusolver-cu11       11.4.1.48
nvidia-cusparse-cu11       11.7.5.86
nvidia-nccl-cu11           2.19.3
nvidia-nvtx-cu11           11.8.86
optimum                    1.18.0
packaging                  24.0
pandas                     2.2.1
parso                      0.8.4
pathspec                   0.12.1
peft                       0.10.0
pexpect                    4.9.0
pillow                     10.2.0
pip                        23.3.1
platformdirs               4.2.0
prompt-toolkit             3.0.43
protobuf                   4.25.3
psutil                     5.9.8
ptyprocess                 0.7.0
pure-eval                  0.2.2
py7zr                      0.21.0
pyarrow                    15.0.2
pyarrow-hotfix             0.6
pybcj                      1.0.2
pycryptodomex              3.20.0
Pygments                   2.17.2
pyppmd                     1.1.0
python-dateutil            2.9.0.post0
pytz                       2024.1
PyYAML                     6.0.1
pyzstd                     0.15.10
regex                      2023.12.25
requests                   2.28.1
safetensors                0.4.2
scipy                      1.13.0
sentencepiece              0.2.0
sentry-sdk                 1.44.1
setproctitle               1.3.3
setuptools                 68.2.2
six                        1.16.0
smmap                      5.0.1
stack-data                 0.6.3
sympy                      1.12
termcolor                  2.4.0
texttable                  1.7.0
timm                       0.9.16
tokenize-rt                5.2.0
tokenizers                 0.15.2
tomli                      2.0.1
torch                      2.2.0+cu118
torchaudio                 2.2.0+cu118
torchvision                0.17.0+cu118
tqdm                       4.66.2
traitlets                  5.14.2
transformers               4.39.3
triton                     2.2.0
typing_extensions          4.8.0
tzdata                     2024.1
urllib3                    1.26.13
wandb                      0.16.6
wcwidth                    0.2.13
wheel                      0.41.2
xxhash                     3.4.1
yarl                       1.9.4
```

hsb1995 commented 5 months ago

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3630      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   4142333      C   .../sam/anaconda3/envs/fsdp/bin/python     3990MiB |
|    1   N/A  N/A      3630      G   /usr/lib/xorg/Xorg                          243MiB |
|    1   N/A  N/A      3758      G   /usr/bin/gnome-shell                          9MiB |
|    1   N/A  N/A   3249081      G   /usr/libexec/gnome-shell-portal-helper        4MiB |
|    1   N/A  N/A   4142334      C   .../sam/anaconda3/envs/fsdp/bin/python     3968MiB |
+---------------------------------------------------------------------------------------+
```

I can confirm that both processes run in parallel during the "Loading & Quantizing Model Shards" step. But once the shards finish loading, the run dies with "process 0 terminated with signal SIGKILL".
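Since each GPU only shows about 4 GiB in use in the table above, it may be worth watching host RAM and swap while the shards are loading and quantizing. Here is a minimal sketch using psutil (already present in the environment listed above); the function name and interval are just for illustration:

```python
import time
import psutil

def log_host_memory(interval_s: float = 5.0):
    """Periodically print host RAM and swap usage.

    Run this in a separate terminal while train.py loads and quantizes
    the model shards.
    """
    while True:
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        print(f"RAM used {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB | "
              f"swap used {sw.used / 2**30:.1f} GiB", flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    log_host_memory()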

hsb1995 commented 5 months ago

[image] The smaller model weights train fine, but the large weights fail as shown above.
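A rough back-of-the-envelope estimate, using only the parameter counts from the training log above, shows why the large weights are so much heavier; the bytes-per-parameter figures are assumptions (quantization metadata and loading buffers are ignored), not measurements:

```python
# Rough memory estimate for the model in the log above (hypothetical
# storage costs; actual overhead depends on quantization metadata, etc.).
total_params = 69_721_137_152        # "all params" from the training log
trainable_params = 744_488_960       # LoRA params from the training log

full_bf16_gib = total_params * 2 / 2**30          # 16-bit weights
quantized_4bit_gib = total_params * 0.5 / 2**30   # 4-bit storage
lora_bf16_gib = trainable_params * 2 / 2**30      # LoRA adapters in bf16

print(f"bf16 weights:  ~{full_bf16_gib:.0f} GiB")   # ~130 GiB
print(f"4-bit weights: ~{quantized_4bit_gib:.0f} GiB")  # ~32 GiB
print(f"LoRA (bf16):   ~{lora_bf16_gib:.1f} GiB")   # ~1.4 GiB
```

Even the 4-bit form is roughly 32 GiB split across two 24 GiB 3090s, and (as I understand the loading path) the 16-bit shards pass through CPU RAM while they are read and quantized, which is exactly where a host with limited memory can be pushed into the OOM killer and the process receives SIGKILL.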

hsb1995 commented 5 months ago

[image] My code runs on dual RTX 3090s. Could the authors please take a look?