OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0
1.71k stars, 160 forks

Error running train_ppo_llama_ray.sh on two H800 machines #318

Closed. yangzhipeng1108 closed this issue 3 weeks ago

yangzhipeng1108 commented 3 weeks ago

At least one of the input arguments for this task could not be computed: ray.exceptions.OwnerDiedError: Failed to retrieve object 004553850c97129b58c533c101cb5c1bc4de6d930200000002e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init().

The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*4fe82a45e0c8ef9803c3c57b6583ae52de04fd6c5da6abc6f49a8bd9* at IP address 0.0.0.0) for more information about the Python worker failure.
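The error text above suggests two diagnostics: enabling `RAY_record_ref_creation_sites=1` and reading the worker's log files under `/tmp/ray/session_latest/logs`. A minimal Python sketch of both follows. The helper names (`env_with_ref_tracking`, `find_worker_logs`, `tail`) are hypothetical and not part of Ray or OpenRLHF, and the log search has to run on the node that hosted the failed worker.

```python
import os
from pathlib import Path

# Default log location quoted in the error message above.
RAY_LOG_DIR = "/tmp/ray/session_latest/logs"


def env_with_ref_tracking(base_env=None):
    """Return a copy of the environment with ObjectRef creation-site recording on."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_record_ref_creation_sites"] = "1"
    return env


def find_worker_logs(worker_id, log_dir=RAY_LOG_DIR):
    """List log files whose name contains the worker ID from the error message."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob(f"*{worker_id}*") if p.is_file())


def tail(path, n=20):
    """Return the last n lines of a log file, tolerating undecodable bytes."""
    return Path(path).read_text(errors="replace").splitlines()[-n:]


if __name__ == "__main__":
    # Worker ID copied from the log path in the error message.
    worker_id = "4fe82a45e0c8ef9803c3c57b6583ae52de04fd6c5da6abc6f49a8bd9"
    for log in find_worker_logs(worker_id):
        print(f"== {log}")
        print("\n".join(tail(log)))
```

The dictionary returned by `env_with_ref_tracking()` could be passed as `env=` to whatever process launches `ray start`. In the driver script, the same variable would need to be set in `os.environ` before `ray.init()` is called, as the error message indicates.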


hijkzzz commented 3 weeks ago

Are you using the container from https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile?

yangzhipeng1108 commented 3 weeks ago

Package Version

absl-py 2.0.0
accelerate 0.30.1
aiohttp 3.9.1
aiohttp-cors 0.7.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.4.0
apex 0.1
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
asttokens 2.4.1
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.1.0
audioread 3.0.1
beautifulsoup4 4.12.2
bitsandbytes 0.43.1
bleach 6.1.0
blis 0.7.11
cachetools 5.3.2
catalogue 2.0.10
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpathlib 0.16.0
cloudpickle 3.0.0
cmake 3.27.9
coloredlogs 15.0.1
colorful 0.5.6
comm 0.2.0
confection 0.1.4
contourpy 1.2.0
cubinlinker 0.3.0+2.gbde7348
cuda-python 12.3.0rc4+8.gcb4e395
cudf 23.10.0
cugraph 23.10.0
cugraph-dgl 23.10.0
cugraph-service-client 23.10.0
cugraph-service-server 23.10.0
cuml 23.10.0
cupy-cuda12x 12.2.0
cycler 0.12.1
cymem 2.0.8
Cython 3.0.6
dask 2023.9.2
dask-cuda 23.10.0
dask-cudf 23.10.0
datasets 2.19.2
debugpy 1.8.0
decorator 5.1.1
deepspeed 0.13.5
defusedxml 0.7.1
dill 0.3.8
diskcache 5.6.3
distlib 0.3.8
distributed 2023.9.2
distro 1.9.0
dm-tree 0.1.8
dnspython 2.6.1
docker-pycreds 0.4.0
einops 0.7.0
email_validator 2.1.1
exceptiongroup 1.2.0
execnet 2.0.2
executing 2.0.1
expecttest 0.1.3
fastapi 0.111.0
fastapi-cli 0.0.4
fastjsonschema 2.19.0
fastrlock 0.8.2
filelock 3.13.1
flash-attn 2.5.8
fonttools 4.46.0
frozenlist 1.4.0
fsspec 2023.12.0
gast 0.5.4
gitdb 4.0.11
GitPython 3.1.43
google-api-core 2.19.0
google-auth 2.25.0
google-auth-oauthlib 0.4.6
googleapis-common-protos 1.63.1
graphsurgeon 0.4.6
grpcio 1.59.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.23.3
humanfriendly 10.0
hypothesis 5.35.1
idna 3.6
importlib-metadata 7.0.0
iniconfig 2.0.0
intel-openmp 2021.4.0
interegular 0.3.3
ipykernel 6.27.1
ipython 8.18.1
ipython-genutils 0.2.0
isort 5.13.2
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
json5 0.9.14
jsonlines 4.0.0
jsonschema 4.20.0
jsonschema-specifications 2023.11.2
jupyter_client 8.6.0
jupyter_core 5.5.0
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab_pygments 0.3.0
jupyterlab-server 1.2.0
jupytext 1.16.0
kiwisolver 1.4.5
langcodes 3.3.0
lark 1.1.9
lazy_loader 0.3
librosa 0.10.1
lightning-utilities 0.11.2
linkify-it-py 2.0.3
llvmlite 0.40.1
lm-format-enforcer 0.9.8
locket 1.0.0
loralib 0.1.2
Markdown 3.5.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.8.2
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
memray 1.12.0
mistune 3.0.2
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.1.0
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
multiprocess 0.70.16
murmurhash 1.0.10
nbclient 0.9.0
nbconvert 7.12.0
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 2.6.3
ninja 1.11.1.1
notebook 6.4.10
numba 0.57.1+1.g4157f3379
numpy 1.24.4
nvfuser 0.1.1+gitunknown
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-dali-cuda120 1.32.0
nvidia-ml-py 12.555.43
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.40
nvidia-nvtx-cu12 12.1.105
nvidia-pyindex 1.0.9
nvtx 0.2.5
oauthlib 3.2.2
onnx 1.15.0rc2
openai 1.31.1
opencensus 0.11.4
opencensus-context 0.1.3
opencv 4.7.0
openrlhf 0.2.9
optimum 1.20.0
optree 0.10.0
orjson 3.10.3
outlines 0.0.34
packaging 23.2
pandas 1.5.3
pandocfilters 1.5.0
parso 0.8.3
partd 1.4.1
peft 0.11.1
pexpect 4.9.0
Pillow 9.5.0
pip 23.3.1
platformdirs 4.1.0
pluggy 1.3.0
ply 3.11
polygraphy 0.49.1
pooch 1.8.0
preshed 3.0.9
prettytable 3.9.0
prometheus-client 0.19.0
prometheus-fastapi-instrumentator 7.0.0
prompt-toolkit 3.0.41
proto-plus 1.23.0
protobuf 4.24.4
psutil 5.9.4
ptxcompiler 0.8.1+2.g5ad1474
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
py-spy 0.3.14
pyarrow 12.0.1
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybind11 2.11.1
pybind11-global 2.11.1
pycocotools 2.0+nv0.8.0
pycparser 2.21
pydantic 2.5.2
pydantic_core 2.14.5
Pygments 2.17.2
pylibcugraph 23.10.0
pylibcugraphops 23.10.0
pylibraft 23.10.0
pynvml 11.4.1
pyparsing 3.1.1
pytest 7.4.3
pytest-flakefinder 1.1.0
pytest-rerunfailures 13.0
pytest-shard 0.1.2
pytest-xdist 3.5.0
python-dateutil 2.8.2
python-dotenv 1.0.1
python-hostlist 1.23.0
python-multipart 0.0.9
pytorch-quantization 2.1.2
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.2
raft-dask 23.10.0
ray 2.23.0
referencing 0.31.1
regex 2023.10.3
requests 2.32.3
requests-oauthlib 1.3.1
rich 13.7.0
rmm 23.10.0
rpds-py 0.13.2
rsa 4.9
safetensors 0.4.3
scikit-learn 1.2.0
scipy 1.11.4
Send2Trash 1.8.2
sentencepiece 0.2.0
sentry-sdk 2.4.0
setproctitle 1.3.3
setuptools 68.2.2
shellingham 1.5.4
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sniffio 1.3.1
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.5
soxr 0.3.7
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
sphinx-glpi-theme 0.4.1
srsly 2.4.8
stack-data 0.6.3
starlette 0.37.2
sympy 1.12
tabulate 0.9.0
tbb 2021.11.0
tblib 3.0.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 8.6.1
terminado 0.18.0
textual 0.65.1
thinc 8.2.1
threadpoolctl 3.2.0
thriftpy2 0.4.17
tiktoken 0.6.0
tinycss2 1.2.1
tokenizers 0.19.1
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.3.0
torch-tensorrt 2.2.0a0
torchdata 0.7.0a0
torchmetrics 1.4.0.post0
torchtext 0.17.0a0
torchvision 0.17.0a0
tornado 6.4
tqdm 4.66.1
traitlets 5.9.0
transformers 4.41.2
transformers-stream-generator 0.0.5
treelite 3.9.1
treelite-runtime 3.9.1
triton 2.3.0
typer 0.12.3
types-dataclasses 0.6.6
typing_extensions 4.8.0
uc-micro-py 1.0.3
ucx-py 0.34.0
uff 0.6.9
ujson 5.10.0
urllib3 1.26.18
uvicorn 0.30.1
uvloop 0.19.0
virtualenv 20.26.2
vllm 0.4.2
vllm-nccl-cu12 2.18.1.0.4.0
wandb 0.17.0
wasabi 1.1.2
watchfiles 0.22.0
wcwidth 0.2.12
weasel 0.3.4
webencodings 0.5.1
websockets 12.0
Werkzeug 3.0.1
wheel 0.42.0
xdoctest 1.0.2
xformers 0.0.26.post1
xxhash 3.4.1
yarl 1.9.3
zict 3.0.0
zipp 3.17.0
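Since the run spans two machines, one quick check is whether both nodes report the same versions of the main packages. The sketch below is only an illustration: the thread does not establish a version mismatch as the cause, the shortlist in `KEY_PACKAGES` is an assumption, and `installed_versions` and `diff_versions` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed shortlist for illustration; adjust to the packages that matter for the run.
KEY_PACKAGES = ["ray", "vllm", "torch", "deepspeed", "flash-attn", "transformers", "openrlhf"]


def installed_versions(packages=KEY_PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def diff_versions(a, b):
    """Return {package: (version_a, version_b)} for every package that differs."""
    names = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in names if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run on each node and compare the printed lines by hand or with diff_versions().
    for name, ver in installed_versions().items():
        print(name, ver)
```

Running the script on each node and passing the two resulting dictionaries to `diff_versions` would list only the packages whose versions disagree.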

yangzhipeng1108 commented 3 weeks ago

Do you use the container https://github.com/OpenLLMAI/OpenRLHF/tree/main/docker

use this dockerfile