Closed Zeyuan-Liu closed 2 months ago
Environment:
Python 3.10.14
Package Version
accelerate 0.29.3 aiohttp 3.9.5 aiohttp-cors 0.7.0 aiosignal 1.3.1 annotated-types 0.6.0 anyio 4.2.0 appdirs 1.4.4 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asttokens 2.0.5 async-lru 2.0.4 async-timeout 4.0.3 attrs 23.1.0 Babel 2.11.0 beautifulsoup4 4.12.2 bitsandbytes 0.43.1 bleach 4.1.0 Brotli 1.0.9 cachetools 5.3.3 certifi 2024.2.2 cffi 1.16.0 charset-normalizer 2.0.4 click 8.1.7 coloredlogs 15.0.1 colorful 0.5.6 comm 0.2.1 datasets 2.19.0 debugpy 1.6.7 decorator 5.1.1 deepspeed 0.13.2 defusedxml 0.7.1 dill 0.3.8 distlib 0.3.8 docker-pycreds 0.4.0 einops 0.7.0 exceptiongroup 1.2.0 executing 0.8.3 fastjsonschema 2.16.2 filelock 3.13.1 flash-attn 2.4.2 frozenlist 1.4.1 fsspec 2024.3.1 gitdb 4.0.11 GitPython 3.1.43 gmpy2 2.1.2 google-api-core 2.18.0 google-auth 2.29.0 googleapis-common-protos 1.63.0 grpcio 1.62.2 hjson 3.1.0 huggingface-hub 0.22.2 humanfriendly 10.0 idna 3.4 ipykernel 6.28.0 ipython 8.20.0 ipywidgets 8.1.2 isort 5.13.2 jedi 0.18.1 Jinja2 3.1.3 json5 0.9.6 jsonlines 4.0.0 jsonschema 4.19.2 jsonschema-specifications 2023.12.1 jupyter 1.0.0 jupyter_client 8.6.0 jupyter-console 6.6.3 jupyter_core 5.5.0 jupyter-events 0.8.0 jupyter-lsp 2.2.0 jupyter_server 2.10.0 jupyter_server_terminals 0.4.4 jupyterlab 4.0.11 jupyterlab-pygments 0.1.2 jupyterlab_server 2.25.1 jupyterlab-widgets 3.0.10 lightning-utilities 0.11.2 linkify-it-py 2.0.3 loralib 0.1.2 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib-inline 0.1.6 mdit-py-plugins 0.4.0 mdurl 0.1.2 memray 1.12.0 mistune 2.0.4 mkl-fft 1.3.8 mkl-random 1.2.4 mkl-service 2.4.0 mpi4py 3.1.4 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 nbclient 0.8.0 nbconvert 7.10.0 nbformat 5.9.2 nest-asyncio 1.6.0 networkx 3.1 ninja 1.11.1.1 notebook 7.0.8 notebook_shim 0.2.3 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 
nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.1.105 opencensus 0.11.4 opencensus-context 0.1.3 openrlhf 0.2.6 optimum 1.19.1 overrides 7.4.0 packaging 23.2 pandas 2.2.2 pandocfilters 1.5.0 parso 0.8.3 peft 0.10.0 pexpect 4.8.0 pillow 10.2.0 pip 23.3.1 platformdirs 3.10.0 ply 3.11 prometheus-client 0.14.1 prompt-toolkit 3.0.43 proto-plus 1.23.0 protobuf 4.25.3 psutil 5.9.0 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 py-spy 0.3.14 pyarrow 16.0.0 pyarrow-hotfix 0.6 pyasn1 0.6.0 pyasn1_modules 0.4.0 pycparser 2.21 pydantic 2.7.1 pydantic_core 2.18.2 Pygments 2.15.1 pynvml 11.5.0 PyQt5 5.15.10 PyQt5-sip 12.13.0 PySocks 1.7.1 python-dateutil 2.8.2 python-json-logger 2.0.7 pytz 2024.1 PyYAML 6.0.1 pyzmq 25.1.2 qtconsole 5.5.1 QtPy 2.4.1 ray 2.12.0 referencing 0.35.0 regex 2024.4.16 requests 2.31.0 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 rpds-py 0.10.6 rsa 4.9 safetensors 0.4.3 Send2Trash 1.8.2 sentencepiece 0.2.0 sentry-sdk 2.0.1 setproctitle 1.3.3 setuptools 68.2.2 sip 6.7.12 six 1.16.0 smart-open 7.0.4 smmap 5.0.1 sniffio 1.3.0 soupsieve 2.5 stack-data 0.2.0 sympy 1.12 terminado 0.17.1 textual 0.58.0 tinycss2 1.2.1 tokenizers 0.15.2 tomli 2.0.1 torch 2.3.0 torchaudio 2.3.0 torchmetrics 1.3.2 torchvision 0.18.0 tornado 6.3.3 tqdm 4.66.2 traitlets 5.7.1 transformers 4.38.2 transformers-stream-generator 0.0.5 triton 2.3.0 typing_extensions 4.11.0 tzdata 2024.1 uc-micro-py 1.0.3 urllib3 2.1.0 virtualenv 20.26.0 wandb 0.16.6 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 0.58.0 wheel 0.41.3 widgetsnbextension 4.0.10 wrapt 1.16.0 xxhash 3.4.1 yarl 1.9.4
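For anyone trying to reproduce this, the handful of versions most relevant here can be read straight from the active environment instead of scanning the full listing. This is a minimal sketch using the standard-library importlib.metadata. The helper name installed_versions and the choice of packages to check are illustrative assumptions, not something from the original report.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            result[name] = None
    return result


if __name__ == "__main__":
    # Hypothetical selection of packages; adjust to whatever matters for the issue.
    wanted = ["torch", "deepspeed", "ray", "openrlhf", "transformers", "flash-attn"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name:15s} {ver if ver is not None else 'not installed'}")
```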
could you try the docker container?
When I try to run
bash docker_run.sh build
the following error occurs:
docker build -t nvcr.io/nvidia/pytorch:23.12-py3 /home/wangyx/lzy/rlhf/OpenRLHF-main/dockerfile
[+] Building 0.7s (2/2) FINISHED    docker:default
 => [internal] load build definition from Dockerfile    0.1s
 => => transferring dockerfile: 701B    0.0s
 => ERROR [internal] load metadata for nvcr.io/nvidia/pytorch:23.12-py3    0.5s
[internal] load metadata for nvcr.io/nvidia/pytorch:23.12-py3:
Dockerfile:1
1 | >>> FROM nvcr.io/nvidia/pytorch:23.12-py3
2 |
3 | WORKDIR /app
ERROR: failed to solve: nvcr.io/nvidia/pytorch:23.12-py3: failed to resolve source metadata for nvcr.io/nvidia/pytorch:23.12-py3: failed to do request: Head "https://nvcr.io/v2/nvidia/pytorch/manifests/23.12-py3": dial tcp 54.148.47.228:443: connect: network is unreachable
This looks like a network issue. Can you access nvcr.io/nvidia/pytorch:23.12-py3?
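One way to check this from the machine that runs the build is to attempt a plain TCP connection to the registry on port 443, which is the step that failed in the log above ("connect: network is unreachable"). The following is a minimal sketch. The helper name can_reach and the 5-second timeout are assumptions for illustration. A successful connection only shows that the host is reachable from this Python process; the Docker daemon may still use different proxy or network settings.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    print("nvcr.io:443 reachable:", can_reach("nvcr.io"))
```

If this prints False while the NGC catalog page opens in a browser, the browser may be going through a proxy that the shell or the Docker daemon is not configured to use.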
When I click the link nvcr.io/nvidia/pytorch:23.12-py3 above, I can access
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags
but cannot open
What happened + What you expected to happen:
Operation process:
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
The head node starts successfully:
My Configuration
Error Information