flowersteam / lamorel

Lamorel is a Python library designed for RL practitioners eager to use Large Language Models (LLMs).
MIT License

Device 0 is not recognized #24

Closed giobin closed 11 months ago

giobin commented 11 months ago

Hello! First of all, very nice work!

I have an issue with running the example PPO_finetuning. It seems that it doesn't recognize the GPU device.
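A quick way to see whether PyTorch itself detects the GPU, independently of lamorel, is sketched below. This check is not part of the original report; `cuda_summary` is a hypothetical helper name, and it only uses `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
def cuda_summary():
    """Report whether PyTorch is importable and how many CUDA devices it sees."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "device_count": 0}
    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


summary = cuda_summary()
print(summary)
```

If `cuda_available` is true and `device_count` matches the machine, the GPU is visible to PyTorch and the problem more likely lies in how the devices are passed along.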

I'm running on this setup:

[Screenshot 2023-11-17 at 18:28:05]

My command is the following:

python -m lamorel_launcher.launch --config-path /data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/ --config-name local_gpu_config rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=t5-small

And this is the error:

/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel_launcher/launch.py:15: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='', config_name='')
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[2023-11-17 18:22:01,325][root][INFO] - Using nproc_per_node=2.
[2023-11-17 18:22:01,325][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py:150: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='config', config_name='config')
/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py:150: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='config', config_name='config')
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[2023-11-17 18:22:03,085][accelerate.utils.other][WARNING] - Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2023-11-17 18:22:03,085][lamorel_logger][INFO] - Init rl group for process 0
[2023-11-17 18:22:03,087][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-11-17 18:22:03,361][lamorel_logger][INFO] - Init rl group for process 1
[2023-11-17 18:22:03,361][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 1
[2023-11-17 18:22:03,361][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2023-11-17 18:22:03,361][lamorel_logger][INFO] - Init llm group for process 1
[2023-11-17 18:22:03,362][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2023-11-17 18:22:03,362][lamorel_logger][INFO] - Init llm group for process 0
[2023-11-17 18:22:03,362][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:3 to store for rank: 0
[2023-11-17 18:22:03,363][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:3 to store for rank: 1
[2023-11-17 18:22:03,363][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-17 18:22:03,363][lamorel_logger][INFO] - Init rl-llm group for process 1
[2023-11-17 18:22:03,373][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-17 18:22:03,373][lamorel_logger][INFO] - Init rl-llm group for process 0
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 1
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 0
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-17 18:22:03,385][lamorel_logger][INFO] - 2 gpus available for current LLM but using only model_parallelism_size = 1
[2023-11-17 18:22:03,385][lamorel_logger][INFO] - Devices on process 1 (index 0): [0]
Parallelising HF LLM on 1 devices
Loading model t5-small
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/gym/utils/passive_env_checker.py:165: UserWarning: WARN: The obs returned by the `reset()` method is not within the observation space.
  logger.warn(f"{pre} is not within the observation space.")
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/gym/utils/passive_env_checker.py:133: UserWarning: WARN: The obs returned by the `reset()` method should be an int or np.int64, actual type: <class 'str'>
  logger.warn(f"{pre} should be an int or np.int64, actual type: {type(obs)}")
Error executing job with overrides: ['rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py', 'rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments', 'lamorel_args.accelerate_args.machine_rank=0', 'lamorel_args.llm_args.model_path=t5-small']
Traceback (most recent call last):
  File "/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py", line 164, in main
    lm_server = Caller(config_args.lamorel_args,
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/caller.py", line 53, in __init__
    Server(
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/server/server.py", line 40, in __init__
    self._model = HF_LLM(config.llm_args, devices, use_cpu)
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 38, in __init__
    device_map = infer_auto_device_map(
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 923, in infer_auto_device_map
    max_memory = get_max_memory(max_memory)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 674, in get_max_memory
    raise ValueError(
ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk''

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py', 'rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments', 'lamorel_args.accelerate_args.machine_rank=0', 'lamorel_args.llm_args.model_path=t5-small']
Traceback (most recent call last):
  File "/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py", line 199, in main
    output = lm_server.custom_module_fns(['score', 'value'],
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/caller.py", line 95, in custom_module_fns
    return self.__call_model(InstructionsEnum.FORWARD, True, module_function_keys=module_function_keys,
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/caller.py", line 99, in __call_model
    dist.gather_object(
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1758, in gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2075, in all_gather
    work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1659484809662/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:15580

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2023-11-17 18:22:06,348][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 71611) of binary: /home/gbonetta/miniconda3/envs/lamorel_env/bin/python
Error executing job with overrides: ['rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py', 'rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments', 'lamorel_args.accelerate_args.machine_rank=0', 'lamorel_args.llm_args.model_path=t5-small']
Traceback (most recent call last):
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
    launch_command(accelerate_args)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-17_18:22:06
  host      : hltnlp-gpu-a
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 71612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-17_18:22:06
  host      : hltnlp-gpu-a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 71611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

My conda env contains the following packages:

conda list
# packages in environment at /home/gbonetta/miniconda3/envs/lamorel_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.0.0                    pypi_0    pypi
accelerate                0.24.1                   pypi_0    pypi
aiohttp                   3.8.6                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
annotated-types           0.6.0                    pypi_0    pypi
antlr4-python3-runtime    4.9.3                    pypi_0    pypi
anyio                     3.7.1                    pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
attrs                     23.1.0                   pypi_0    pypi
babyai                    0.1.0                     dev_0    <develop>
babyai-text               0.1.0                     dev_0    <develop>
blas                      1.0                         mkl  
blosc                     1.11.1                   pypi_0    pypi
brotli-python             1.0.9            py39h6a678d5_7  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.08.22           h06a4308_0  
cachetools                5.3.2                    pypi_0    pypi
certifi                   2023.7.22        py39h06a4308_0  
cffi                      1.15.1           py39h5eee18b_3  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.1.7                    pypi_0    pypi
cloudpickle               3.0.0                    pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
comm                      0.2.0                    pypi_0    pypi
contourpy                 1.2.0                    pypi_0    pypi
cryptography              41.0.3           py39hdda0065_0  
cudatoolkit               11.3.1               h2bc3f7f_2  
cycler                    0.12.1                   pypi_0    pypi
datasets                  2.15.0                   pypi_0    pypi
debugpy                   1.8.0                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
dill                      0.3.7                    pypi_0    pypi
distro                    1.8.0                    pypi_0    pypi
docker-pycreds            0.4.0                    pypi_0    pypi
exceptiongroup            1.1.3                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.13.1                   pypi_0    pypi
fonttools                 4.44.3                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0  
frozenlist                1.4.0                    pypi_0    pypi
fsspec                    2023.10.0                pypi_0    pypi
giflib                    5.2.1                h5eee18b_3  
gitdb                     4.0.11                   pypi_0    pypi
gitpython                 3.1.40                   pypi_0    pypi
gmp                       6.2.1                h295c915_3  
gnutls                    3.6.15               he1e5248_0  
google-auth               2.23.4                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.59.2                   pypi_0    pypi
gym                       0.26.1                   pypi_0    pypi
gym-minigrid              1.0.1                     dev_0    <develop>
gym-notices               0.0.8                    pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
httpcore                  1.0.2                    pypi_0    pypi
httpx                     0.25.1                   pypi_0    pypi
huggingface-hub           0.19.3                   pypi_0    pypi
hydra-core                1.3.2                    pypi_0    pypi
idna                      3.4              py39h06a4308_0  
imageio                   2.32.0                   pypi_0    pypi
importlib-metadata        6.8.0                    pypi_0    pypi
importlib-resources       6.1.1                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
ipykernel                 6.26.0                   pypi_0    pypi
ipython                   8.17.2                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jpeg                      9e                   h5eee18b_1  
jupyter-client            8.6.0                    pypi_0    pypi
jupyter-core              5.5.0                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
lame                      3.100                h7b6447c_0  
lamorel                   0.1                       dev_0    <develop>
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libdeflate                1.17                 h5eee18b_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.1                h6a678d5_0  
libunistring              0.9.10               h27cfd23_0  
libwebp                   1.3.2                h11a3e52_0  
libwebp-base              1.3.2                h5eee18b_0  
lz4-c                     1.9.4                h6a678d5_0  
markdown                  3.5.1                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
matplotlib                3.8.1                    pypi_0    pypi
matplotlib-inline         0.1.6                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0            py39h5eee18b_1  
mkl_fft                   1.3.8            py39h5eee18b_0  
mkl_random                1.2.4            py39hdb19cb5_0  
multidict                 6.0.4                    pypi_0    pypi
multiprocess              0.70.15                  pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.5.8                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1  
numpy                     1.26.0           py39h5f9d8c6_0  
numpy-base                1.26.0           py39hb5e798b_0  
oauthlib                  3.2.2                    pypi_0    pypi
omegaconf                 2.3.0                    pypi_0    pypi
openai                    1.3.0                    pypi_0    pypi
openh264                  2.1.1                h4ff587b_0  
openjpeg                  2.4.0                h3ad879b_0  
openssl                   3.0.12               h7f8727e_0  
packaging                 23.2                     pypi_0    pypi
pandas                    2.1.3                    pypi_0    pypi
parso                     0.8.3                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pillow                    10.0.1           py39ha6cbd5a_0  
pip                       23.3             py39h06a4308_0  
platformdirs              4.0.0                    pypi_0    pypi
prompt-toolkit            3.0.41                   pypi_0    pypi
protobuf                  3.20.3                   pypi_0    pypi
psutil                    5.9.6                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pyarrow                   14.0.1                   pypi_0    pypi
pyarrow-hotfix            0.5                      pypi_0    pypi
pyasn1                    0.5.0                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydantic                  2.5.1                    pypi_0    pypi
pydantic-core             2.14.3                   pypi_0    pypi
pygments                  2.16.1                   pypi_0    pypi
pyopenssl                 23.2.0           py39h06a4308_0  
pyparsing                 3.1.1                    pypi_0    pypi
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.18               h955ad1f_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   1.12.1          py3.9_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3.post1             pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
pyzmq                     25.1.1                   pypi_0    pypi
readline                  8.2                  h5eee18b_0  
regex                     2023.10.3                pypi_0    pypi
requests                  2.31.0           py39h06a4308_0  
requests-oauthlib         1.3.1                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
safetensors               0.4.0                    pypi_0    pypi
scipy                     1.11.3                   pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
sentry-sdk                1.35.0                   pypi_0    pypi
setproctitle              1.3.3                    pypi_0    pypi
setuptools                68.0.0           py39h06a4308_0  
six                       1.16.0                   pypi_0    pypi
smmap                     5.0.1                    pypi_0    pypi
sniffio                   1.3.0                    pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0  
stack-data                0.6.3                    pypi_0    pypi
tbb                       2021.8.0             hdb19cb5_0  
tensorboard               2.7.0                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tensorboardx              1.8                      pypi_0    pypi
termcolor                 2.3.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.15.0                   pypi_0    pypi
torchaudio                0.12.1               py39_cu113    pytorch
torchvision               0.13.1               py39_cu113    pytorch
tornado                   6.3.3                    pypi_0    pypi
tqdm                      4.64.0                   pypi_0    pypi
traitlets                 5.13.0                   pypi_0    pypi
transformers              4.35.2                   pypi_0    pypi
typing-extensions         4.8.0                    pypi_0    pypi
typing_extensions         4.7.1            py39h06a4308_0  
tzdata                    2023.3                   pypi_0    pypi
urllib3                   1.26.18          py39h06a4308_0  
wandb                     0.16.0                   pypi_0    pypi
wcwidth                   0.2.10                   pypi_0    pypi
werkzeug                  3.0.1                    pypi_0    pypi
wheel                     0.41.2           py39h06a4308_0  
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.4.2                h5eee18b_0  
yarl                      1.9.2                    pypi_0    pypi
zipp                      3.17.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0  

And pip shows the following:

pip list
Package                 Version      Editable project location
----------------------- ------------ ------------------------------------------------------------------------------------------
absl-py                 2.0.0
accelerate              0.24.1
aiohttp                 3.8.6
aiosignal               1.3.1
annotated-types         0.6.0
antlr4-python3-runtime  4.9.3
anyio                   3.7.1
appdirs                 1.4.4
asttokens               2.4.1
async-timeout           4.0.3
attrs                   23.1.0
babyai                  0.1.0        /data/disk1/share/gbonetta/progetti/Grounding_LLMs_with_online_RL/babyai-text/babyai
babyai-text             0.1.0        /data/disk1/share/gbonetta/progetti/Grounding_LLMs_with_online_RL/babyai-text
blosc                   1.11.1
Brotli                  1.0.9
cachetools              5.3.2
certifi                 2023.7.22
cffi                    1.15.1
charset-normalizer      2.0.4
click                   8.1.7
cloudpickle             3.0.0
colorama                0.4.6
comm                    0.2.0
contourpy               1.2.0
cryptography            41.0.3
cycler                  0.12.1
datasets                2.15.0
debugpy                 1.8.0
decorator               5.1.1
dill                    0.3.7
distro                  1.8.0
docker-pycreds          0.4.0
exceptiongroup          1.1.3
executing               2.0.1
filelock                3.13.1
fonttools               4.44.3
frozenlist              1.4.0
fsspec                  2023.10.0
gitdb                   4.0.11
GitPython               3.1.40
google-auth             2.23.4
google-auth-oauthlib    0.4.6
grpcio                  1.59.2
gym                     0.26.1
gym-minigrid            1.0.1        /data/disk1/share/gbonetta/progetti/Grounding_LLMs_with_online_RL/babyai-text/gym-minigrid
gym-notices             0.0.8
h11                     0.14.0
httpcore                1.0.2
httpx                   0.25.1
huggingface-hub         0.19.3
hydra-core              1.3.2
idna                    3.4
imageio                 2.32.0
importlib-metadata      6.8.0
importlib-resources     6.1.1
ipykernel               6.26.0
ipython                 8.17.2
jedi                    0.19.1
jupyter_client          8.6.0
jupyter_core            5.5.0
kiwisolver              1.4.5
lamorel                 0.1          /data/disk1/share/gbonetta/progetti/lamorel/lamorel/src
Markdown                3.5.1
MarkupSafe              2.1.3
matplotlib              3.8.1
matplotlib-inline       0.1.6
mkl-fft                 1.3.8
mkl-random              1.2.4
mkl-service             2.4.0
multidict               6.0.4
multiprocess            0.70.15
nest-asyncio            1.5.8
numpy                   1.26.0
oauthlib                3.2.2
omegaconf               2.3.0
openai                  1.3.0
packaging               23.2
pandas                  2.1.3
parso                   0.8.3
pexpect                 4.8.0
Pillow                  10.0.1
pip                     23.3
platformdirs            4.0.0
prompt-toolkit          3.0.41
protobuf                3.20.3
psutil                  5.9.6
ptyprocess              0.7.0
pure-eval               0.2.2
pyarrow                 14.0.1
pyarrow-hotfix          0.5
pyasn1                  0.5.0
pyasn1-modules          0.3.0
pycparser               2.21
pydantic                2.5.1
pydantic_core           2.14.3
Pygments                2.16.1
pyOpenSSL               23.2.0
pyparsing               3.1.1
PySocks                 1.7.1
python-dateutil         2.8.2
pytz                    2023.3.post1
PyYAML                  6.0.1
pyzmq                   25.1.1
regex                   2023.10.3
requests                2.31.0
requests-oauthlib       1.3.1
rsa                     4.9
safetensors             0.4.0
scipy                   1.11.3
sentencepiece           0.1.99
sentry-sdk              1.35.0
setproctitle            1.3.3
setuptools              68.0.0
six                     1.16.0
smmap                   5.0.1
sniffio                 1.3.0
stack-data              0.6.3
tensorboard             2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorboardX            1.8
termcolor               2.3.0
tokenizers              0.15.0
torch                   1.12.1
torchaudio              0.12.1
torchvision             0.13.1
tornado                 6.3.3
tqdm                    4.64.0
traitlets               5.13.0
transformers            4.35.2
typing_extensions       4.7.1
tzdata                  2023.3
urllib3                 1.26.18
wandb                   0.16.0
wcwidth                 0.2.10
Werkzeug                3.0.1
wheel                   0.41.2
xxhash                  3.4.1
yarl                    1.9.2
zipp                    3.17.0

I am using Python 3.9.18.

The configuration I am using in local_gpu_config.yaml:

lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: ../configs/accelerate/default_config.yaml
    machine_rank: 0
    main_process_ip: 127.0.0.1
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 192
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  name_environment: 'BabyAI-GoToRedBall-v0'
  epochs: 2
  steps_per_epoch: 128
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: ???

Changing machine_rank does not seem to make any difference.
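The dotted arguments on the command line (for example lamorel_args.llm_args.model_path=t5-small) address nested keys of this configuration. The sketch below illustrates that mapping on a plain dictionary. It is not Hydra's implementation: `apply_override` is a hypothetical helper, values are kept as strings, and the paths are placeholders rather than the ones from the report.

```python
def apply_override(config, override):
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    dotted_key, _, value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Descend into the nested section, creating it if it is missing.
        node = node.setdefault(name, {})
    node[leaf] = value  # kept as a string; Hydra itself also parses types
    return config


# Trimmed-down stand-in for the configuration, with the mandatory '???' values.
config = {
    "lamorel_args": {
        "accelerate_args": {"machine_rank": 0},
        "llm_args": {"model_path": "t5-small"},
    },
    "rl_script_args": {"path": "???", "output_dir": "???"},
}

overrides = [
    "rl_script_args.path=/path/to/main.py",        # placeholder path
    "rl_script_args.output_dir=/path/to/outputs",  # placeholder path
    "lamorel_args.accelerate_args.machine_rank=0",
    "lamorel_args.llm_args.model_path=t5-small",
]
for item in overrides:
    apply_override(config, item)

print(config)
```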

Do you have some suggestion on what might be happening? Thank you!

ClementRomac commented 11 months ago

Hi,

It seems your pytorch version is pretty old. Could you try upgrading it? I will update the dependencies in setup.py.
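Before and after an upgrade, the installed versions can be read from the package metadata. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is just the three libraries that appear in the traceback and the package listings above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["torch", "accelerate", "transformers"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```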

ewanlee commented 11 months ago

Hello! Thank you very much for open-sourcing this project, it has been extremely helpful for me!

I encountered the same issue: ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'. My PyTorch version is 2.1.1, and the CUDA version is 11.8.

In addition, when I directly import accelerate in IPython and run accelerate.utils.get_max_memory(), I can get normal return results.

Is it possible that there is a strange conflict with the accelerate package during execution?
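The thread does not state the root cause. One way a message like "Device 0 is not recognized" can arise is when a device key prints as 0 but is not a plain Python int, for example a string or a NumPy integer. That is an assumption here, not something confirmed in this issue or in the PR. Under that assumption, the sketch below normalizes the keys of a max-memory style mapping before handing it to a library that accepts only plain integers and the names listed in the message. `normalize_device_key` and `IntegerLike` are hypothetical names used for illustration.

```python
import operator

NAMED_DEVICES = ("mps", "cpu", "disk")  # names listed in the error message


def normalize_device_key(key):
    """Return a plain int for integer-like keys, or the name of a named device."""
    if isinstance(key, str):
        if key in NAMED_DEVICES:
            return key
        if key.isdigit():
            return int(key)
        raise ValueError(f"Unrecognized device key: {key!r}")
    # operator.index accepts any integer-like object (one defining __index__)
    # and raises TypeError for floats and other non-integers.
    return int(operator.index(key))


class IntegerLike:
    """Stand-in for a non-int integer type such as a NumPy integer."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value

    def __repr__(self):
        return str(self._value)


# The first key prints as 0 but is not an instance of int.
raw = {IntegerLike(0): "10GiB", "cpu": "30GiB"}
clean = {normalize_device_key(key): value for key, value in raw.items()}
print(clean)
```

Checking `type(key)` for each key of the mapping that reaches the failing call is a cheap way to confirm or rule out this hypothesis.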

ClementRomac commented 11 months ago

Hi,

I managed to reproduce it locally and fixed it in this PR. Please let me know if the PR also works for you before I merge it to the main branch.

giobin commented 11 months ago

Hi,

I tried it out and it works now! Thanks

ClementRomac commented 11 months ago

Awesome, merging the PR and closing the issue!

ewanlee commented 11 months ago

This also works for me! Thank you very much :)