Closed: giobin closed this issue 11 months ago.
Hi,
It seems your PyTorch version is pretty old. Could you try upgrading it? I will update the dependencies in setup.py.
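If it helps, a quick way to confirm which versions are actually being picked up in the environment the launcher runs in (just a sketch, nothing specific to lamorel):

```python
# Quick version check, run in the same environment as the launcher.
import torch
import accelerate

print("torch:", torch.__version__)
print("torch CUDA build:", torch.version.cuda)   # CUDA version torch was compiled against
print("accelerate:", accelerate.__version__)
```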
Hello! Thank you very much for open-sourcing this project, it has been extremely helpful for me!
I encountered the same issue: `ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'`. My PyTorch version is 2.1.1 and my CUDA version is 11.8.
In addition, when I directly import `accelerate` in IPython and run `accelerate.utils.get_max_memory()`, I get normal return values.
Is it possible that there is a strange conflict with the `accelerate` package during execution?
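For what it's worth, that message comes from accelerate's device/memory validation, so one plausible cause is a GPU key arriving as the string "0" rather than the integer 0 somewhere along the way. A small diagnostic sketch to inspect the keys and their types (an assumption about where to look, not lamorel's code):

```python
# Diagnostic sketch: print the devices reported by accelerate and the type of each key.
# If GPU keys show up as strings rather than ints, that would be consistent with the
# "Device 0 is not recognized" ValueError above (an assumption, not a confirmed cause).
from accelerate.utils import get_max_memory

for device, mem in get_max_memory().items():
    print(repr(device), type(device).__name__, mem)
```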
Hi,
I managed to reproduce it locally and fixed it in this PR. Please let me know if the PR also works for you before I merge it to the main branch.
Hi,
I tried it out and it works now! Thanks
Awesome, merging the PR and closing the issue!
Hello! First of all, very nice work!
I have an issue running the PPO_finetuning example: it seems it doesn't recognize the GPU device.
I'm running on this setup:
My command is the following:
python -m lamorel_launcher.launch --config-path /data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/ --config-name local_gpu_config rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=t5-small
and this is the error:
My conda env contains the following packages:
While my pip shows the following:
and I am using Python 3.9.18.
The configuration I am using in local_gpu_config.yaml:
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: ../configs/accelerate/default_config.yaml
    machine_rank: 0
    main_process_ip: 127.0.0.1
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 192
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  name_environment: 'BabyAI-GoToRedBall-v0'
  epochs: 2
  steps_per_epoch: 128
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: ???
In any case, it seems irrelevant if I change the machine_rank. Do you have any suggestions on what might be happening? Thank you!
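For reference, a minimal sanity check that PyTorch and accelerate actually see the GPU in this conda env, run outside of lamorel (a sketch assuming the standard APIs):

```python
# Sanity-check sketch: verify GPU visibility in this environment (e.g. from IPython).
import torch
from accelerate import Accelerator

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())

accelerator = Accelerator()                        # single local process by default
print("Accelerator device:", accelerator.device)   # expect cuda:0 if the GPU is visible
```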