Open · Josh00-Lu opened this issue 1 year ago
I have solved that problem, but I have run into another one :( Now I only use

python -m lamorel_launcher.launch --config-path Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/configs --config-name local_gpu_config rl_script_args.path=Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py lamorel_args.accelerate_args.machine_rank=0

as you mentioned in https://github.com/flowersteam/lamorel/issues/23#issuecomment-1790381781, with local_gpu_config.yaml as follows:
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 4
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  seed: 1
  number_envs: 2
  num_steps: 1000
  max_episode_steps: 3
  frames_per_proc: 40
  reward_shaping_beta: 0
  discount: 0.99
  lr: 1e-6
  beta1: 0.9
  beta2: 0.999
  gae_lambda: 0.99
  entropy_coef: 0.01
  value_loss_coef: 0.5
  max_grad_norm: 0.5
  adam_eps: 1e-5
  clip_eps: 0.2
  epochs: 4
  batch_size: 16
  action_space: ["turn_left","turn_right","go_forward","pick_up","drop","toggle"]
  saving_path_logs: Desktop/workspace2/Grounding_LLMs_with_online_RL/logs
  name_experiment: 'llm_mtrl'
  name_model: 'T5small'
  saving_path_model: Desktop/workspace2/Grounding_LLMs_with_online_RL/model
  name_environment: 'BabyAI-KeyCorridorS3R3-v0'
  number_episodes: 10
  language: 'english'
  load_embedding: true
  use_action_heads: false
  template_test: 1
  zero_shot: true
  modified_action_space: false
  new_action_space: #["rotate_left","rotate_right","move_ahead","take","release","switch"]
  spm_path: "YOUR_PATH_TO_PROJECT/experiments/agents/drrn/spm_models/unigram_8k.model"
  random_agent: true
  get_example_trajectories: false
  nbr_obs: 3
  im_learning: false
  im_path: ""
  bot: false
It returns:
[2023-11-11 22:26:56,396][lamorel_logger][INFO] - Init rl-llm group for process 1
[2023-11-11 22:26:56,396][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-11 22:26:56,396][lamorel_logger][INFO] - Init rl-llm group for process 0
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 1
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 0
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-11 22:26:56,408][lamorel_logger][INFO] - 6 gpus available for current LLM but using only model_parallelism_size = 1
[2023-11-11 22:26:56,409][lamorel_logger][INFO] - Devices on process 1 (index 0): [0]
Parallelising HF LLM on 1 devices
Loading model t5-small
Error executing job with overrides: ['rl_script_args.path=~/Desktop/workspace2/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py', 'lamorel_args.accelerate_args.machine_rank=0']
Traceback (most recent call last):
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 393, in main
lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 53, in __init__
Server(
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 40, in __init__
self._model = HF_LLM(config.llm_args, devices, use_cpu)
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 38, in __init__
device_map = infer_auto_device_map(
File "~/miniconda3/envs/dlp/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 923, in infer_auto_device_map
max_memory = get_max_memory(max_memory)
File "~/miniconda3/envs/dlp/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 674, in get_max_memory
raise ValueError(
ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'
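For reference, here is a minimal standalone sketch (not lamorel code) that exercises the same Accelerate utilities shown in the traceback; the t5-small model and the memory budgets are only illustrative. If this raises the same ValueError, the problem lies in the torch/Accelerate installation rather than in lamorel:

# Standalone sketch of the Accelerate code path from the traceback
# (get_max_memory / infer_auto_device_map), run outside of lamorel.
from transformers import AutoModelForSeq2SeqLM
from accelerate.utils import get_max_memory, infer_auto_device_map

# The returned dict should be keyed by integer GPU indices plus 'cpu'.
print(get_max_memory())

# Explicit max_memory with an integer key for GPU 0 (budgets are illustrative).
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print(infer_auto_device_map(model, max_memory={0: "4GiB", "cpu": "16GiB"}))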
Hi,
What is your version of Accelerate? The passed device isn't recognized, which is weird.
Please see https://github.com/flowersteam/lamorel/issues/24, as it seems to be due to PyTorch's version.
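A quick way to report the relevant versions (a generic check, nothing lamorel-specific):

# Print the package versions relevant to this error.
import torch
import accelerate

print("torch:", torch.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available(), "| GPU count:", torch.cuda.device_count())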
Hello, in which directory is the file "lamorel_launcher.launch"? I checked the "lamorel_launcher" folder in lamorel, yet I can't find it. Thanks!
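Not an official answer, but a generic way to locate the package that `python -m lamorel_launcher.launch` resolves to, assuming lamorel is pip-installed in the environment you launch from:

# Locate the installed lamorel_launcher package.
import importlib.util

spec = importlib.util.find_spec("lamorel_launcher")
print(spec.origin if spec else "lamorel_launcher is not importable from this environment")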
I'm using 6 GPUs on a single machine. This is my command:

It returns: ModuleNotFoundError: No module named 'experiments'
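In case it helps, one common cause of this kind of error is that the repository root (the directory containing experiments/) is not on the Python path of the launched processes. A minimal sketch of that fix, with a placeholder path, would be:

# Make the repository root importable before `import experiments ...` runs
# (the path below is a placeholder for the actual checkout location).
import sys
sys.path.insert(0, "/abs/path/to/Grounding_LLMs_with_online_RL")

import experiments  # should now resolve if the directory layout matches

Exporting PYTHONPATH to the repository root before running the launch command has the same effect.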