Open · Josh00-Lu opened this issue 1 year ago
I have solved that problem, but I have run into another one :( Now I only use

python -m lamorel_launcher.launch --config-path Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/configs --config-name local_gpu_config rl_script_args.path=Absolute/Path/To/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py lamorel_args.accelerate_args.machine_rank=0

as you mentioned in https://github.com/flowersteam/lamorel/issues/23#issuecomment-1790381781, with local_gpu_config.yaml as follows:
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 4
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  seed: 1
  number_envs: 2
  num_steps: 1000
  max_episode_steps: 3
  frames_per_proc: 40
  reward_shaping_beta: 0
  discount: 0.99
  lr: 1e-6
  beta1: 0.9
  beta2: 0.999
  gae_lambda: 0.99
  entropy_coef: 0.01
  value_loss_coef: 0.5
  max_grad_norm: 0.5
  adam_eps: 1e-5
  clip_eps: 0.2
  epochs: 4
  batch_size: 16
  action_space: ["turn_left","turn_right","go_forward","pick_up","drop","toggle"]
  saving_path_logs: Desktop/workspace2/Grounding_LLMs_with_online_RL/logs
  name_experiment: 'llm_mtrl'
  name_model: 'T5small'
  saving_path_model: Desktop/workspace2/Grounding_LLMs_with_online_RL/model
  name_environment: 'BabyAI-KeyCorridorS3R3-v0'
  number_episodes: 10
  language: 'english'
  load_embedding: true
  use_action_heads: false
  template_test: 1
  zero_shot: true
  modified_action_space: false
  new_action_space: #["rotate_left","rotate_right","move_ahead","take","release","switch"]
  spm_path: "YOUR_PATH_TO_PROJECT/experiments/agents/drrn/spm_models/unigram_8k.model"
  random_agent: true
  get_example_trajectories: false
  nbr_obs: 3
  im_learning: false
  im_path: ""
  bot: false
It returns:
[2023-11-11 22:26:56,396][lamorel_logger][INFO] - Init rl-llm group for process 1
[2023-11-11 22:26:56,396][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-11 22:26:56,396][lamorel_logger][INFO] - Init rl-llm group for process 0
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 1
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 0
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-11 22:26:56,407][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-11 22:26:56,408][lamorel_logger][INFO] - 6 gpus available for current LLM but using only model_parallelism_size = 1
[2023-11-11 22:26:56,409][lamorel_logger][INFO] - Devices on process 1 (index 0): [0]
Parallelising HF LLM on 1 devices
Loading model t5-small
Error executing job with overrides: ['rl_script_args.path=~/Desktop/workspace2/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py', 'lamorel_args.accelerate_args.machine_rank=0']
Traceback (most recent call last):
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 393, in main
lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 53, in __init__
Server(
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 40, in __init__
self._model = HF_LLM(config.llm_args, devices, use_cpu)
File "~/Desktop/workspace2/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 38, in __init__
device_map = infer_auto_device_map(
File "~/miniconda3/envs/dlp/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 923, in infer_auto_device_map
max_memory = get_max_memory(max_memory)
File "~/miniconda3/envs/dlp/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 674, in get_max_memory
raise ValueError(
ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'
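For reference, here is a minimal standalone sketch (not lamorel code) that exercises the same Accelerate utilities shown in the traceback; the t5-small model and the memory budgets are only illustrative. If this raises the same ValueError, the problem lies in the torch/Accelerate installation rather than in lamorel:

# Standalone sketch of the Accelerate code path from the traceback
# (get_max_memory / infer_auto_device_map), run outside of lamorel.
from transformers import AutoModelForSeq2SeqLM
from accelerate.utils import get_max_memory, infer_auto_device_map

# The returned dict should be keyed by integer GPU indices plus 'cpu'.
print(get_max_memory())

# Explicit max_memory with an integer key for GPU 0 (budgets are illustrative).
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print(infer_auto_device_map(model, max_memory={0: "4GiB", "cpu": "16GiB"}))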
Hi,
What is your version of Accelerate? The passed device isn't recognized, which is weird.
Please see https://github.com/flowersteam/lamorel/issues/24, as it seems to be due to PyTorch's version.
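A quick way to report the relevant versions (a generic check, nothing lamorel-specific):

# Print the package versions relevant to this error.
import torch
import accelerate

print("torch:", torch.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available(), "| GPU count:", torch.cuda.device_count())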
Hello, in which directory is the file "lamorel_launcher.launch"? I checked the "lamorel_launcher" folder in lamorel, yet I can't find it. Thanks!
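Not an official answer, but a generic way to locate the package that `python -m lamorel_launcher.launch` resolves to, assuming lamorel is pip-installed in the environment you launch from:

# Locate the installed lamorel_launcher package.
import importlib.util

spec = importlib.util.find_spec("lamorel_launcher")
print(spec.origin if spec else "lamorel_launcher is not importable from this environment")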
I'm using 6 GPUs on a single machine. This is my command:

It returns: ModuleNotFoundError: No module named 'experiments'
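In case it helps, one common cause of this kind of error is that the repository root (the directory containing experiments/) is not on the Python path of the launched processes. A minimal sketch of that fix, with a placeholder path, would be:

# Make the repository root importable before `import experiments ...` runs
# (the path below is a placeholder for the actual checkout location).
import sys
sys.path.insert(0, "/abs/path/to/Grounding_LLMs_with_online_RL")

import experiments  # should now resolve if the directory layout matches

Exporting PYTHONPATH to the repository root before running the launch command has the same effect.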