flowersteam / Grounding_LLMs_with_online_RL

We perform functional grounding of LLMs' knowledge in BabyAI-Text
MIT License
215 stars 24 forks

Distribution Calculate #4

Closed CRLqinliang closed 1 year ago

CRLqinliang commented 1 year ago

Hi, since I don't know how to use Slurm on my server, I just run launch.py on my PC (8 GPUs). I set local_gpu_config.yaml as follows:

lamorel_args:
  log_level: debug
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 2
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 3
  llm_args:
    model_type: seq2seq
    model_path: t5-large
    pretrained: true
    minibatch_size: 60
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 2
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
rl_script_args:
  path: ???
  seed: 1
  number_envs: 2  # PPO 2
  num_steps: 1000
  max_episode_steps: 3
  simplification_str: easy
  frames_per_proc: 40
  reward_shaping_beta: 0
  discount: 0.99
  lr: 1e-6
  beta1: 0.9
  beta2: 0.999
  gae_lambda: 0.99
  entropy_coef: 0.01
  value_loss_coef: 0.5
  max_grad_norm: 0.5
  adam_eps: 1e-5
  clip_eps: 0.2
  epochs: 40
  batch_size: 16
  action_space: ["turn_left","turn_right","go_forward","pick_up","drop","toggle"]
  saving_path_logs: ???
  name_experiment: 'llm_mtrl'
  name_model: 'T5small'
  saving_path_model: ???
  name_environment: 'BabyAI-MixedTestLocal-v0'
  number_episodes: 10
  language: 'english'
  load_embedding: true
  use_action_heads: false
  template_test: 1
  zero_shot: true
  modified_action_space: false
  new_action_space: #["rotate_left","rotate_right","move_ahead","take","release","switch"]
  spm_path: "YOUR_PATH_TO_PROJECT/experiments/agents/drrn/spm_models/unigram_8k.model"
  random_agent: true
  get_example_trajectories: false
  nbr_obs: 3
  im_learning: false
  im_path: ""
  bot: false

And default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config: { }
distributed_type: MULTI_GPU
fsdp_config: { }
machine_rank: 0
main_process_ip: 127.0.0.1
main_process_port: 12345
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
use_cpu: false

But I got the error below. (What's the matter? It seems that I cannot run the same LLM across two GPUs.)

Exception has occurred: RuntimeError
Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 216, in forward
    torch.repeat_interleave(result,
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 109, in _process_calls
    llm_results.append(self._model(**_call))
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 131, in run
    current_process_results = self._process_calls(calls_to_process)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in __init__
    self.run()
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 54, in __init__
    Server(
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 384, in main
    lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 491, in <module>
    main()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
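
For reference, a minimal standalone repro of this kind of device mismatch (illustrative only, not Lamorel's actual code; it needs a machine with at least 2 GPUs):

# Illustrative repro: torch.repeat_interleave uses index_select internally,
# so an index tensor living on a different GPU than the data raises exactly
# this RuntimeError.
import torch

result = torch.randn(4, 8, device="cuda:0")            # activations on GPU 0
repeats = torch.tensor([2, 2, 2, 2], device="cuda:1")  # repeat counts left on GPU 1
try:
    torch.repeat_interleave(result, repeats, dim=0)
except RuntimeError as e:
    print(e)  # Expected all tensors to be on the same device ...
# The usual fix pattern: move the index tensor to the data tensor's device.
out = torch.repeat_interleave(result, repeats.to(result.device), dim=0)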

Thank you so much.

ClementRomac commented 1 year ago

Hi,

Your configs seem good (note that you are only using 4 GPUs, with 2 LLMs on 2 GPUs each). It looks like a bug in Lamorel. Let me have a look at this.
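
For instance, counting GPUs for the config above (a quick sketch):

# Each of the 2 LLM processes spreads its model over model_parallelism_size = 2
# GPUs, so only 4 of the 8 GPUs hold LLM weights (plus 1 RL process).
n_llm_processes = 2
model_parallelism_size = 2
print(n_llm_processes * model_parallelism_size)  # 4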

CRLqinliang commented 1 year ago

Okay, thanks. By the way, I found another problem when I set local_gpu_config.yaml like this; it might be another bug in Lamorel.

lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 4
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 5
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 192
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
rl_script_args:

and default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config: { }
distributed_type: MULTI_GPU
fsdp_config: { }
machine_rank: 0
main_process_ip: 127.0.0.1
main_process_port: 12345
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 5
use_cpu: false

And I got:

list index out of range
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/dispatcher.py", line 93, in __dispatch_batches
    _call["candidates"] = [[call["candidates"][j][_idx] for _idx in _minibatches[_handler]]]
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/dispatcher.py", line 160, in dispatch
    _scattered_calls = self.__dispatch_batches(calls)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 130, in run
    calls_to_process = self._dispatcher.dispatch(method_calls)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in __init__
    self.run()
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 54, in __init__
    Server(
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 384, in main
    lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 491, in <module>
    main()
IndexError: list index out of range
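
If it helps, a hypothetical sketch of this failure mode (not Lamorel's actual dispatcher code): scattering a batch across more LLM handlers than there are items to dispatch can produce out-of-range indices.

# Hypothetical sketch, not Lamorel's code: a handler assigned an index the
# batch does not contain triggers "list index out of range".
def dispatch_batches(candidates, minibatches_per_handler):
    scattered = []
    for idxs in minibatches_per_handler:
        scattered.append([candidates[i] for i in idxs])
    return scattered

dispatch_batches(["a", "b"], [[0], [1], [2], [3]])  # 4 handlers, batch of 2 -> IndexError
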
CRLqinliang commented 1 year ago

And this is my GPUs configuration:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   32C    P8    31W / 260W |   7921MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:1C:00.0 Off |                  Off |
| 33%   33C    P8    27W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:1D:00.0 Off |                  Off |
| 35%   34C    P8    29W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     On   | 00000000:1E:00.0 Off |                  Off |
| 34%   35C    P8    32W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 6000     On   | 00000000:3D:00.0 Off |                  Off |
| 34%   30C    P8    31W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 6000     On   | 00000000:3F:00.0 Off |                  Off |
| 33%   33C    P8    28W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 6000     On   | 00000000:40:00.0 Off |                  Off |
| 34%   33C    P8    29W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 6000     On   | 00000000:41:00.0 Off |                  Off |
| 34%   32C    P8    34W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
ClementRomac commented 1 year ago

Okay, I'll investigate the other one too. How are you launching each process?

By the way, there is no need to change default_config.yaml; Lamorel overrides what's needed given your local_gpu_config.yaml.

CRLqinliang commented 1 year ago

Okay. Here is how I launch each process:

  1. Create a launch.json file in VS Code with the code below. (You can modify it depending on how many processes you are going to run; here, for example, you need 1 process for RL and 2 processes for the LLMs.)
  2. In VS Code's debug view, click the debug button for each "name"; it will automatically create the corresponding process.
  3. Make sure launch.py is the current file when you click the debug button (since the configuration runs the current file).
    {
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Debug Process 0",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--config-path","/home/jovyan/Grounding_LLMs_with_online_RL/experiments/configs",
                "--config-name","local_gpu_config.yaml",
                "rl_script_args.path=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py",
                "rl_script_args.saving_path_logs=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "rl_script_args.saving_path_model=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "lamorel_args.accelerate_args.machine_rank=0"
            ],
            "stopOnEntry": false,
            "showReturnValue": true
        },
        {
            "name": "Python: Debug Process 1",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [                
                "--config-path","/home/jovyan/Grounding_LLMs_with_online_RL/experiments/configs",
                "--config-name","local_gpu_config.yaml",
                "rl_script_args.path=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py",
                "rl_script_args.saving_path_logs=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "rl_script_args.saving_path_model=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "lamorel_args.accelerate_args.machine_rank=1"
        ],
            "stopOnEntry": false,
            "showReturnValue": true
        },
       { 
            "name": "Python: Debug Process 2",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [                
                "--config-path","/home/jovyan/Grounding_LLMs_with_online_RL/experiments/configs",
                "--config-name","local_gpu_config.yaml",
                "rl_script_args.path=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py",
                "rl_script_args.saving_path_logs=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "rl_script_args.saving_path_model=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "lamorel_args.accelerate_args.machine_rank=2"
        ],
            "stopOnEntry": false,
            "showReturnValue": true
        }
    ]
}
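
The three configurations differ only in machine_rank: with n_rl_processes: 1 and n_llm_processes: 2, the same entry point is launched three times. A rough counting sketch (how Lamorel maps ranks to roles internally is an assumption here):

# 1 RL process + 2 LLM processes => 3 launches, one per machine_rank.
n_rl_processes, n_llm_processes = 1, 2
for rank in range(n_rl_processes + n_llm_processes):
    print(f"debug config with lamorel_args.accelerate_args.machine_rank={rank}")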
ClementRomac commented 1 year ago

Thank you, your launching process seems good at first sight. This pull request (https://github.com/flowersteam/lamorel/pull/9) should fix your first issue (it has been merged on the main branch of Lamorel).

CRLqinliang commented 1 year ago

... It doesn't seem to work. I created a new empty folder and cloned the latest code from GitHub, and this is my configuration for local_gpu_config.yaml:

lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 2
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 3
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    pre_encode_inputs: true
    minibatch_size: 4
    parallelism:
      use_gpu: true
      model_parallelism_size: 2
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
rl_script_args:
  path: ???

It fails like this; it seems there is a bug with the _module_functions keys: there is no "__score" key, only "score".

Exception has occurred: TypeError
expected Tensor as element 0 in argument 0, but got NoneType
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 269, in forward
    batch_results[k] = torch.cat(batch_results[k])
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 109, in _process_calls
    llm_results.append(self._model(**_call))
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 131, in run
    current_process_results = self._process_calls(calls_to_process)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in __init__
    self.run()
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 54, in __init__
    Server(
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 401, in main
    lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 508, in <module>
    main()
TypeError: expected Tensor as element 0 in argument 0, but got NoneType
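
For what it's worth, a minimal repro of that TypeError (illustrative): if a module function's per-minibatch results are never produced (e.g. because of a key mismatch such as "__score" vs "score"), the collected list holds None and torch.cat fails.

import torch

# batch_results[k] is expected to hold one Tensor per minibatch; a key
# mismatch can leave None entries behind instead.
batch_results = {"__score": [None]}
torch.cat(batch_results["__score"])  # TypeError: expected Tensor as element 0 ...
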
ClementRomac commented 1 year ago

I haven't managed to reproduce your errors so far but spotted other issues. Some were fixed in Lamorel (https://github.com/flowersteam/lamorel/tree/fix_pre_encode_inputs), others in this repo (https://github.com/flowersteam/Grounding_LLMs_with_online_RL/tree/matching_new_lamorel_api). With these, I can run experiments without any problem.

Could you please try these?

CRLqinliang commented 1 year ago

Yep, it runs successfully after I added the "unsqueeze(0)" (which you can find in the commit request).
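
Roughly, the shape issue looks like this (hypothetical shapes):

import torch

# torch.cat joins along dim 0, so 1-D per-minibatch score tensors need a
# leading batch dimension before concatenation.
scores = [torch.randn(6) for _ in range(3)]            # 6 action scores per minibatch
stacked = torch.cat([s.unsqueeze(0) for s in scores])  # shape (3, 6) instead of (18,)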

ClementRomac commented 1 year ago

Great! I can't see any request though?

CRLqinliang commented 1 year ago

Sorry, I changed it back and it works well now. By the way, could you please update the README file of this project? This project is really useful for RL researchers like me, and I really want to use this code in my next paper. (For example, how to understand the rl_script_args parameters, like zero_shot / template_test.) Thanks!

ClementRomac commented 1 year ago

Perfect! Sure. Let me close this issue and open one on README improvements.