Hi,
Your configs seem good (you are only using 4 of your GPUs though, with 2 LLMs on 2 GPUs each). It looks like a bug in Lamorel. Let me have a look at this.
Okay, thanks. By the way, I found another problem when I set local_gpu_config.yaml like this; it might be another bug in Lamorel:
```yaml
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 4
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 5
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 192
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
rl_script_args:
```
and here is default_config.yaml:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: 127.0.0.1
main_process_port: 12345
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 5
use_cpu: false
```
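If I do the accounting on these configs myself, the numbers seem to line up (a rough sanity-check sketch in plain Python, not Lamorel's actual allocation code; the variable names are just mine):

```python
# Sanity check of the process/GPU accounting implied by the configs above
# (illustrative only; Lamorel's launcher does the real allocation).
n_rl_processes = 1
n_llm_processes = 4
model_parallelism_size = 1   # GPUs per LLM process

total_processes = n_rl_processes + n_llm_processes        # 5, matching num_machines/num_processes
gpus_for_llms = n_llm_processes * model_parallelism_size  # 4 of my 8 GPUs

print(f"{total_processes} processes, {gpus_for_llms} GPUs for LLMs")
```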
Launching with these configs, I get:

```
list index out of range
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/dispatcher.py", line 93, in __dispatch_batches
    _call["candidates"] = [[call["candidates"][j][_idx] for _idx in _minibatches[_handler]]]
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/dispatcher.py", line 160, in dispatch
    _scattered_calls = self.__dispatch_batches(calls)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 130, in run
    calls_to_process = self._dispatcher.dispatch(method_calls)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in __init__
    self.run()
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 54, in __init__
    Server(
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 384, in main
    lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 491, in <module>
    main()
IndexError: list index out of range
```
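I haven't dug into the dispatcher, but the failing line looks like it scatters each call's candidates over per-handler minibatches, so an index set larger than the candidate list would explain the crash. A toy illustration (the indices and structure here are made up, not what Lamorel actually computes):

```python
# Made-up illustration of how scattering a short candidate list across
# several handlers can index out of range (not Lamorel's real data).
calls = [{"candidates": [["turn left", "turn right"]]}]  # one call, two candidates

minibatches = {0: [0, 1], 1: [2, 3]}  # handler 1 asks for items 2 and 3, which don't exist

for handler, indices in minibatches.items():
    scattered = [[calls[0]["candidates"][0][i] for i in indices]]  # IndexError on handler 1
```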
And this is my GPU configuration:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   32C    P8    31W / 260W |   7921MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:1C:00.0 Off |                  Off |
| 33%   33C    P8    27W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:1D:00.0 Off |                  Off |
| 35%   34C    P8    29W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     On   | 00000000:1E:00.0 Off |                  Off |
| 34%   35C    P8    32W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 6000     On   | 00000000:3D:00.0 Off |                  Off |
| 34%   30C    P8    31W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 6000     On   | 00000000:3F:00.0 Off |                  Off |
| 33%   33C    P8    28W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 6000     On   | 00000000:40:00.0 Off |                  Off |
| 34%   33C    P8    29W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 6000     On   | 00000000:41:00.0 Off |                  Off |
| 34%   32C    P8    34W / 260W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
Okay, I'll investigate the other one too. How are you launching each process?
By the way, there is no need to change the default_config.yml; Lamorel overrides what's needed given your local_gpu_config.yml.
Okay. Here is how I launch each process:
```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Debug Process 0",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--config-path", "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/configs",
                "--config-name", "local_gpu_config.yaml",
                "rl_script_args.path=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py",
                "rl_script_args.saving_path_logs=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "rl_script_args.saving_path_model=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "lamorel_args.accelerate_args.machine_rank=0"
            ],
            "stopOnEntry": false,
            "showReturnValue": true
        },
        {
            "name": "Python: Debug Process 1",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--config-path", "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/configs",
                "--config-name", "local_gpu_config.yaml",
                "rl_script_args.path=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py",
                "rl_script_args.saving_path_logs=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "rl_script_args.saving_path_model=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "lamorel_args.accelerate_args.machine_rank=1"
            ],
            "stopOnEntry": false,
            "showReturnValue": true
        },
        {
            "name": "Python: Debug Process 2",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--config-path", "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/configs",
                "--config-name", "local_gpu_config.yaml",
                "rl_script_args.path=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py",
                "rl_script_args.saving_path_logs=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "rl_script_args.saving_path_model=/home/jovyan/Grounding_LLMs_with_online_RL/experiments/output",
                "lamorel_args.accelerate_args.machine_rank=2"
            ],
            "stopOnEntry": false,
            "showReturnValue": true
        }
    ]
}
```
Thank you, your launching process seems good at first sight. This pull request (https://github.com/flowersteam/lamorel/pull/9) should fix your first issue (it has been merged into the main branch of Lamorel).
... It doesn't seem to work. I created a new empty folder and cloned the latest code from GitHub. This is my configuration for local_gpu_config.yaml:
```yaml
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 2
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 3
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    pre_encode_inputs: true
    minibatch_size: 4
    parallelism:
      use_gpu: true
      model_parallelism_size: 2
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
rl_script_args:
  path: ???
```
It fails like this; it seems there is a bug with the `_module_functions` keys: there is no `"__score"` key, only `"score"`.
```
Exception has occurred: TypeError
expected Tensor as element 0 in argument 0, but got NoneType
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 269, in forward
    batch_results[k] = torch.cat(batch_results[k])
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 109, in _process_calls
    llm_results.append(self._model(**_call))
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 131, in run
    current_process_results = self._process_calls(calls_to_process)
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in __init__
    self.run()
  File "/home/jovyan/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 54, in __init__
    Server(
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 401, in main
    lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
  File "/home/jovyan/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 508, in <module>
    main()
TypeError: expected Tensor as element 0 in argument 0, but got NoneType
```
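For what it's worth, the TypeError itself is just what torch.cat raises when one of the collected results is None, which would happen if a module-function key never matches and its slot is never filled. A minimal illustration (my guess at the mechanism, not a confirmed diagnosis):

```python
import torch

# If results are collected under a key that never matches (e.g. "__score"
# vs "score"), a slot can stay None and torch.cat then fails:
batch_results = {"__score": [None, torch.tensor([0.1, 0.2])]}
torch.cat(batch_results["__score"])
# TypeError: expected Tensor as element 0 in argument 0, but got NoneType
```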
I haven't managed to reproduce your errors so far, but I spotted other issues. Some were fixed in Lamorel (https://github.com/flowersteam/lamorel/tree/fix_pre_encode_inputs), others in this repo (https://github.com/flowersteam/Grounding_LLMs_with_online_RL/tree/matching_new_lamorel_api). With these, I can run experiments without any problem.
Could you please try these?
Yep, it runs successfully after I added the `unsqueeze(0)` (which you can find in the commit request).
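In case it helps others: my understanding is that torch.cat cannot concatenate zero-dimensional tensors, so unsqueezing each per-candidate score to shape (1,) before concatenation makes it work (again, just my reading of it):

```python
import torch

scores = [torch.tensor(0.5), torch.tensor(0.7)]        # 0-dim tensors
# torch.cat(scores)                                    # RuntimeError: zero-dimensional tensors cannot be concatenated
stacked = torch.cat([s.unsqueeze(0) for s in scores])  # shape (2,), works
print(stacked)                                         # tensor([0.5000, 0.7000])
```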
Great! I can't see any request though?
Sorry, I changed it back; it works well now. By the way, could you please update the README file of this project? This project is really good for RL researchers like me, and I really want to use this code in my next paper. (For example, how to understand the rl_script_args parameters, like zero-shot / template_test..) Thx!
Perfect! Sure. Let me close this issue and open one on README improvements.
Hi, since I don't know how to use Slurm on my server, I just run launch.py on my PC (8 GPUs). I set local_gpu_config.yaml and default_config.yaml, but I got this: (What's the matter? It seems that I cannot run the same LLM on two GPUs.)
Thank you so much.