flowersteam / lamorel

Lamorel is a Python library designed for RL practitioners eager to use Large Language Models (LLMs).
MIT License

Connection error #23

Open yone456 opened 8 months ago

yone456 commented 8 months ago

Hello! I tried an experiment using the Llama 2 13B model and got a CONNECTION ERROR.

RL script

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

LLM server

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.machine_rank=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

The following error occurred when starting the LLM server after running it using the above command.

ConnectionError: Tried to launch distributed communication on port 30004, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

Could you please advise me on how to resolve the error?

ClementRomac commented 8 months ago

Hi,

I encounter the same error when I launch the RL script first. It appears there is a conflict of master processes when manually launching two processes on the same machine.

I will investigate this. In the meantime, there are two solutions for you:

1) Let torch launch the two processes: set num_machines: 1 in your config and only launch your RL script as you did in the example above (the lamorel_launcher will ask torch to launch the two processes); see the config sketch below.
2) Keep launching the two processes manually, but launch the LLM server first.
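For solution 1, the change only touches the accelerate section of your Hydra config. A minimal sketch, assuming your local_gpu_config.yaml follows the structure implied by the overrides above (only num_machines and machine_rank come from this thread; the other field is the port from the error and purely illustrative):

```yaml
lamorel_args:
  accelerate_args:
    num_machines: 1        # let torch spawn both the RL and LLM processes itself
    machine_rank: 0        # no second "machine" is launched manually
    main_process_port: 30004
```

You would then run only the RL-script launch command from above, without starting a separate LLM server.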

yone456 commented 8 months ago

Thanks for the advice; I was able to run it. Flan-T5 and other models now work, but Llama and Llama 2 do not. Will you support the Llama and Llama 2 models in the future?

ClementRomac commented 8 months ago

What is the matter with Llama?

For your information, I am currently working on adding a couple of things (along with several fixes).

With these improvements, I am able to run and train (with QLoRA) models like Llama2, OPT or Mistral.

These should arrive shortly here (in the coming weeks).

yone456 commented 8 months ago

That is great news for me. I am very much looking forward to those features.

By the way, I get the following error when I start Llama 2.

```
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  6.60it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:21<00:00,  7.31s/it]
Using pad_token, but it is not set yet.
trainable params: 19401729 || all params: 13035266049 || trainable%: 0.14884029928555545
Error executing job with overrides: ['lamorel_args.accelerate_args.machine_rank=1']
Traceback (most recent call last):
  File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 393, in <module>
    main()
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 255, in main
    lm_server = Caller(config_args.lamorel_args,
  File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 53, in __init__
    Server(
  File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 58, in __init__
    DDP(self._model, process_group=self._llm_group,
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 565, in __init__
    self._log_and_throw(
  File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
```

ClementRomac commented 7 months ago

Which model are you using exactly? Also, what are your versions of transformers and accelerate?

yone456 commented 7 months ago

I am using the following versions:

accelerate 0.21.0
transformers 4.33.0

The model used is Llama-2-13b-hf. https://huggingface.co/meta-llama/Llama-2-13b-hf

ClementRomac commented 7 months ago

OK, first, to give more details about your initial ConnectionError: it is Accelerate that checks whether the requested port is already in use. When a process with rank > 0 is launched first, the port is not yet in use (as it is launched first) AND torch distributed does not start anything on this port, since only the process with rank=0 launches the master process. So when you then launch the process with rank=0, the port is still free and everything runs smoothly. However, when you do the opposite, the process with rank=0 (which is launched first) starts the main process listening on the requested port, but Accelerate still checks, for the second process with rank > 0, that the port is free.
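To illustrate, the check Accelerate performs is essentially a "can I bind this port?" test. A minimal sketch of that kind of check (not Accelerate's actual code):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # Try to bind the port: if the bind fails, another process is already using it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return False
        except OSError:
            return True

# If the rank=0 "machine" was launched first, it is already listening on the
# master port (30004 here), so this check fails for the second "machine" even
# though that rank > 0 process never needed to open the port itself.
print(port_in_use(30004))
```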

I guess this check should take into account the rank of the current process. I haven't opened any issue yet, as manually launching two "machines" on the same machine isn't really a "normal" use case of Accelerate. So I would advise setting num_machines: 1.

Concerning Llama, this is surprising: it seems the piece of code putting the LLM's weights on a CUDA device is not working as expected, and your LLM is still on the fake 'meta' device when passed to DDP. Could you try upgrading Accelerate?
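If you want to check this on your side, you can inspect the device types of the model's parameters right before the DDP wrapping. A self-contained sketch (the toy model below just mimics the mixed 'meta' state; in your case it would be the loaded Llama checkpoint and the mix would be {'cuda', 'meta'}):

```python
from torch import nn

# Toy model with one layer materialized normally and one left on the fake
# 'meta' device, mimicking weights that were never loaded onto a real device.
model = nn.Sequential(
    nn.Linear(4, 4),                 # real parameters (CPU here)
    nn.Linear(4, 4, device="meta"),  # placeholder parameters with no storage
)

# DDP refuses to wrap a module whose parameters live on mixed device types.
print({p.device.type for p in model.parameters()})  # {'cpu', 'meta'}
```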

ClementRomac commented 7 months ago

It may also be related to your pytorch version. See https://github.com/flowersteam/lamorel/issues/24.

yone456 commented 7 months ago

Thanks to your advice, the error was avoided. Thank you very much.

Sorry, I have two questions. When using a decoder-only (causal) model like Llama 2, I get the following error in main.py of PPO_LoRA_finetuning. Is there any workaround for this error?

File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in init self.run() File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 131, in run current_process_results = self._process_calls(calls_to_process) File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 109, in _process_calls llm_results.append(self._model(_call)) File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, *kwargs) File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 285, in forward results = _fn(_outputs, File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(args, kwargs) File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 39, in forward raise NotImplementedError() NotImplementedError

Also, how can I do fine-tuning with multiple GPUs in PPO_LoRA_finetuning?

ClementRomac commented 7 months ago

Hi,

Decoder-Only support is part of the multiple changes I have to push. This update will be added in a PR tomorrow morning. Examples will also be slightly modified, so you may have to adapt your code.

Concerning multi-GPU: if you have set lamorel_args.llm_args.parallelism.use_gpu=true, you can set how many GPUs each LLM uses with lamorel_args.llm_args.parallelism.model_parallelism_size. For example, if you set lamorel_args.distributed_setup_args.n_llm_processes=1 and lamorel_args.llm_args.parallelism.model_parallelism_size=2, lamorel will deploy one LLM and expect at least 2 GPUs on your system to assign to it. If you set lamorel_args.distributed_setup_args.n_llm_processes=2, lamorel will deploy 2 LLMs and expect at least 4 GPUs (the first 2 assigned to the first LLM, the other two to the second LLM).
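As a config sketch, using only the keys mentioned above (the nesting simply mirrors the dotted paths; the values are the example numbers):

```yaml
lamorel_args:
  distributed_setup_args:
    n_llm_processes: 1            # one LLM server process
  llm_args:
    parallelism:
      use_gpu: true
      model_parallelism_size: 2   # each LLM is split across 2 GPUs
```

With n_llm_processes: 2 and model_parallelism_size: 2, lamorel would expect at least 4 GPUs, as described above.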

ClementRomac commented 7 months ago

Hi,

The Decoder-Only support has come at last! Here's the PR: https://github.com/flowersteam/lamorel/pull/26

It has been merged into the main branch. All examples have been modified. Let me know if you face any issue :)

yone456 commented 7 months ago

Thanks for the great update! I immediately tried it with Llama 2 and got the following error output, but I was able to avoid the error by setting device_map="auto".

ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
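For reference, a minimal sketch of the workaround (this is not lamorel's internal loading code, just the usual Hugging Face call with the weights dispatched to GPU(s) at load time instead of being left on 'meta'):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/xxx/llama/llama-2-13b",  # local checkpoint path used above
    device_map="auto",              # let accelerate place the weights on available devices
)
```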

Also, when I performed PPO_LoRA_finetuning on Llama 2, I got the following warning output. Is there any solution...?

[2023-11-22 15:06:46,565][root][WARNING] - PPO ratio != 1 !!

ClementRomac commented 7 months ago

I can't manage to reproduce your error when loading Llama 2... If you figure out what happens, let me know.

Concerning the warning, models that use Rotary PE (e.g. Llama 2, Mistral) are affected by padding: https://github.com/huggingface/transformers/issues/25921

As we are batching multiple transitions in the PPOUpdater (and using padding to do so), the log-probs differ from the ones obtained when collecting transitions. Unfortunately, I have no solution for now. I am currently trying to see whether I can make Mistral or Llama 2 converge despite this issue.
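For context, a minimal sketch of what the warning checks (tensor values and names are illustrative): on the first PPO epoch, the recomputed log-probs should match the ones stored at collection time, so the importance ratio should be exactly 1; padding-sensitive models break this equality.

```python
import torch

old_logprobs = torch.tensor([-2.31, -0.87])  # stored when collecting transitions
new_logprobs = torch.tensor([-2.29, -0.91])  # recomputed on padded mini-batches
ratio = torch.exp(new_logprobs - old_logprobs)

if not torch.allclose(ratio, torch.ones_like(ratio)):
    print("PPO ratio != 1 !!")  # the warning reported above
```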