argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

Receiving error: The number of required GPUs exceeds the total number of available GPUs in the placement group #1044

Open · saurabhbbjain opened this issue 4 days ago

saurabhbbjain commented 4 days ago

I am executing the ifeval_like_data.py file with 8 A100 GPUs and receiving the following error:

```
[10/21/24 06:53:04] ERROR ['distilabel.pipeline'] ❌ Failed to load step 'i_f_eval_kwargs_assignator_0': Step load failed: The number of required GPUs exceeds the total number of available GPUs in the placement group.  local.py:302
                    For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'
[10/21/24 06:53:05] ERROR ['distilabel.pipeline'] ❌ Failed to load step 'i_f_eval_instruction_id_list_assignator_0': Step load failed: The number of required GPUs exceeds the total number of available GPUs in the placement group.  local.py:302
                    For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'
                    ERROR ['distilabel.pipeline'] ❌ Failed to load step 'magpie_generator_0': Step load failed: The number of required GPUs exceeds the total number of available GPUs in the placement group.  local.py:302
                    For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'
                    ERROR ['distilabel.pipeline'] ❌ Failed to load all the steps of stage 0  base.py:1201

[2024-10-21 06:53:05,994 E 262 262] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 62 ***
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440: PC: @ 0x5a9437  (unknown)  _PyEval_EvalFrameDefault
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440:     @ 0x7ffff7e0f090  (unknown)  (unknown)
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440:     @ ... and at least 3 more frames
[2024-10-21 06:53:05,994 E 260 260] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 126 ***
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440: PC: @ 0x5a96dc  (unknown)  _PyEval_EvalFrameDefault
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440:     @ 0x7ffff7e0f090  (unknown)  (unknown)
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440:     @ ... and at least 4 more frames
[2024-10-21 06:53:06,000 E 261 261] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 195 ***
[2024-10-21 06:53:06,000 E 261 261] logging.cc:440: PC: @ 0x5f9269  (unknown)  _PyObject_GetMethod
[2024-10-21 06:53:06,004 E 261 261] logging.cc:440:     @ 0x7ffff7e0f090  72985216  (unknown)
[2024-10-21 06:53:06,009 E 261 261] logging.cc:440:     @ 0x94eca0  (unknown)  (unknown)

─────────────────────────────────── locals ───────────────────────────────────
  dataset             = None
  distiset            = None
  logging_handlers    = None
  manager             = <multiprocessing.managers.SyncManager object at 0x7ffe41227f40>
  num_processes       = 3
  parameters          = None
  pool                = <distilabel.pipeline.local._NoDaemonPool state=TERMINATE pool_size=3>
  self                = <distilabel.pipeline.local.Pipeline object at 0x7ffe46a00df0>
  storage_parameters  = None
  use_cache           = False
  use_fs_to_pass_data = False
──────────────────────────────────────────────────────────────────────────────

RuntimeError: Failed to load all the steps. Could not run pipeline.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 980, in _bootstrap_inner
```

I am not able to figure out why I am receiving this error despite providing 8 GPUs. I am using the Llama-3.2-1B-Instruct model.

plaguss commented 3 days ago

Hi @saurabhbbjain, how are you running the pipeline? It's a pipeline that uses Ray; we normally run these on a Slurm cluster that manages Ray, as can be seen here in the docs. I'll let @gabrielmbmb answer in case he has access to how the pipeline was run.

gabrielmbmb commented 3 days ago

Hi @saurabhbbjain, the original code uses 8 GPUs per step because we're using vLLM with `tensor_parallel_size=8`, and, as @plaguss mentions, it's also using the `RayPipeline`, which you don't need in a single-machine setup. I've updated the pipeline to work with your setup (I haven't tested it):

  1. We remove the `.ray()` call to use the `Pipeline` instead of the `RayPipeline`.
  2. We update the `tensor_parallel_size` in all the vLLMs of the pipeline: `MagpieGenerator` will use 4 GPUs and the other two steps 2 GPUs each.

https://gist.github.com/gabrielmbmb/2df9a1041a649783efb3c3cf0ffb1376
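
For anyone reading along, here is a rough, untested sketch of what those two changes amount to. It is not the gist above: the specific arguments (`num_rows`, the Magpie template settings, the model id) are illustrative assumptions, and the two assignator tasks defined in `ifeval_like_data.py` are only referenced in comments.

```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MagpieGenerator

MODEL = "meta-llama/Llama-3.2-1B-Instruct"

# 1. A plain `Pipeline` (no `.ray()`), so no Ray placement group is involved.
with Pipeline(name="ifeval-like-dataset") as pipeline:
    # 2. Split the 8 GPUs across steps instead of requesting 8 per step:
    #    4 GPUs for the Magpie generator...
    magpie = MagpieGenerator(
        llm=vLLM(
            model=MODEL,
            use_magpie_template=True,
            magpie_pre_query_template="llama3",
            extra_kwargs={"tensor_parallel_size": 4},  # passed through to vllm.LLM
        ),
        num_rows=100,  # illustrative value
    )
    # ...and 2 GPUs each for the two assignator tasks from the example
    # (omitted here); their vLLM instances would use
    # extra_kwargs={"tensor_parallel_size": 2}, connected as in the original
    # script.

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
```

With this split the three vLLM instances request 4 + 2 + 2 = 8 GPUs in total, which fits the available hardware, whereas the original `tensor_parallel_size=8` on every step asked for 24.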