saurabhbbjain opened 4 days ago
Hi @saurabhbbjain, how are you running the pipeline? It's a pipeline using Ray; we normally run it in a Slurm cluster that controls Ray, as can be seen here in the docs. I'll let @gabrielmbmb answer in case he knows how the pipeline was run.
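For context, the Slurm-managed Ray setup mentioned above usually follows the standard "Ray on Slurm" pattern: start a Ray head on one node, have the remaining nodes join as workers, then run the pipeline script against that cluster. This is a rough sketch of that general pattern, not the exact script from the distilabel docs; the job name, node counts, port, and `pipeline.py` filename are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=distilabel-pipeline   # placeholder values throughout
#SBATCH --nodes=2
#SBATCH --gres=gpu:8

# First node in the allocation acts as the Ray head.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --port=6379 --block &
sleep 10

# Remaining nodes join the cluster as workers.
srun --nodes=1 --ntasks=1 --exclude="$head_node" \
    ray start --address="$head_node:6379" --block &
sleep 10

# Point the pipeline at the existing Ray cluster and run it.
RAY_ADDRESS="$head_node:6379" python pipeline.py
```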
Hi @saurabhbbjain, the original code uses 8 GPUs per step because we're using vLLM with tensor_parallel_size=8, and, as @plaguss mentions, it's also using the RayPipeline, which you don't need with a single-machine setup. I've updated the pipeline to work with your setup (haven't tested it):

- Removed the .ray() call so it uses the Pipeline instead of the RayPipeline.
- Adjusted tensor_parallel_size in all the vLLMs of the pipeline: MagpieGenerator will use 4 GPUs and the other two steps 2 GPUs each.

https://gist.github.com/gabrielmbmb/2df9a1041a649783efb3c3cf0ffb1376
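To see why the original configuration oversubscribes a single 8-GPU machine, here is a minimal arithmetic sketch in plain Python (independent of distilabel; the `fits_on_machine` helper and the `step_2`/`step_3` names are hypothetical, while the GPU counts come from the comment above):

```python
# Hypothetical helper: check whether the summed per-step GPU requests
# fit within the GPUs available on one machine. Illustrative only,
# not part of distilabel.
def fits_on_machine(step_gpus: dict, available_gpus: int) -> bool:
    return sum(step_gpus.values()) <= available_gpus

# Original setup: every vLLM step asks for tensor_parallel_size=8,
# so three concurrent steps would need 24 GPUs on one node.
original = {"MagpieGenerator": 8, "step_2": 8, "step_3": 8}
print(fits_on_machine(original, 8))  # False: 24 requested, 8 available

# Updated split from the gist: 4 + 2 + 2 = 8 GPUs in total.
updated = {"MagpieGenerator": 4, "step_2": 2, "step_3": 2}
print(fits_on_machine(updated, 8))  # True
```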
I am executing the ifeval_like_data.py file with 8 A100 GPUs and receiving the following error:
I am not able to find why I am receiving this error despite providing 8 GPUs. I am using the Llama-3.2-1B-Instruct model.