camall3n / onager

Lightweight python library for launching experiments and tuning hyperparameters, either locally or on a cluster
MIT License
20 stars 4 forks source link

add default enforce-binding flag for SLURM GPU #52

Closed zhouzypaul closed 1 year ago

zhouzypaul commented 1 year ago

Singh Saluja mentions that:

chedMD recommends using --gres-flags=enforce-binding flag that ensures all GPUs are allocated on the same CPU socket.

camall3n commented 1 year ago

Thanks for the PR! Is the recommendation to use this flag every time? Is there any reason not to? What if you need multiple GPUs?

zhouzypaul commented 1 year ago

Thanks for the diligence! I tried to read the documentation here but couldn't grasp it. I will follow up with Singh.

zhouzypaul commented 1 year ago

Is the recommendation to use this flag every time?

Yes

Is there any reason not to?

No

What if you need multiple GPUs?

Especially if one needs multiple GPUs, this flag ensures they can all talk to each other. If a job is requiring only one GPU, it doesn't matter whether to include this flag or not.

This may be a hardware-specific issue. On Oscar, all the GPU nodes have 2 CPUs with 8 GPUs, and so having this flag ensures the CPUs and GPU talk to each other correctly. However, I don't believe that having this flag will take away things for different GPU node architectures.

camall3n commented 1 year ago

I read the docs on schedmd, and I can't figure out what the purpose of the flag actually is. Is there a specific problem you were trying to solve?

Is there a reason the flag can't be added manually as a pass-through argument for slurm when launching jobs?

On Fri, Apr 14, 2023 at 8:14 AM zhouzypaul @.***> wrote:

Is the recommendation to use this flag every time? Yes

Is there any reason not to? No

What if you need multiple GPUs? Especially if one needs multiple GPUs, this flag ensures they can all talk to each other. If a job is requiring only one GPU, it doesn't matter whether to include this flag or not.

This may be a hardware-specific issue. On Oscar, all the GPU nodes have 2 CPUs with 8 GPUs, and so having this flag ensures the CPUs and GPU talk to each other correctly. However, I don't believe that having this flag will take away things for different GPU node architectures.

— Reply to this email directly, view it on GitHub https://github.com/camall3n/onager/pull/52#issuecomment-1508766782, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJQVQHXJDDWLY7RKM4IF5TXBFSWBANCNFSM6AAAAAAW33HWDY . You are receiving this because you commented.Message ID: @.***>

zhouzypaul commented 1 year ago

Yes, it can be manually added as a pass-through argument. The problem is that people don't generally realize the need for this flag, and one should always use it when launching GPU jobs, as long as I understand.

But we can table this flag if you don't feel confident.

camall3n commented 1 year ago

Okay, suppose we add it by default. Can you check if the user can override it by including --gres-flags=disable-binding as a pass-through arg afterwards?

On Fri, Apr 14, 2023 at 1:25 PM zhouzypaul @.***> wrote:

Yes, it can be manually added as a pass-through argument. The problem is that people don't generally realize the need for this flag, and one should always use it when launching GPU jobs, as long as I understand.

But we can table this flag if you don't feel confident.

— Reply to this email directly, view it on GitHub https://github.com/camall3n/onager/pull/52#issuecomment-1509212951, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJQVQCU4A4X7OOJM7NVTVDXBGXENANCNFSM6AAAAAAW33HWDY . You are receiving this because you commented.Message ID: @.***>

zhouzypaul commented 1 year ago

Do you mean passing --gres-flags=disable-binding in as an argument to onager?

It didn't give me any errors doing this, but I don't think this is a good idea. And I'm not sure whether the results are enable-binding or disable-binding.

I think the solutions are:

camall3n commented 1 year ago

I'm inclined to leave it how it is, since it currently works as a pass-through argument if necessary.