Support to lock GPUs for each task according to GPU_PER_TASK environment

kikakkz commented 1 year ago

The original implementation takes bellman.gpu.lock, which locks all GPUs by default. In my test with CUDA, a single 3080 finished C2 in 15m52s and dual 3080s finished C2 in 12m36s, so I think it's better to let users choose whether they need all GPUs for one C2 task.

An environment variable named GPU_PER_TASK is introduced for this. If GPU_PER_TASK is not set, we use all GPUs for one C2, just like the current implementation; if GPU_PER_TASK == 0, the behaviour is the same as when it is not set; if GPU_PER_TASK > 0, we use GPU_PER_TASK GPUs (up to devices.len()) for one C2 task.
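The three cases above can be sketched as a small helper. This is only an illustration of the described semantics, not the code from the PR: the function names `gpus_per_task` and `gpus_per_task_from_env` are hypothetical, and treating an unparsable value like an unset one is an assumption.

```rust
use std::env;

/// Decide how many GPUs a single task should lock.
/// `raw` is the value of GPU_PER_TASK (None when unset) and
/// `device_count` plays the role of devices.len().
/// (Hypothetical helper, not taken from the PR.)
fn gpus_per_task(raw: Option<&str>, device_count: usize) -> usize {
    match raw.and_then(|v| v.trim().parse::<usize>().ok()) {
        // Unset or 0: use all GPUs for one task, as in the current behaviour.
        // Assumption: a value that does not parse is treated like "unset".
        None | Some(0) => device_count,
        // Positive value: use that many GPUs, capped at the number available.
        Some(n) => n.min(device_count),
    }
}

/// Convenience wrapper that reads GPU_PER_TASK from the process environment.
#[allow(dead_code)]
fn gpus_per_task_from_env(device_count: usize) -> usize {
    let raw = env::var("GPU_PER_TASK").ok();
    gpus_per_task(raw.as_deref(), device_count)
}
```

For example, with two devices, `gpus_per_task(Some("1"), 2)` yields 1, while an unset variable or a value of 0 yields 2.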

vmx commented 1 year ago

With the CUDA_VISIBLE_DEVICES environment variable you can specify per program which GPUs should be visible. This way you can specify exactly which GPUs you would run it on. Would this work for your use case instead of this patch?

kikakkz commented 1 year ago

> With the CUDA_VISIBLE_DEVICES environment variable you can specify per program which GPUs should be visible. This way you can specify exactly which GPUs you would run it on. Would this work for your use case instead of this patch?

Not exactly. With CUDA_VISIBLE_DEVICES the whole process can only see the specified devices. So on a machine with multiple cards we have to run multiple processes, and it's even worse if we run PC1/PC2/C2 in one worker: we have to isolate memory, CPU threads, GPU devices and NVMe space for each process. That's really complicated for an automatic deployment script. With this PR, one process can run a single C2 task, or multiple C2 tasks concurrently, according to the GPU_PER_TASK value.

vmx commented 1 year ago

@kikakkz thanks for the explanation. To make sure I understand your use case correctly: what you want is basically to be able to say "x number of GPUs form one unit". So let's say you have 6 GPUs in a machine and you set GPU_PER_TASK=2, then you kind of have 3 units of GPUs which can be used independently, but each C2 would still use 2 GPUs.
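The "units" idea can be sketched by splitting device indices into groups. This is a minimal illustration, not the PR's locking code: `gpu_units` is a hypothetical name, devices are assumed to be identified by the indices 0..device_count, and letting a leftover smaller group form its own unit is an assumption.

```rust
/// Split the device indices 0..device_count into units of `per_task` GPUs.
/// A `per_task` of 0 means "all GPUs form a single unit".
/// (Hypothetical helper for illustration only.)
fn gpu_units(device_count: usize, per_task: usize) -> Vec<Vec<usize>> {
    // Cap the unit size at the number of devices; 0 means one unit with everything.
    let size = if per_task == 0 {
        device_count
    } else {
        per_task.min(device_count)
    };
    // No devices at all: nothing to group (also avoids a zero chunk size).
    if size == 0 {
        return Vec::new();
    }
    let indices: Vec<usize> = (0..device_count).collect();
    // Assumption: if device_count is not a multiple of `size`,
    // the remaining devices form a smaller final unit.
    indices.chunks(size).map(|unit| unit.to_vec()).collect()
}
```

With 6 GPUs and a unit size of 2, `gpu_units(6, 2)` gives three units, `[0, 1]`, `[2, 3]` and `[4, 5]`, each of which could serve one C2 task.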

vmx commented 1 year ago

Unrelated to the code, but still relevant: @kikakkz would it be possible for you to sign into CircleCI? I know it sounds weird, but this way the CI would then be triggered correctly.

kikakkz commented 1 year ago

> Unrelated to the code, but still relevant: @kikakkz would it be possible for you to sign into CircleCI? I know it sounds weird, but this way the CI would then be triggered correctly.

I tried, but it keeps failing, hmm~ let me try again~

vmx commented 1 year ago

> I tried, but it keeps failing, hmm~ let me try again~

Clearing the cookies might help (I got that information from CircleCI support, once there was a similar issue).

kikakkz commented 1 year ago

> @kikakkz thanks for the explanation. To make sure I understand your use case correctly: what you want is basically to be able to say "x number of GPUs form one unit". So let's say you have 6 GPUs in a machine and you set GPU_PER_TASK=2, then you kind of have 3 units of GPUs which can be used independently, but each C2 would still use 2 GPUs.

Yes, exactly. Actually, in my test, dual GPUs for one C2 do not improve performance much compared to a single GPU for one C2 task. So in practice I would like to use a single GPU per C2 task and run multiple C2 tasks concurrently within one worker process. And of course, if I have 6 GPUs, I can set GPU_PER_TASK to 2 and then run 3 C2 tasks concurrently.
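A rough back-of-the-envelope check of this trade-off, using the timings reported at the top of the thread (15m52s on one 3080, 12m36s on two). The helper name is hypothetical, and the calculation assumes that concurrent single-GPU tasks do not slow each other down, which was not measured here.

```rust
/// C2 wall-clock time on a single 3080, from the test above: 15m52s.
const SINGLE_GPU_SECS: u64 = 15 * 60 + 52;
/// C2 wall-clock time on two 3080s, from the test above: 12m36s.
const DUAL_GPU_SECS: u64 = 12 * 60 + 36;

/// Average seconds per finished task when `concurrent` tasks run side by side
/// and each one takes `secs_per_task`.
/// Assumption: the tasks do not contend for CPU, memory or PCIe bandwidth.
fn effective_secs_per_task(secs_per_task: u64, concurrent: u64) -> f64 {
    secs_per_task as f64 / concurrent as f64
}
```

On a two-GPU machine, two concurrent single-GPU tasks would average 952 / 2 = 476 seconds per C2 under that assumption, compared with 756 seconds when one C2 uses both GPUs.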

kikakkz commented 1 year ago

> > I tried, but it keeps failing, hmm~ let me try again~
>
> Clearing the cookies might help (I got that information from CircleCI support, once there was a similar issue).

Trying now, 😄

kikakkz commented 1 year ago

@vmx I revoked access, cleared the cache and cookies, then logged in to CircleCI again, but it still fails. When I tried to inspect the failure and open the configuration file, I got a URL like 'https://app.circleci.com/projects/github/filecoin-project/bellperson/config/?branchName=&pipelineNumber=1784', in which branchName is missing. If I set branchName to master, I can get the CircleCI config.yml file. Could you please help?

kikakkz commented 1 year ago

It also seems I cannot rerun the failed one, because that record is missing all its info. Should I recreate this PR and close this one?

kikakkz commented 1 year ago

I pushed one more empty line and it seems that triggered the CI~ let's wait, 😄

vmx commented 1 year ago

CI still seems weird. When you work through the code review, you can try to create another PR and we'll see if that helps.

kikakkz commented 1 year ago

https://github.com/filecoin-project/bellperson/pull/300 I created this one to test CircleCI instead.

Support to lock GPUs for each task according to GPU_PER_TASK environment #298