czimaginginstitute / MotionCor3

Anisotropic correction of beam induced sample motion for cryo-electron microscopy and tomography
BSD 3-Clause "New" or "Revised" License

Flag to Cancel check of free GPUs? #17

Closed pconesa closed 2 days ago

pconesa commented 4 months ago

Hi! We have a couple of cases where this CheckFreeGPus strategy fails.

A) Multi-user MotionCor2 runs on the same machine. This was reported by one of our users (Scipion users) who shares a single workstation among multiple accounts. I believe this setup is risky and hard to deal with, but the concrete problem is that MotionCor2 fails to read the txt file generated by another account.

B) Running MotionCor2 on a cluster through SLURM. SLURM always presents the allocated GPU to a job as GPU ID 0, and it is SLURM's responsibility to map that ID 0 to the GPU that is actually available (say, GPU 4).

If a second job is sent to that node, MotionCor2 will again be called with 0 as the GPU ID, although the device behind it is not GPU 0 but a different GPU.

I believe MotionCor2 actually uses the right GPU as defined by SLURM (I think this is transparent to MotionCor2), but it writes 0 in the txt file, which is not the GPU it is really using.

Is this something that can be fixed, at least by canceling (with a flag) the GPU check done by motioncor2?
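
Case B can be illustrated with a short sketch. It assumes that SLURM exposes the allocated devices to the job through the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA uses to restrict and renumber the GPUs a process can see; whether and how this is set depends on the cluster configuration. The helper `logical_to_physical` is hypothetical and is not part of MotionCor.

```python
import os


def logical_to_physical(logical_id, env=None):
    """Map a logical CUDA device index to its entry in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only; not part of MotionCor.
    When the variable is unset, the logical index is returned unchanged
    (as a string), since no renumbering takes place.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return str(logical_id)
    # Entries may be indices or device UUIDs, so keep them as strings.
    entries = [e.strip() for e in visible.split(",") if e.strip()]
    if not 0 <= logical_id < len(entries):
        raise IndexError(f"logical GPU {logical_id} is not visible to this job")
    return entries[logical_id]


# Two jobs on the same node both ask for "GPU 0" but run on different devices.
job1_env = {"CUDA_VISIBLE_DEVICES": "4"}
job2_env = {"CUDA_VISIBLE_DEVICES": "5"}
print(logical_to_physical(0, job1_env))  # 4
print(logical_to_physical(0, job2_env))  # 5
```

Under this assumption, any bookkeeping keyed on the logical ID 0 cannot tell the two jobs apart, even though they occupy different devices.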

danielmarchan3 commented 4 months ago

I am glad Pablo is opening this up; I am one of the B users. I am attempting to submit several MotionCor processes in parallel to a SLURM queue. Each MotionCor process is assigned GPU ID 0. However, this does not necessarily correspond to the GPU ID assigned by SLURM, and several of these jobs are failing, claiming that all GPUs are in use when this is not the case. It would be very helpful to have a flag to disable the MotionCor2_FreeGpus.txt file when using queue systems.
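
The failure described here can be reproduced with a toy model. The class below is a hypothetical stand-in for a node-wide record of GPU IDs in use; it is an assumption made for illustration and does not reflect how MotionCor actually reads or writes its free GPU file.

```python
class FreeGpuRegistry:
    """Toy stand-in for a node-wide list of GPU IDs in use (hypothetical)."""

    def __init__(self):
        self._in_use = set()

    def claim(self, gpu_id):
        """Return True if gpu_id was free and is now claimed, else False."""
        if gpu_id in self._in_use:
            return False
        self._in_use.add(gpu_id)
        return True

    def release(self, gpu_id):
        """Mark gpu_id as free again; a no-op if it was not claimed."""
        self._in_use.discard(gpu_id)


registry = FreeGpuRegistry()
# Both SLURM jobs are told to use "GPU 0", although they run on different devices.
first = registry.claim(0)
second = registry.claim(0)
print(first, second)  # True False
```

In this model the second job is refused even though its device is idle, which matches the symptom of jobs claiming that all GPUs are in use.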

pconesa commented 1 month ago

Any update on this? Is there a way to cancel the GPU availability check?

pconesa commented 1 month ago

We are thinking about making a PR for this, @szhengczii, but we need some advice:

1.- Do we actually need a new argument, or could we say "if UseGpus" then we activate the "GPU management code inside motioncor"? What do you think?

2.- Is the documentation available to edit, so that we can describe the new argument?

sunchang1990 commented 1 month ago

I have tried feeding the actual GPU ID assigned by SLURM to the MotionCor3 command; however, the job still failed, possibly because that ID didn't match the content of MotionCor3_FreeGpus.txt.

Adding an option to disable the GPU availability check would give more flexibility to running MotionCor3 in a job queuing environment.

szhengczii commented 1 month ago

I think it is a good idea to decide, based on "-UseGpus", whether to check the free GPU file or not. When -UseGpus is not present, the current implementation assumes all GPUs are used. Hence, we can skip the check when -UseGpus is not there or when it asks for all the GPUs given on the command line. Otherwise we still check.

Best, Shawn
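
The rule proposed above can be written down as a minimal sketch. The function name and its parameters are hypothetical, and treating a `-UseGpus` value larger than the number of listed GPUs as "all GPUs" is an assumption, not something stated in this thread.

```python
def should_check_free_gpus(gpu_ids, use_gpus=None):
    """Decide whether the free GPU file should be consulted (hypothetical helper).

    gpu_ids:  GPU IDs given on the command line.
    use_gpus: value of -UseGpus, or None when the flag is absent.
    """
    if use_gpus is None:
        # Flag absent: all listed GPUs are used, so skip the check.
        return False
    if use_gpus >= len(gpu_ids):
        # Flag asks for every listed GPU (">=" is an assumption): skip the check.
        return False
    # Only a subset is requested, so free GPUs still have to be selected.
    return True


print(should_check_free_gpus([0]))              # False
print(should_check_free_gpus([0, 1, 2, 3], 4))  # False
print(should_check_free_gpus([0, 1, 2, 3], 2))  # True
```

With this rule, a SLURM job that passes a single GPU ID and omits `-UseGpus` would never touch the free GPU file.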

pconesa commented 4 weeks ago

Thanks Shawn!

pconesa commented 2 days ago

This may be solved by:

https://github.com/czimaginginstitute/MotionCor3/pull/24 ?