RnDProjectsDeebul / ManojKolpeThesis

Mozilla Public License 2.0

issue with university cluster #17

Open Manojkl opened 2 years ago

Manojkl commented 2 years ago

My job on the university cluster keeps getting stopped automatically with the following error. Have you ever faced this issue?

/var/spool/slurmd/job481568/slurm_script: line 19: 15051 Killed                  python unet_v7.py
slurmstepd: error: Detected 1 oom-kill event(s) in step 481568.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
deebuls commented 2 years ago

It's an out-of-memory issue, either GPU memory or RAM. If it is the GPU, reduce the batch size. If it is RAM, increase the memory allocation in the Slurm script.

If the GPU runs out of memory only after some time, check your code: are you initializing some large variable and not deleting it? Ideally you should not create any variables on the GPU other than the ones that are required.
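
To tell whether host RAM is what keeps growing, one option is to log the peak memory of the Python process during training. The following is only a minimal sketch using the standard library `resource` module; the helper names `peak_rss_mb` and `log_memory` and the toy loop are made up for illustration and are not part of `unet_v7.py`. Printing with `flush=True` also makes each line reach the Slurm output file immediately instead of sitting in a buffer.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident memory (host RAM) of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step: int) -> str:
    """Print and return one memory log line for the given step."""
    message = f"step {step}: peak host RAM {peak_rss_mb():.1f} MiB"
    # flush=True pushes the line to the output file right away.
    print(message, flush=True)
    return message


if __name__ == "__main__":
    # Hypothetical stand-in for a training loop that keeps data alive.
    kept = []
    for step in range(3):
        kept.append(bytearray(10 * 1024 * 1024))  # hold on to 10 MiB per step
        log_memory(step)
```

If the logged value keeps climbing until the job is killed, the problem is probably on the RAM side, and the fix is either to free what is being accumulated or to request more memory in the Slurm script.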

Manojkl commented 2 years ago

I was plotting a figure inside the script to check the result. The cluster has been behaving very strangely these days. Previously I could see both the error and the output files, but now the output file is not getting updated. Also, I can load the pickle file in Google Colab, but if I open the same file on the cluster, it throws `EOFError: Ran out of input`. Sometimes all of the code even disappears.
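
For reference, `pickle.load` raises `EOFError: Ran out of input` when the file it reads is empty, which can happen if the job was killed while writing or if the copy on the cluster is incomplete. Whether that is the cause here is an assumption. The sketch below, with the hypothetical helpers `save_pickle_atomic` and `load_pickle_checked`, writes to a temporary file first and only then renames it, so a killed job cannot leave a half-written pickle under the final name.

```python
import os
import pickle
import tempfile


def save_pickle_atomic(obj, path):
    """Write obj to a temporary file, then rename it onto path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes are on disk
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_pickle_checked(path):
    """Load a pickle, failing with a clearer message if the file is empty."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty; it was probably not fully written or copied")
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Comparing the file size on the cluster (`os.path.getsize`) with the size of the copy that loads fine in Colab is a quick way to check whether the two files are actually the same.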