Open Manojkl opened 2 years ago
It's an out-of-memory issue, either GPU or RAM. If it is the GPU, reduce the batch size. If it is RAM, increase the memory allocation in the Slurm script.
If the GPU error only appears after the job has run for some time, check your code: are you initializing some large variable and not deleting it? Ideally you should not create any variables on the GPU other than the ones required.
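One way to check whether GPU memory grows over time is to log the allocated amount every few steps. The sketch below assumes the training code uses PyTorch (the thread does not say which framework is in use); `gpu_memory_mb` and `log_gpu_memory` are hypothetical helper names, and the import is guarded so the snippet also runs on a machine without PyTorch or without a GPU.

```python
try:
    import torch  # assumption: the training script is written with PyTorch
except ImportError:
    torch = None


def gpu_memory_mb():
    """Return MiB currently allocated by tensors on the default GPU, or None if unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / (1024 ** 2)


def log_gpu_memory(step, every=100):
    """Print allocated GPU memory every `every` steps; return the logged value (or None)."""
    if step % every != 0:
        return None
    used = gpu_memory_mb()
    if used is not None:
        print(f"step {step}: {used:.1f} MiB allocated on GPU")
    return used
```

Calling `log_gpu_memory(step)` inside the training loop would show whether the number keeps climbing. If it does, a common cause in PyTorch code is keeping tensors that are still attached to the computation graph, for example accumulating `loss` instead of `loss.item()`.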
On Wed, Aug 3, 2022, 4:52 PM Manoj Kolpe @.***> wrote:
Assigned #17 https://github.com/RnDProjectsDeebul/ManojKolpeThesis/issues/17 to @deebuls https://github.com/deebuls.
I was plotting a figure inside the script to check the result. The cluster has been behaving very strangely these days. Previously I could see both the error and the output logs, but now the output is not getting updated. Also, I can load the pickle file in Google Colab, but if I open the same file on the cluster, it throws an error: Pickle - EOFError: Ran out of input. Strange behavior. Sometimes all the code disappears.
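For what it's worth, `pickle.load` raises `EOFError: Ran out of input` when the file it reads is empty, which can happen if the job was killed before the file was fully written or if the copy on the cluster differs from the one uploaded elsewhere. The sketch below is one possible way to rule that out, not a diagnosis of this particular case; `save_pickle_atomic` and `load_pickle_checked` are hypothetical helper names, and only the Python standard library is used.

```python
import os
import pickle
import tempfile


def save_pickle_atomic(obj, path):
    """Write to a temporary file first, then rename, so `path` is never left half-written."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces the target in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_pickle_checked(path):
    """Load a pickle, but fail with a clearer message when the file is empty."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is empty (0 bytes); it was probably never fully written")
    with open(path, "rb") as f:
        return pickle.load(f)
```

Comparing the file size on the cluster with the size of the copy that loads in Colab is a quick first check before changing any code.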
The university cluster is somehow stopping automatically with the following error. Have you ever faced such issues?