Closed: bidhya closed this issue 10 months ago
The likely problem is memory-related, even though this is not reported because the job is re-queued.
Possible solution: try running v14.jl with fewer pixels (say, a step size of 70 on Discover).
It is also likely failing on skynode, which has 36 cores. This translates to 36 x 4 GB = 144 GB, even though 190 GB is available on the node.
Use a step size of ~40 or less. The more cores a node has, the smaller the step size should be.
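As a rough back-of-the-envelope sketch of the memory budget described above (the one-worker-per-core and ~4 GB-per-worker figures come from this thread; the calculation itself is only illustrative):

```bash
# Illustrative memory-budget check for a skynode-class node.
# Assumptions: one worker per core, ~4 GB peak per worker at the current step size.
CORES=36
GB_PER_WORKER=4
NODE_MEM_GB=190

DEMAND_GB=$((CORES * GB_PER_WORKER))   # 36 * 4 = 144 GB
echo "Estimated peak demand: ${DEMAND_GB} GB of ${NODE_MEM_GB} GB on the node"
# 144 GB is close enough to the 190 GB limit that transient spikes can push the
# node over memory; a smaller step size (~40 or less) lowers each worker's
# footprint, and nodes with more cores need an even smaller step size.
```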
Adding this line to the SLURM job script prevents the automatic re-queueing of a failed job, effectively fixing this issue (see the sketch below).
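The exact line from the original script is not preserved in this thread, but SLURM's standard way to disable automatic re-queueing is the `--no-requeue` option; a minimal sketch of a batch script using it (the job name, resources, and `julia` call are placeholders, not taken from the repository):

```bash
#!/bin/bash
#SBATCH --job-name=401_480.job
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --no-requeue      # do not automatically re-queue this job after NODE_FAIL

# Run the processing script (path and arguments are illustrative).
julia v14.jl
```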
This error occurred again when running on Milan nodes with 100 rows, again due to memory overuse. Solution: reduce the number of rows to run on each node.
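A minimal sketch of one way to act on that advice: submit the same row range as more, smaller jobs so each node holds fewer rows at once. The chunk size, job-name pattern, and `submit_rows.sh` wrapper below are hypothetical, not from this repository; the row numbers only mirror the job names quoted in this thread.

```bash
#!/bin/bash
# Split rows 401..950 into chunks of 50 rows so that each job (and node)
# processes fewer rows and stays within memory. All names are placeholders.
START=401
END=950
CHUNK=50

for ((lo = START; lo <= END; lo += CHUNK)); do
    hi=$((lo + CHUNK - 1))
    ((hi > END)) && hi=$END
    sbatch --job-name="${lo}_${hi}.job" submit_rows.sh "$lo" "$hi"
done
```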
Name=801_950.job Failed, Run time 02:27:52, NODE_FAIL, ExitCode 0
Again a memory error on a Milan node, even with garbage collection.
Subject of Email: Slurm Job_id=9555 Name=401_480.job Failed, Run time 00:38:31, NODE_FAIL, ExitCode 0
The slurm output file does not have much info.
Discover re-queues this job with the following email subject line: Slurm Job_id=9555 Name=401_480.job Failed, Run time 00:38:31, NODE_FAIL, ExitCode 0. This re-queueing can cause repeated job failures if temporary files remain on the node (a different node where another job previously failed) and cannot be overwritten or deleted.