bidhya / verse


NODE_FAIL error on Discover #1

Closed bidhya closed 10 months ago

bidhya commented 1 year ago

Subject of Email: Slurm Job_id=9555 Name=401_480.job Failed, Run time 00:38:31, NODE_FAIL, ExitCode 0

The slurm output file does not have much info.

Discover re-queues this job with the same email subject line: Slurm Job_id=9555 Name=401_480.job Failed, Run time 00:38:31, NODE_FAIL, ExitCode 0. This re-queueing can cause repeated job failures if temporary files remain on the node (possibly left behind by a different job's failure) and cannot be overwritten or deleted.

bidhya commented 1 year ago

The problem is likely memory-related, even though this is not reported because the job is re-queued.

Possible solution: try running v14.jl with fewer pixels (say, a step size of 70 on Discover).

bidhya commented 1 year ago

Also likely failing on skynode, which has 36 cores. That translates to 36 × 4 GB = 144 GB, even though 190 GB is available on the node.

bidhya commented 1 year ago

Use a step size of ~40 or less. The more cores, the smaller the step size should be.
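The "more cores, smaller step size" rule could be sketched as a simple inverse scaling from the 36-core / step-size-40 baseline mentioned above. This helper is a hypothetical illustration, not part of v14.jl:

```python
def suggested_step_size(cores, baseline_cores=36, baseline_step=40):
    """Scale the ~40-row step size for a 36-core node inversely with core count.

    Assumes per-core memory is roughly constant (~4 GB/core, per the
    comment above), so total step size should shrink as cores grow.
    """
    return max(1, (baseline_step * baseline_cores) // cores)

# Examples: 36 cores -> 40, 72 cores -> 20, 144 cores -> 10.
```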

bidhya commented 10 months ago

#SBATCH --no-requeue

Adding this line to the Slurm job script prevents the automatic re-queueing of a failed job, effectively fixing this issue.
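A minimal sketch of where the directive sits in a job script (the job name, walltime, and launch line are placeholders, not the actual Discover settings):

```shell
#!/bin/bash
#SBATCH --job-name=401_480.job   # example name, matching the failing job above
#SBATCH --time=01:00:00          # placeholder walltime
#SBATCH --no-requeue             # do not automatically re-queue on NODE_FAIL

julia v14.jl                     # v14.jl is the script mentioned above
```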

bidhya commented 10 months ago

This error occurred again when running on Milan nodes with 100 rows, again due to memory overuse. Solution: reduce the number of rows run on each node.

bidhya commented 9 months ago

Name=801_950.job Failed, Run time 02:27:52, NODE_FAIL, ExitCode 0. Again a memory error on a Milan node, even with garbage collection.