ArjunaCluster / ArjunaUsers

Arjuna Public Documentation for Users
https://arjunacluster.github.io/ArjunaUsers/

I/O error #20

Closed yaomz16 closed 3 years ago

yaomz16 commented 3 years ago

I submitted a GPU job successfully, but it failed immediately, and I get this error message in the StdErr file:

slurmstepd: error: execve(): /tmp/slurmd/job2997813/slurm_script: No such file or directory
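For anyone checking whether their own job hit the same failure, here is a minimal Python sketch that scans StdErr text for the slurmstepd message quoted above and pulls out the job ID and script path. The function name and the regular expression are illustrative assumptions based only on the single error line in this issue; they are not part of Slurm or of the cluster tooling.

```python
import re

# Pattern modelled on the one error line quoted in this issue (an assumption:
# other Slurm versions or sites may word or locate the script differently).
EXECVE_ERROR = re.compile(
    r"slurmstepd: error: execve\(\): "
    r"(?P<path>\S+/job(?P<job_id>\d+)/slurm_script): "
    r"No such file or directory"
)


def find_missing_script_errors(log_text):
    """Return (job_id, path) pairs for every matching error in log_text."""
    return [
        (m.group("job_id"), m.group("path"))
        for m in EXECVE_ERROR.finditer(log_text)
    ]


# The sample is the message reported above.
sample = (
    "slurmstepd: error: execve(): /tmp/slurmd/job2997813/slurm_script: "
    "No such file or directory"
)
print(find_missing_script_errors(sample))
```

To check a real log, read the file contents and pass them to the function in place of `sample`.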

awadell1 commented 3 years ago

Can you fill in the "Performance Issue" info? (Here in this issue)

Your Name:
Your Andrew ID:
Node(s) on which the problem occurred:
Expected Behavior:
Observed Behavior:
Location of Log file Showing the Error:
Location of Script showing [Minimum Working Example]:

yaomz16 commented 3 years ago

Sure!

aabills commented 3 years ago

This looks like the NFS issue again. Try it again and see if it works.

aabills commented 3 years ago

Add -w c002
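A rough sketch of what that suggestion amounts to, assuming the job is submitted with sbatch and the script is the TrainGPU.sh mentioned later in this thread. The helper name is hypothetical; `-w` is the short form of sbatch's `--nodelist` option, which requests a specific node.

```python
import shlex


def sbatch_command(script, node):
    """Build an sbatch argument list that requests one specific node via -w."""
    return ["sbatch", "-w", node, script]


# Node name comes from the comment above; the script name is an assumption.
cmd = sbatch_command("TrainGPU.sh", "c002")
print(shlex.join(cmd))  # prints: sbatch -w c002 TrainGPU.sh
```

The list can be handed to `subprocess.run(cmd)` on a login node, or the printed line can be typed directly in the shell.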

yaomz16 commented 3 years ago

I posted a new issue just now

yaomz16 commented 3 years ago

> This looks like the NFS issue again. Try it again and see if it works.

Nope, my job still fails for the same reason.

awadell1 commented 3 years ago

Consolidating Info

Your Name: "Archie" Mingze Yao
Your Andrew ID: mingzeya
Node(s) on which the problem occurred: c002
Expected Behavior: Job running normally
Observed Behavior: Failed at once
Location of Log file Showing the Error: /home/mingzeya/Phase_field_project/multi_geometries/python_impl/all_constants_changed/case1/restart_at_epoch_45/restart_at_epoch_60/restart_at_epoch_115/restart_at_epoch_165/error_2997814.err
Location of Script showing [Minimum Working Example]: /home/mingzeya/Phase_field_project/multi_geometries/python_impl/all_constants_changed/case1/restart_at_epoch_45/restart_at_epoch_60/restart_at_epoch_115/restart_at_epoch_165/train.py

Please also attach any logs and the submission script to this issue.

I get the following error in the StdErr file: "slurmstepd: error: execve(): /tmp/slurmd/job2997814/slurm_script: No such file or directory"

If you are not a frequent github user, please also provide us with a contact email here: Contact Email: amyao@cmu.edu

emilannevelink commented 3 years ago

FWIW, I'm getting the same error when submitting a CPU job.

Your Name: Emil
Your Andrew ID: eannevel
Node(s) on which the problem occurred: f001
Expected Behavior: normal run
Observed Behavior: exited within 10 seconds
Location of Log file Showing the Error: /home/eannevel/ARPA-E/slabmol/logs/error.2997816
Location of Script showing [Minimum Working Example]: no working example

awadell1 commented 3 years ago

Where are your sbatch scripts? @yaomz16 @emilannevelink

aabills commented 3 years ago

Can reproduce. Interactive jobs seem fine, though.

emilannevelink commented 3 years ago

My script is at /home/eannevel/ARPA-E/slabmol/scripts/mol_gpaw_MD.sh

yaomz16 commented 3 years ago

My script is at /home/mingzeya/Phase_field_project/multi_geometries/python_impl/all_constants_changed/case1/restart_at_epoch_45/restart_at_epoch_60/restart_at_epoch_115/restart_at_epoch_165/TrainGPU.sh

aabills commented 3 years ago

This should be fixed