aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

waiting for transport task: submit #6511

Closed SteDEr97 closed 2 months ago

SteDEr97 commented 2 months ago

Hello everyone! I am a new user of Quantum Mobile, and I am currently going through the AiiDA tutorials. I am having problems with the tutorial on running external codes, i.e., Quantum ESPRESSO (https://aiida-tutorials.readthedocs.io/en/latest/sections/running_processes/basics.html#). Towards the end of the tutorial, after submitting the job from the verdi shell, I am supposed to exit the verdi shell and check the process list. When I do so, the "Process status" column shows the following message: "waiting for transport task: submit". I have tried stopping the daemon, starting it again, restarting it, adding workers, and running verdi process play, but nothing seems to work. Moreover, after some time the "Process status" becomes "Pausing after failed transport task: submit_calculation failed 5 times consecutively".

The steps that I have followed are the ones described in the tutorial:

    code = load_code()
    builder = code.get_builder()
    structure = load_node()
    builder.structure = structure
    pseudo_family = load_group('SSSP/1.1/PBE/efficiency')
    pseudos = pseudo_family.get_pseudos(structure=structure)
    builder.pseudos = pseudos
    parameters = {
        'CONTROL': {
            'calculation': 'scf',  # self-consistent field
        },
        'SYSTEM': {
            'ecutwfc': 30.,  # wave function cutoff in Ry
            'ecutrho': 240.,  # density cutoff in Ry
        },
    }
    builder.parameters = Dict(parameters)
    KpointsData = DataFactory('core.array.kpoints')
    kpoints = KpointsData()
    kpoints.set_kpoints_mesh([4,4,4])
    builder.kpoints = kpoints
    builder.metadata.options.resources = {'num_machines': 2}
    from aiida.engine import submit
    calcjob_node = submit(builder)

After this, I don't receive any output unless I type calcjob_node+ENTER or submit(builder)+ENTER. I set 'num_machines' to 2 in builder.metadata.options.resources because, if I don't, the process finishes immediately with code 305 and the report says mpirun is missing.

I am using Quantum Mobile 24.04.0, with AiiDA v2.4.3 and RabbitMQ v3.8.2.

Thank you!

sphuber commented 2 months ago

There is a problem with submitting the calculation. It tried 5 times and failed 5 times, which is why the calculation is paused. You can run verdi process report PK (replacing PK with the PK of your calculation) to get more information about the reason for the failure.
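To illustrate why the status only changes to "paused" after a while, here is a minimal sketch of a retry loop with exponential backoff. It is not AiiDA's implementation; the names retry_with_backoff, failing_submit and run_demo are hypothetical, and the intervals are shortened so the example finishes quickly.

```python
import asyncio


async def retry_with_backoff(coro_factory, initial_interval=0.01, max_attempts=5):
    """Await coro_factory() until it succeeds or max_attempts is reached.

    The waiting time doubles after every failed attempt.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts:
                # Give up and let the caller decide what to do (e.g. pause).
                raise
            await asyncio.sleep(interval)
            interval *= 2


attempts = []  # records every call to the failing task


async def failing_submit():
    """Hypothetical stand-in for a submission the scheduler always rejects."""
    attempts.append(1)
    raise RuntimeError("Batch job submission failed: Node count specification invalid")


def run_demo():
    """Run the retry loop and describe the outcome as a string."""
    try:
        asyncio.run(retry_with_backoff(failing_submit, initial_interval=0.001))
    except RuntimeError as exc:
        return f"paused after {len(attempts)} failed attempts: {exc}"
    return "submitted"
```

Because the underlying error is the same on every attempt, retrying cannot help here; the report is the place to find out what the scheduler actually complained about.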

SteDEr97 commented 2 months ago
Hello! Thank you for your fast answer! The report says this:

    557: CalcJobState.SUBMITTING
    Scheduler output: N/A
    Scheduler errors: N/A
    12 LOG MESSAGES:
    +-> ERROR at 2024-07-03 18:55:46.313591+02:00
    Traceback (most recent call last):
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
        result = await coro()
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
        return execmanager.submit_calculation(node, transport)
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 378, in submit_calculation
        result = scheduler.submit_from_script(workdir, submit_script_filename)
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 410, in submit_from_script
        return self._parse_submit_output(*result)
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 430, in _parse_submit_output
        raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
    aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
    stdout=
    stderr=sbatch: error: Batch job submission failed: Node count specification invalid

Everything from the "ERROR" line onwards is repeated another 4 times (I guess because it tries five times and then stops trying). After that it says:

    +-> WARNING at 2024-07-03 19:00:46.737682+02:00
    maximum attempts 5 of calling do_submit, exceeded
    +-> ERROR at 2024-07-03 19:03:57.146867+02:00
    Traceback (most recent call last):
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
        result = await coro()
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
        return execmanager.submit_calculation(node, transport)
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 378, in submit_calculation
        result = scheduler.submit_from_script(workdir, submit_script_filename)
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/scheduler.py", line 410, in submit_from_script
        return self._parse_submit_output(*result)
      File "/home/max/.conda/envs/aiida/lib/python3.9/site-packages/aiida/schedulers/plugins/slurm.py", line 430, in _parse_submit_output
        raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')
    aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
    stdout=
    stderr=sbatch: error: Batch job submission failed: Node count specification invalid

And then the "ERROR" is repeated another 4 times. Finally, it says:

    +-> WARNING at 2024-07-03 19:09:05.057112+02:00
    | maximum attempts 5 of calling do_submit, exceeded

I will leave a .txt file with the whole report. Thank you for your help! AiiDA_Waiting_ERROR.txt

sphuber commented 2 months ago

This is the relevant error:

stderr=sbatch: error: Batch job submission failed: Node count specification invalid
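Since the same traceback is repeated for every attempt, it can help to pull out only the scheduler's stderr lines from a saved report. The following is a rough sketch that assumes the report has been saved as plain text with one log line per row; scheduler_stderr_lines and sample_report are hypothetical names, not part of AiiDA.

```python
import re


def scheduler_stderr_lines(report_text):
    """Return the distinct 'stderr=' messages found in a report, in order."""
    seen = []
    # '.' does not match newlines, so each match covers the rest of one line.
    for match in re.finditer(r"stderr=(.*)", report_text):
        message = match.group(1).strip()
        if message and message not in seen:
            seen.append(message)
    return seen


# Shortened excerpt of the report quoted in this thread.
sample_report = """\
aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
stdout=
stderr=sbatch: error: Batch job submission failed: Node count specification invalid
+-> WARNING at 2024-07-03 19:00:46.737682+02:00
maximum attempts 5 of calling do_submit, exceeded
aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
stdout=
stderr=sbatch: error: Batch job submission failed: Node count specification invalid
"""

messages = scheduler_stderr_lines(sample_report)
```

With the excerpt above, the repeated tracebacks collapse to a single message from sbatch.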

You requested 2 nodes (num_machines: 2), and I am guessing that your computer does not have multiple nodes or that SLURM is configured to not allow multi-node jobs. You should revert num_machines back to 1.

I have given 2 as 'num_machines', because if I don't, the process finishes immediately with code 305 and the report says mpirun is missing.

The original problem was that the calculation was not run with MPI enabled. You can do so by setting builder.metadata.options.withmpi = True. That should enable MPI and hopefully the calculation should run.
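Put together, the two changes suggested in this thread look roughly like the sketch below. The helper apply_single_node_mpi_options is a hypothetical name, and a SimpleNamespace is used as a stand-in for the builder so that the snippet runs without an AiiDA profile; in the verdi shell you would set the same two attributes on the real builder from the tutorial before calling submit(builder).

```python
from types import SimpleNamespace


def apply_single_node_mpi_options(builder):
    """Set the two options discussed above on a builder-like object."""
    # Request a single machine: the scheduler rejected the two-node request.
    builder.metadata.options.resources = {'num_machines': 1}
    # Enable MPI so that the code is launched through mpirun.
    builder.metadata.options.withmpi = True
    return builder


# Stand-in for the tutorial's builder (assumption: only the attribute layout
# builder.metadata.options is mimicked here, nothing else).
builder = SimpleNamespace(metadata=SimpleNamespace(options=SimpleNamespace()))
apply_single_node_mpi_options(builder)
```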

SteDEr97 commented 2 months ago

It works!! Thank you very much! :smiley: