Open ltalirz opened 5 years ago
Hi @ltalirz, I just found an issue with this solution. After I submitted calculations, the daemon paused the process:
518 1h ago NetworkCalculation ⏸ Waiting Pausing after failed transport task: update_calculation failed 5 times consecutively
I inspected the daemon log and found that it tries to get the job information but fails:
File "/storage/brno9-ceitec/home/pezhman/projects/git_repos/aiida-core-1.0.0b4/aiida/schedulers/plugins/pbsbaseclasses.py", line 404, in _parse_joblist_output
raise SchedulerParsingError("I did not find the header for the first job")
aiida.schedulers.scheduler.SchedulerParsingError: I did not find the header for the first job
The reason is that `qstat -f -u<username>` does not produce the same detailed output as `qstat -f`; it needs the extra `-w` flag to do so.
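A quick way to confirm this diagnosis is to check whether the captured output starts with the `Job Id:` record header that full-format `qstat -f` output uses for each job. The helper below is only a sketch, not the aiida-core parser: the function name and the sample strings are invented for illustration.

```python
def has_full_format_header(stdout: str) -> bool:
    """Return True if the first non-empty line looks like a `qstat -f` job header."""
    for line in stdout.splitlines():
        if line.strip():
            # Full-format records begin with a line such as "Job Id: 518.server1".
            return line.startswith("Job Id:")
    # Empty output: no header at all.
    return False


# Made-up samples, for illustration only.
full_output = "Job Id: 518.server1\n    Job_Name = test\n    job_state = R\n"
table_output = "\nserver1:\nJob ID   Username Queue\n------   -------- -----\n"

print(has_full_format_header(full_output))
print(has_full_format_header(table_output))
```

Running such a check on the output of the exact command the plugin issues shows whether the `-u` variant falls back to the table layout on a given cluster.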
The other point concerns HPC centers with several servers, like one of ours. In this case a job may be executed on a server other than the default one, so `qstat` returns empty output for it. Therefore, my current command line in
https://github.com/aiidateam/aiida-core/blob/6344b8da6b65e420d8161da32dedafe2df9124d0/aiida/schedulers/plugins/pbsbaseclasses.py#L162
looks like:
command = ['qstat', '-f', '-w', '@<server1> @<server2> @<server3>']
I also tested the timings of `qstat -f` and `qstat -f -w -u<username>`, which in my case are ~10s and ~0.1s, respectively.
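To make the variants discussed here concrete, the following is a minimal sketch of how the command could be assembled from optional pieces. The function `build_qstat_command` and its parameters are hypothetical and not part of aiida-core; the inputs are assumed to be trusted, and no shell quoting is applied so that values such as `$USER` still expand.

```python
def build_qstat_command(user=None, wide=False, servers=()):
    """Assemble a qstat command string from optional parts (hypothetical helper)."""
    command = ["qstat", "-f"]
    if wide:
        # Lower-case -w, as used in this thread.
        command.append("-w")
    if user:
        # Restrict the query to one user's jobs.
        command.append(f"-u{user}")
    # Each extra server is passed as an @<server> destination.
    command.extend(f"@{server}" for server in servers)
    return " ".join(command)


print(build_qstat_command(user="pezhman", wide=True, servers=("server1", "server2")))
```

Keeping the server list and the `-w` flag as options would let the default behaviour stay unchanged for clusters that need neither.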
@pzarabadip Thanks for the update!
> The reason is that `qstat -f -u<username>` does not produce the same detailed output as `qstat -f`; it needs the extra `-w` flag to do so.
Interesting... which version of pbspro are you running?
It seems this lower-case `-w` flag is not documented for older versions of pbspro (?)
Or is this the same as `-W`?
> The other point concerns HPC centers with several servers, like one of ours. In this case a job may be executed on a server other than the default one, so `qstat` returns empty output for it.
Ok, I guess this setup is not very common and we've never encountered it before.
@giovannipizzi Do you think it makes sense to include some optional extra string (`@<server1>`) for the pbspro class?
@ltalirz I am using `pbs_version = 19.0.0`, which does not have `-W`.
I just tried this solution on `pbs_version = PBSPro_13.1.1.162303` and it did not work, but luckily `qstat -f` is fast enough there to do the job.
@pzarabadip Ok, it looks like v14.1 already has the `-w` flag:
https://github.com/PBSPro/pbspro/blob/v14.1.0/doc/man1/qstat.1B#L63
Is it that the open source version has it while the closed-source version doesn't?
We could add a new scheduler plugin `pbspro-open` or something like this.
@ltalirz Indeed, both versions list the `-w` flag in the help, but `PBSPro_13.1.1.162303` does not take it into account the way the other version does and only prints the list of jobs in long format.
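If the flag has to be chosen per installation, one option is to parse the reported `pbs_version` string and compare it against a cutoff. The sketch below uses hypothetical helper names, and the cutoff is deliberately left to the caller: this thread only shows that 19.0.0 honours `-w` and PBSPro_13.1.1.162303 does not, so the `(14, 1)` value in the usage lines is purely illustrative.

```python
import re


def parse_pbs_version(raw: str) -> tuple:
    """Extract the dotted numeric part of a version string as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", raw)
    if match is None:
        raise ValueError(f"no dotted version number found in {raw!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def should_use_wide_flag(raw_version: str, min_version: tuple) -> bool:
    """Return True if the parsed version is at least `min_version` (caller-supplied cutoff)."""
    return parse_pbs_version(raw_version) >= tuple(min_version)


# Assumed cutoff, for illustration only; the real one is not established here.
ASSUMED_CUTOFF = (14, 1)
print(should_use_wide_flag("19.0.0", ASSUMED_CUTOFF))
print(should_use_wide_flag("PBSPro_13.1.1.162303", ASSUMED_CUTOFF))
```

A version check like this would avoid maintaining a separate plugin, at the cost of having to determine the correct cutoff empirically.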
@pzarabadip ran into an issue where `verdi computer test` would hang.
The reason is that `verdi computer test` runs `qstat -f` on this computer, which simply produces enormous output. `qstat -f -u<username>` works just fine.
We should make sure that the username is passed here: https://github.com/aiidateam/aiida_core/blob/521b77824c0e066f5ba0f58045b98f7a0269b9ef/aiida/cmdline/commands/cmd_computer.py#L65
For comparison, see what is done here: https://github.com/aiidateam/aiida_core/blob/521b77824c0e066f5ba0f58045b98f7a0269b9ef/aiida/engine/processes/calcjobs/manager.py#L86-L91
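The proposed change could look roughly like the sketch below. It assumes the scheduler object exposes `get_feature('can_query_by_user')` and `get_jobs(user=..., as_dict=...)`, which is how I understand the linked `manager.py` lines; this should be checked against the actual code. `StubScheduler` and `query_jobs_for_test` are stand-ins invented here so that the example runs on its own.

```python
class StubScheduler:
    """Stand-in for a scheduler plugin; records how get_jobs was called."""

    def __init__(self, can_query_by_user):
        self._can_query_by_user = can_query_by_user
        self.calls = []

    def get_feature(self, name):
        if name == "can_query_by_user":
            return self._can_query_by_user
        raise KeyError(name)

    def get_jobs(self, jobs=None, user=None, as_dict=False):
        self.calls.append({"jobs": jobs, "user": user, "as_dict": as_dict})
        return {} if as_dict else []


def query_jobs_for_test(scheduler, username):
    """Query the job list, filtering by user when the scheduler supports it."""
    kwargs = {"as_dict": True}
    if scheduler.get_feature("can_query_by_user"):
        # Avoid listing every job on the cluster.
        kwargs["user"] = username
    return scheduler.get_jobs(**kwargs)


scheduler = StubScheduler(can_query_by_user=True)
query_jobs_for_test(scheduler, "pezhman")
print(scheduler.calls[-1])
```

With the user filter in place, the test would issue the fast per-user query instead of the full `qstat -f` listing on schedulers that support it.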