aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

verdi computer test does not query jobs by user #2977

Open ltalirz opened 5 years ago

ltalirz commented 5 years ago

@pzarabadip ran into an issue where verdi computer test would hang.

The reason is that verdi computer test runs qstat -f on this computer, which simply produces enormous output. qstat -f -u<username> works just fine.

We should make sure that the username is passed here: https://github.com/aiidateam/aiida_core/blob/521b77824c0e066f5ba0f58045b98f7a0269b9ef/aiida/cmdline/commands/cmd_computer.py#L65

For comparison, see what is done here: https://github.com/aiidateam/aiida_core/blob/521b77824c0e066f5ba0f58045b98f7a0269b9ef/aiida/engine/processes/calcjobs/manager.py#L86-L91
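The intended fix can be sketched with stub classes. These stubs only mimic aiida-core's Scheduler and Transport interfaces (get_jobs and whoami do exist there, but treat the exact signatures as assumptions); the point is that verdi computer test should query only the connecting user's jobs, as the calcjob manager does:

```python
# Sketch with stand-in classes, not aiida-core code.
class StubTransport:
    def whoami(self):
        """Return the remote username (placeholder value)."""
        return 'alice'

class StubScheduler:
    def get_jobs(self, user=None, as_dict=False):
        # A real scheduler would run e.g. 'qstat -f -u<user>' over the
        # transport; here we just record which user was requested.
        return {'queried_user': user}

def query_user_jobs(transport, scheduler):
    # Pass the remote username so the scheduler does not list *all* jobs
    # on the machine, which is what makes 'qstat -f' hang on busy clusters.
    return scheduler.get_jobs(user=transport.whoami(), as_dict=True)
```

With these stubs, query_user_jobs(StubTransport(), StubScheduler()) returns a query restricted to 'alice'.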

ezpzbz commented 5 years ago

Hi @ltalirz , I just found an issue with this solution. I submitted calculations and ran into the daemon pausing the process:

518  1h ago     NetworkCalculation   ⏸ Waiting        Pausing after failed transport task: update_calculation failed 5 times consecutively

I inspected the daemon log and found that it tries to get the job information but it fails:

 File "/storage/brno9-ceitec/home/pezhman/projects/git_repos/aiida-core-1.0.0b4/aiida/schedulers/plugins/pbsbaseclasses.py", line 404, in _parse_joblist_output
    raise SchedulerParsingError("I did not find the header for the first job")
aiida.schedulers.scheduler.SchedulerParsingError: I did not find the header for the first job

The reason is that qstat -f -u<username> does not produce the detailed information that qstat -f does. It needs the extra -w flag to do so. The other point regarding this issue relates to HPC centers with multiple servers, like one of ours. In that case, a job may be executed on a server other than the default one, and qstat then receives an empty log. Therefore, my current command line in https://github.com/aiidateam/aiida-core/blob/6344b8da6b65e420d8161da32dedafe2df9124d0/aiida/schedulers/plugins/pbsbaseclasses.py#L162

looks like:

command = ['qstat', '-f', '-w', '@<server1> @<server2> @<server3>']

I also tested the timings of qstat -f and qstat -f -w -u<username>, which in my case are ~10 s and ~0.1 s, respectively.
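The command construction described above can be sketched as a plain helper. This is not the aiida-core implementation; the flags come from the comment, and the username and server names are placeholders a real plugin would take from the computer configuration:

```python
# Sketch (not aiida-core code): build the full-format PBS Pro job query.
def build_joblist_command(user=None, servers=None):
    """Return the argv list for a detailed qstat query."""
    # '-w' widens the output so that '-f -u<user>' still prints the full
    # per-job records instead of a truncated listing.
    command = ['qstat', '-f', '-w']
    if user:
        command.append('-u{}'.format(user))
    if servers:
        # Query several PBS servers explicitly, e.g. '@server1 @server2',
        # for sites where jobs may land on a non-default server.
        command.append(' '.join('@{}'.format(s) for s in servers))
    return command
```

For example, build_joblist_command(user='alice') yields ['qstat', '-f', '-w', '-ualice'].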

ltalirz commented 5 years ago

@pzarabadip Thanks for the update!

The reason is that qstat -f -u does not produce the detailed information that qstat -f does. It needs the extra -w flag to do so.

Interesting... which version of pbspro are you running? It seems this lower-case -w flag is not documented for older versions of pbspro (?) Or is this the same as -W?

The other point regarding this issue relates to HPC centers with multiple servers, like one of ours. In that case, a job may be executed on a server other than the default one, and qstat then receives an empty log.

Ok, I guess this setup is not very common and we've never encountered it before. @giovannipizzi Do you think it makes sense to include some optional extra string (@<server1>) for the pbspro class?

ezpzbz commented 5 years ago

@ltalirz I am using pbs_version = 19.0.0, which does not have -W. I just tried this solution on pbs_version = PBSPro_13.1.1.162303 and it did not work, but luckily there qstat -f is fast enough to do the job.

ltalirz commented 5 years ago

@pzarabadip Ok, it looks like v14.1 already has the -w flag: https://github.com/PBSPro/pbspro/blob/v14.1.0/doc/man1/qstat.1B#L63

Is it that the open-source version has it while the closed-source version doesn't? We could add a new scheduler plugin, pbspro-open or something like this.
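A hypothetical pbspro-open plugin could simply override the joblist command of the existing PBS Pro scheduler class. Sketch with a stand-in base class (the real PbsproScheduler lives in aiida-core's scheduler plugins and its method signatures may differ between versions):

```python
class PbsproScheduler:
    """Stand-in for aiida-core's PBS Pro scheduler class (not the real one)."""

    def _get_joblist_command(self, jobs=None, user=None):
        command = ['qstat', '-f']
        if user:
            command.append('-u{}'.format(user))
        return command

class PbsProOpenScheduler(PbsproScheduler):
    """Hypothetical variant for open-source PBS Pro >= 14.1."""

    def _get_joblist_command(self, jobs=None, user=None):
        command = super()._get_joblist_command(jobs=jobs, user=user)
        # Insert '-w' right after '-f' so that '-f -u<user>' still prints
        # full job details on versions that support the wide-format flag.
        command.insert(2, '-w')
        return command
```

Sites running older closed-source releases without a working -w would keep the existing plugin unchanged.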

ezpzbz commented 5 years ago

@ltalirz Indeed, both versions list the -w flag in their help output, but PBSPro_13.1.1.162303 does not honor it the way the other version does and only prints the list of jobs in long format.