aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Calculation job can be submitted twice if first submission succeeds but communication of result from scheduler to AiiDA times out #3404

Open Zeleznyj opened 5 years ago

Zeleznyj commented 5 years ago

I've encountered an issue where sometimes a calculation will show as finished in aiida, but the actual calculation on the remote computer is still running. Aiida will retrieve the files and run the parser without showing any error. This happened with the ssh transport and the slurm scheduler. I'm not sure if the problem is necessarily related to slurm though, since we are not using other schedulers much at the moment. The calculations use our own FPLO calculation plugins. It is possible that the issue is somehow related to some problem in the plugins, but to me it seems like a problem with aiida, since everything on our side is working fine. The calculation is submitted correctly and finishes correctly; the only problem is that the results are retrieved before the remote calculation is finished. This thus looks like a problem with parsing the queue status. The problem happens randomly: when we resubmit a calculation, it will usually finish fine.

I noticed the problem after checking out the develop branch a couple of days ago, but most likely the problem also existed before, when I was using version 1.0.0b6.

I can try to include more details, but I'm not sure where to start with debugging this.

sphuber commented 5 years ago

So just to be sure, you are using the standard slurm plugin that ships with aiida-core?

Are all the files you expect present in the retrieved folder? You can check by first getting the corresponding pk from verdi process show; the retrieved node should be listed under outputs. Then with verdi node repo ls you can show its contents. Are the files in there complete, or do they contain only partial output? Could you also share the output of verdi node attributes <calcjob_pk>?

Zeleznyj commented 5 years ago

Yes, I'm using the default slurm plugin.

All files in the retrieved node are present, but they contain only partial output. When I use verdi calcjob gotocomputer to check the remote, the files there contain the full output; however, the retrieved files are incomplete since they were retrieved before the calculation finished.

Here is the output of the verdi node attributes command:

{
    "append_text": "",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_status": 0,
    "import_sys_environment": true,
    "job_id": "176740",
    "last_jobinfo": "{\"job_id\": \"176740\", \"wallclock_time_seconds\": 1456, \"title\": \"aiida-5941\", \"num_machines\": 1, \"detailedJobinfo\": null, \"queue_name\": \"newnodes\", \"num_mpiprocs\": 1, \"allocated_machines_raw\": \"node98\", \"submission_time\": {\"date\": \"2019-10-09T14:05:39.000000\", \"timezone\": null}, \"job_owner\": \"zeleznyj\", \"annotation\": \"None\", \"requested_wallclock_time_seconds\": 259200, \"raw_data\": [\"176740\", \"R\", \"None\", \"node98\", \"zeleznyj\", \"1\", \"1\", \"node98\", \"newnodes\", \"3-00:00:00\", \"24:16\", \"2019-10-09T14:05:40\", \"aiida-5941\", \"2019-10-09T14:05:39\"], \"job_state\": \"running\", \"dispatch_time\": {\"date\": \"2019-10-09T14:05:40.000000\", \"timezone\": null}}",
    "max_memory_kb": 3000000,
    "max_wallclock_seconds": 259200,
    "mpirun_extra_params": [],
    "output_filename": "fplo_out",
    "parser_name": "fplo",
    "prepend_text": "",
    "process_label": "FploCalculation",
    "process_state": "finished",
    "queue_name": "newnodes",
    "remote_workdir": "/zfs_data/zeleznyj/aiida_pc109/5f/1b/8ab9-7abe-4680-8b1b-7ba964bb5946",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "fplo_out",
        "=.dens",
        "=.in",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-10-09T12:30:12.371370+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0b6",
        "plugin": "0.5.5"
    },
    "withmpi": false
}
sphuber commented 5 years ago

Is there anything peculiar in the content of the scheduler output files? You can display those with verdi calcjob outputcat <PK> _scheduler-stderr.txt and verdi calcjob outputcat <PK> _scheduler-stdout.txt.

Does this happen for all calculations of all types? Or only of this type? Or even only for some calculations of this type?

It would be good, once you have no calculations running anymore, to set verdi config logging.aiida_loglevel DEBUG followed by verdi daemon restart --reset. Then submit one calculation, run verdi daemon logshow, copy the output printed there while that calculation is running to a file, and paste it here.

Zeleznyj commented 5 years ago

The scheduler output files are empty.

We have seen this happening with one calculation type only and with one remote computer. However, this doesn't have to mean anything, since the vast majority of the calculations we have run so far were with this calculation type and remote computer.

It seems to happen randomly, with most calculations having no issue. If I repeat exactly the same calculation where the problem occurred, it will usually finish fine.

I will try to turn on the debugging and will post the output once the problem occurs.

Zeleznyj commented 5 years ago

I have managed to reproduce the issue again, so I can provide some more info. The issue also happens with a different remote computer, with the pbspro scheduler and with a different calculation plugin. We used a slightly modified version of the pbspro scheduler plugin, but I don't think the issue is related to the scheduler, since it seems to be exactly the same with the slurm scheduler: aiida at some point starts to retrieve the calculation even though it is still running or in the queue.

I was unable to reproduce the issue at first; it seems to be quite random. However, I've now run hundreds of calculations and can see the issue in some of them. Because of this, the whole log file is massive, so I'm uploading only the parts relevant to some of the calculations where the issue occurred. I can provide the whole file if necessary, but it is over 1 GB. Even the cut parts are still quite large, so I put them in Dropbox. The files are named according to the pk of the process for which the issue occurred.

I often see the following error in verdi process report

+-> ERROR at 2019-10-17 22:42:48.875613+00:00
 | Traceback (most recent call last):
 |   File "/jakzel/aiida/aiida_code/aiida-core/aiida/engine/utils.py", line 176, in exponential_backoff_retry
 |     result = yield coro()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
 |     yielded = self.gen.throw(*exc_info)
 |   File "/jakzel/aiida/aiida_code/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 175, in do_update
 |     job_info = yield cancellable.with_interrupt(update_request)
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
 |     yielded = self.gen.throw(*exc_info)
 |   File "/jakzel/aiida/aiida_code/aiida-core/aiida/engine/utils.py", line 93, in with_interrupt
 |     result = yield wait_iterator.next()  # pylint: disable=stop-iteration-return
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/jakzel/aiida/virtualenvs/aiida/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "<string>", line 3, in raise_exc_info
 | SchedulerError: Error during squeue parsing (_parse_joblist_output function)

but it seems to be unrelated since it also happens with calculations that finish fine, and not all calculations that have the issue have this error.

Zeleznyj commented 4 years ago

Are there any things I could try to fix this issue or to investigate further?

It seems that we only see this issue on one particular computer (though it's hard to be sure because of the randomness), which is running Debian 10, whereas otherwise we have been using computers running Ubuntu. Is it possible that the issue is related to the operating system, or to the version of some package like RabbitMQ? I can try to experiment with this, but it's quite time-consuming, because it's necessary to run a lot of calculations to verify whether the issue is there.

sphuber commented 4 years ago

Sorry for not responding earlier @Zeleznyj, I have been swamped with the release of aiida-core==1.0.0 next week. I will try to have a look after that, but I am afraid that if it is already difficult for you to reproduce, it will be even more so for me. I will have to base a potential cause on speculation and then see if there is any way to probe/test this. I will think about it and come back to you, hopefully with something that you can test.

ltalirz commented 4 years ago

@danieleongari at LSMO seems to be running into the same issue, i.e. this seems confirmed. We're investigating to get an idea of what could be going wrong...

ltalirz commented 4 years ago

@Zeleznyj thanks for your report.

Aiida at some points starts to retrieve the calculation even though the calculation is still running or in the queue.

Just to be absolutely sure (since this is important): You are definitely positive that the calculations were still running (also shown as such in slurm), rather than the problem being that the calculations were finished correctly but then AiiDA, for some reason, retrieved the files incompletely? That would indeed be very helpful information.

ltalirz commented 4 years ago

If the issue turns out to be related to the scheduler receiving too many requests and AiiDA misinterpreting its output (or it hanging for some reason), it may help to increase the safe_interval using verdi computer configure ssh <computername> (see the current value with verdi computer configure show <computername>). The default value is 5s (per daemon worker), which could lead to issues if you have a lot of workers active.

Of course, this would just be a workaround, not a solution of the underlying problem.

P.S. The safe_interval limits the overall rate of SSH connections per worker, which is not strictly necessary here. It would be enough to just increase the interval between polls of the scheduler - there is no CLI for it yet (we should add this, #3541), but you can do it from the verdi shell:

In [1]: c = load_computer(1)

In [2]: c.get_minimum_job_poll_interval()
Out[2]: 10.0

In [3]: c.set_minimum_job_poll_interval(30)
Zeleznyj commented 4 years ago

The issue is definitely that the calculations are still running and are shown as such in slurm or pbspro when aiida retrieves them. I've seen it hundreds of times now.

I have in fact increased the safe interval to 60s for one of the computers we use. This didn't seem to have any influence on the issue as far as I can tell; it is definitely still present. This is a computer that uses pbspro, and the qstat command sometimes responds very slowly. Interestingly, it seems that the qstat command is only slow when I have aiida calculations running on this computer, though this may be a coincidence. Is it possible that even with the safe_interval increased to 60s, aiida is flooding the pbspro server with too many requests? It seems strange to me, since this is a large supercomputer with many users and hundreds of jobs in the queue.

I can try to increase the safe_interval even more.

The other computer we use is a smaller cluster, which uses slurm. There slurm is very responsive, but sometimes the ssh connection to the cluster is unstable and can drop out suddenly.

Perhaps I should also mention that the issue persists after upgrading to aiida v1.0.0 and that I'm using python2.

ltalirz commented 4 years ago

Thanks @Zeleznyj, this is really helpful to know and we absolutely need to look into this.

I have in fact increased the safe interval to 60s for one of the computers we use. This didn't seem to have any influence on the issue as far as I can say.

Hm... can you try what I suggested above (to increase the minimum job poll interval) instead (increase e.g. to 120s - or some value that is reasonable given the typical duration of your calculations)? I happen to know that @sphuber , who has not been experiencing this issue in ~5000 calculations, has increased this value.

@sphuber I wonder: what currently happens when you have a very low minimum_job_poll_interval and a high safe_interval (say, 10x higher)? When the next SSH connection is allowed, is it possible that AiiDA tries to poll the scheduler 10 times within the same connection?

Also, is the status obtained from the scheduler shared between workers?

sphuber commented 4 years ago

@sphuber I wonder: what currently happens when you have a very low minimum_job_poll_interval and a high safe_interval (say, 10x higher)? When the next SSH connection is allowed, is it possible that AiiDA tries to poll the scheduler 10 times within the same connection? Also, is the status obtained from the scheduler shared between workers?

No, there is no sharing of information between workers on this level. Each worker has an instance of a JobManager which takes care of polling the scheduler. The processes run by a daemon worker will request an update of their status from this JobManager, which will poll the scheduler, respecting the minimum_job_poll_interval set on the given computer. To do so, it will first request a transport from the TransportQueue. The latter ensures that the safe_interval is respected. So in the worst case, a new scheduler update will occur every safe_interval + minimum_job_poll_interval seconds.
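
To make the timing explicit, here is a minimal sketch (illustrative only, not the actual engine code) of how the two intervals add up in the worst case:

def worst_case_update_delay(safe_interval, minimum_job_poll_interval):
    """Upper bound on the time between two scheduler updates for one worker.

    The JobManager first waits until `minimum_job_poll_interval` has elapsed
    since its previous poll of this computer, and only then requests a
    transport from the TransportQueue, which may hold it back for up to
    `safe_interval` more before the SSH connection is opened.
    """
    return minimum_job_poll_interval + safe_interval

print(worst_case_update_delay(safe_interval=5.0, minimum_job_poll_interval=10.0))  # 15.0 with the default values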

Zeleznyj commented 4 years ago

I have tried to increase the minimum_job_poll_interval to 120 s and I still see the issue.

I have also reproduced the issue on a different computer, which is running on Ubuntu 18.04 instead of Debian.

sphuber commented 4 years ago

Could you please run verdi config logging.aiida_loglevel INFO and restart the daemon? This should increase the logging level and print messages in the daemon log when the engine starts the retrieval of a certain calculation job. When you encounter another problematic run, try to find its pk in the log. There should be messages stating that it is starting to retrieve. Please compare the timestamps of those log messages with the output of ls on the remote working directory to confirm that retrieval starts before the output files are completely written.

Zeleznyj commented 4 years ago

I will try this, but I'm sure that the retrieval starts before the output files are completely written, since I routinely have to kill jobs that are still running on the remote computer, or are even still in the queue, but are already finished within aiida.

greschd commented 4 years ago

As an additional data point, I think I've had this happen when the input given was "wrong" in some sense (in my case, an empty k-points mesh for a QE pw.x calculation). I didn't follow up on the error because it was just wrong input, but maybe this means that such a state can happen if there is a curious exception while submitting the calculation?

@Zeleznyj are these calculations retrieved immediately after submission, or only after some time?

sphuber commented 4 years ago

@Zeleznyj I am working on a branch that will improve the accounting with SLURM and add logging that will hopefully help in pinpointing the problem. You are using SLURM, correct? I will hopefully be able to send it to you tomorrow for you to try out.

sphuber commented 4 years ago

@Zeleznyj please check out this branch from my fork, restart the daemon and continue launching. When you spot a problem again, please report here the output of verdi process report, verdi node attributes and the contents of the daemon log for the pk of the calculation that was retrieved too early. Thanks a lot

Zeleznyj commented 4 years ago

I will try.

I'm using two remote computers, one is using SLURM and the other PBSPRO. I see the issue on both so I can test now with SLURM.

@greschd The calculations are retrieved seemingly randomly after some time, and in my case this is not connected to wrong input.

Zeleznyj commented 4 years ago

@sphuber I tried running calculations on your branch, but all of the calculations I run get stuck at 'Waiting for transport task: accounting stage'. This is the corresponding error from the daemon log:

Traceback (most recent call last):
  File "/home/kuba/bin/aiida/aiida/engine/utils.py", line 176, in exponential_backoff_retry
    result = yield coro()
  File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 239, in do_accounting
    job_accounting = scheduler.get_job_accounting(job_id)
  File "/ho  File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 239, in do_accounting
    job_accounting = scheduler.get_job_accounting(job_id)
  File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 757, in get_job_accounting
    return self._parse_detailed_jobinfo(stdout)
  File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 773, in _parse_detailed_jobinfo
    self.logger.warning('scheduler job accounting details is empty so probably accounting is not enabled.')
NameError: global name 'self' is not defined
sphuber commented 4 years ago

Sorry about that, please go into aiida.schedulers.plugins.slurm and on line 759 remove the @staticmethod decorator and then add self as a method argument, i.e. it should become def _parse_detailed_jobinfo(self, jobinfo_details). Then restart the daemon and try again. This already shows me though that unfortunately the sacct command on your SLURM scheduler is not enabled. This makes it difficult to get the additional information we need.
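
For reference, a sketch of that change in aiida/schedulers/plugins/slurm.py on that branch (the method body is elided here):

# before: a staticmethod, so the `self.logger.warning(...)` call inside raises NameError
# @staticmethod
# def _parse_detailed_jobinfo(jobinfo_details):
#     ...

# after: a regular method, so `self.logger` is available
def _parse_detailed_jobinfo(self, jobinfo_details):
    ...  # body unchanged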

Zeleznyj commented 4 years ago

Here is the info for a job where the issue occurred:

verdi node report:

*** 14720: None
*** (empty scheduler output file)
*** Scheduler errors:
forrtl: severe (24): end-of-file during read, unit -4, file /proc/16945/fd/0
Image              PC                Routine            Line        Source             
fplo18.00-53-x86_  00000000010FE725  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000929CB2  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000B0F4BF  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D4EAD1  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D5762A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406896  Unknown               Unknown  Unknown
fplo18.00-53-x86_  00000000011EFAB0  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406757  Unknown               Unknown  Unknown

*** 2 LOG MESSAGES:
+-> WARNING at 2019-11-21 15:48:17.173206+00:00
 | Job info not present, assuming it was completed
+-> ERROR at 2019-11-21 15:47:35.221711+00:00
 | Traceback (most recent call last):
 |   File "/home/kuba/bin/aiida/aiida/engine/utils.py", line 176, in exponential_backoff_retry
 |     result = yield coro()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1069, in run
 |     yielded = self.gen.send(value)
 |   File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 125, in do_submit
 |     raise Return(execmanager.submit_calculation(node, transport, calc_info, script_filename))
 |   File "/home/kuba/bin/aiida/aiida/engine/daemon/execmanager.py", line 284, in submit_calculation
 |     job_id = scheduler.submit_from_script(workdir, script_filename)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/scheduler.py", line 408, in submit_from_script
 |     return self._parse_submit_output(retval, stdout, stderr)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 473, in _parse_submit_output
 |     'stdout={}\nstderr={}'.format(retval, stdout, stderr))
 | SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
 | sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
 | 

verdi node attributes:

{
    "append_text": "",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_status": 0,
    "import_sys_environment": true,
    "job_accounting": {},
    "job_id": "190074",
    "max_memory_kb": 3000000,
    "max_wallclock_seconds": 259200,
    "mpirun_extra_params": [],
    "output_filename": "fplo_out",
    "parser_name": "fplo",
    "prepend_text": "",
    "process_label": "FploCalculation",
    "process_state": "finished",
    "queue_name": "bignodes",
    "remote_workdir": "/zfs_data/zeleznyj/aiida/9f/15/8c21-ce91-4189-a831-a49806bf0423",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "fplo_out",
        "=.dens",
        "=.in",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-11-21T15:48:17.185719+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0",
        "plugin": "0.7.3"
    },
    "withmpi": false
}

I also attach the relevant part of the daemon log: 14720.log

I don't understand what the scheduler error means. We see this sometimes; I'm not sure if it is related, but I have seen the same issue also in cases where this error was not present. Despite this error, the calculation is running fine on the remote. At this time it is still running, so it is definitely not the case that aiida is just retrieving incomplete files.

sphuber commented 4 years ago

The scheduler error should not pose a problem, especially given that you have seen this with other jobs where the retrieval is fine. The engine simply tried to submit the job by calling sbatch but it failed. However, the engine caught the problem, simply rescheduled it, and the second try was successful. The problem comes afterwards. You can see the warning message that I put in the report: it says that when the engine queried the status of all jobs, this particular one was missing. This is not necessarily problematic, as it is also what happens for calculations that actually complete, and it is when we assume that the job must have completed. The problem, then, seems to be that the job id for this job was missing from the scheduler update response, but for the wrong reasons.

Just to be sure, the message we see in the scheduler stderr:

forrtl: severe (24): end-of-file during read, unit -4, file /proc/16945/fd/0

do you also see this in jobs that run just fine? As in, we should not have to worry about this? I take it it comes from the code itself that you are running.

The only thing I can think of now to further debug this is to add something in the prepend and append texts of this code to echo the job id. For example

echo $SLURM_JOBID > prepend.job_id.txt

and

echo $SLURM_JOBID > append.job_id.txt

in the prepend and append texts, respectively. This will write the job id to the two files when the job actually runs, which we can then check later for a job with this failure. This is just to cross-verify that the job_id that gets stored in the node's attributes after submission is the same one that actually runs. Maybe a switch happens for some reason, such that the engine actually checks the wrong job id. You can add these lines to the append and prepend text on a per-calculation basis through the options (so you don't have to configure a new code), but that choice is up to you.

As a last thing, I would maybe add one line of debugging in the job manager to print the actual response of the scheduler when we request the state of the jobs. I will see if I can add this to the branch tomorrow and let you know so you can try it out.

Thanks a lot for all the detailed reports so far and your patience :+1:

ltalirz commented 4 years ago

I would maybe add one line of debugging in the job manager to print the actual response of the scheduler when we request the state of the jobs.

I think that's crucial to debugging this issue (and perhaps we could even make it possible to get this at DEBUG level in aiida-core)

sphuber commented 4 years ago

I added another commit to my branch fix_2431_scheduler_parsing with a DEBUG log message containing the raw output returned by SLURM when calling squeue. Could you please check it out, set logging to DEBUG, restart the daemon (ideally with a single worker) and attempt to reproduce it? The daemon log should now contain rows like:

11/22/2019 04:02:11 PM <22393> aiida.scheduler.slurm: [DEBUG] Results of `_get_joblist_command`:
retval: 0
stderr: 
stdout: 
18300045^^^PD^^^Priority^^^n/a^^^sphuber^^^1^^^12^^^(Priority)^^^normal^^^10:00^^^0:00^^^N/A^^^aiida-116464^^^2019-11-22T16:00:44
18300091^^^PD^^^Priority^^^n/a^^^sphuber^^^1^^^12^^^(Priority)^^^normal^^^10:00^^^0:00^^^N/A^^^aiida-116468^^^2019-11-22T16:01:58

Hopefully we can correlate the job id of a failed calculation to the output we see here, i.e. whether it is there before and after retrieval started.

Also, please launch your calculations with the following options in the inputs:

inputs = {
....
    'metadata': {
        'options': {
            ...
            'prepend_text': 'echo $SLURM_JOB_ID > prepend.jobid',
            'append_text': 'echo $SLURM_JOB_ID > append.jobid',
        }
    }
}
Zeleznyj commented 4 years ago

I will try this out.

Regarding the error in stderr, I now think it may be related. I did not think it was related before because the error is not present in all calculations which have the issue. However, I have now gone through all of the previous calculations and found that none of the calculations that finish correctly have the error. So maybe there are two issues, or two different origins for the issue. I will continue testing.

Zeleznyj commented 4 years ago

It seems something is amiss with the jobids. I've encountered another calculation where there's an error in stderr, but the calculation finishes correctly. The jobid in prepend.jobid is 191160, which agrees with the jobid in the attributes, whereas the jobid in append.jobid is 191147.

I've now checked that in the previous case I mentioned, the jobid of the running job was also different from the one in the attributes.

Here are the attributes:

{
    "append_text": "echo $SLURM_JOB_ID > append.jobid",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_message": "Calculation did not produce all expected output files.",
    "exit_status": 100,
    "import_sys_environment": true,
    "job_accounting": {},
    "job_id": "191160",
    "max_memory_kb": 3000000,
    "max_wallclock_seconds": 259200,
    "mpirun_extra_params": [],
    "output_filename": "fplo_out",
    "parser_name": "fplowannier",
    "prepend_text": "echo $SLURM_JOB_ID > prepend.jobid",
    "process_label": "FploWannierCalculation",
    "process_state": "finished",
    "queue_name": "bignodes",
    "remote_workdir": "/zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_machines": 1,
        "num_mpiprocs_per_machine": 1
    },
    "retrieve_list": [
        "fplo_out",
        "+hamdata",
        "+band",
        "+wanband",
        "+wanbandtb",
        "+points",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-11-25T11:53:17.083456+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0",
        "plugin": "0.7.3"
    },
    "withmpi": false
}

Here is the report:

*** 17432: None
*** (empty scheduler output file)
*** Scheduler errors:
forrtl: severe (24): end-of-file during read, unit 57, file /zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20/+wancoeff
Image              PC                Routine            Line        Source
fplo18.00-53-x86_  00000000010FE725  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000CB328A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000C613FA  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D50362  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D5762A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406896  Unknown               Unknown  Unknown
fplo18.00-53-x86_  00000000011EFAB0  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406757  Unknown               Unknown  Unknown
forrtl: severe (24): end-of-file during read, unit 57, file /zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20/+wancoeff
Image              PC                Routine            Line        Source
fplo18.00-53-x86_  00000000010FE725  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000CB328A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000C613FA  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D50362  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000D5762A  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406896  Unknown               Unknown  Unknown
fplo18.00-53-x86_  00000000011EFAB0  Unknown               Unknown  Unknown
fplo18.00-53-x86_  0000000000406757  Unknown               Unknown  Unknown

*** 4 LOG MESSAGES:
+-> ERROR at 2019-11-25 11:53:33.786617+00:00
 | parser returned exit code<100>: Calculation did not produce all expected output files.
+-> ERROR at 2019-11-25 11:53:33.782464+00:00
 | Found files '[u'+points', u'_scheduler-stderr.txt', u'_scheduler-stdout.txt', u'fplo_out']', expected to find '[u'fplo_out', '+hamdata']'
+-> WARNING at 2019-11-25 11:53:17.075818+00:00
 | Job info not present, assuming it was completed
+-> ERROR at 2019-11-25 11:51:43.678288+00:00
 | Traceback (most recent call last):
 |   File "/home/kuba/bin/aiida/aiida/engine/utils.py", line 176, in exponential_backoff_retry
 |     result = yield coro()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1055, in run
 |     value = future.result()
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/concurrent.py", line 238, in result
 |     raise_exc_info(self._exc_info)
 |   File "/home/kuba/virtualenvs/aiidapy/local/lib/python2.7/site-packages/tornado-4.5.3-py2.7-linux-x86_64.egg/tornado/gen.py", line 1069, in run
 |     yielded = self.gen.send(value)
 |   File "/home/kuba/bin/aiida/aiida/engine/processes/calcjobs/tasks.py", line 125, in do_submit
 |     raise Return(execmanager.submit_calculation(node, transport, calc_info, script_filename))
 |   File "/home/kuba/bin/aiida/aiida/engine/daemon/execmanager.py", line 284, in submit_calculation
 |     job_id = scheduler.submit_from_script(workdir, script_filename)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/scheduler.py", line 408, in submit_from_script
 |     return self._parse_submit_output(retval, stdout, stderr)
 |   File "/home/kuba/bin/aiida/aiida/schedulers/plugins/slurm.py", line 473, in _parse_submit_output
 |     'stdout={}\nstderr={}'.format(retval, stdout, stderr))
 | SchedulerError: Error during submission, retval=1
 | stdout=
 | stderr=sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
 | sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
 |

I forgot to change the log level to DEBUG, so I don't have the additional debug output.

sphuber commented 4 years ago

Ok, very good, this is what I was expecting. Clearly the job id that is returned by SLURM to the engine after the submit call is different from the one that is actually eventually run by SLURM. This is why the engine incorrectly thinks the job is done and starts retrieving it. Since it has the wrong job id, when it asks for the status the scheduler says it does not exist, and so the engine assumes it has already completed, starting the retrieval. So the question now is why the engine gets the wrong job id. There are two possibilities:

  1. The engine's parsing of the SLURM response to the submit call is incorrect and it parses the wrong job id
  2. SLURM returns the incorrect job id

My idea is that it indeed has to do with the submit error we are seeing. Maybe the submission fails and the engine incorrectly parses a job id regardless. We will have to see what the exact response from SLURM is when the engine calls sbatch. I will see a bit later what you should do to determine this and come back to you.
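
To illustrate possibility 1, here is a toy version of what parsing the sbatch response amounts to; the real _parse_submit_output is more involved, and the regex below is an assumption based on the usual 'Submitted batch job <id>' output of sbatch:

import re

def parse_sbatch_stdout(stdout):
    """Toy parser: extract the job id from output such as 'Submitted batch job 191160'.

    Returns None if no job id is found; a bug at this level (or garbled output
    from the scheduler) could yield a job id that never actually runs.
    """
    match = re.search(r'Submitted batch job\s+(\d+)', stdout)
    return match.group(1) if match else None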

Zeleznyj commented 4 years ago

Isn't it possible that the first submit in fact succeeds, but aiida thinks it has failed? Then it submits again, which immediately crashes.

I've tested this out and I think this is possible. I have submitted two identical jobs directly on the remote and the second one indeed crashes with the same error message in the stderr. This would also explain why the calculation continues even though there is a severe error in stderr.

sphuber commented 4 years ago

I don't see why submitting the same submit script twice would fail. If you have an _aiidasubmit.sh and you call sbatch _aiidasubmit.sh twice, you are saying that fails? What is the error message in that case?

Here is what I think is happening:

  1. Engine calls submit on the script and SLURM return an error but it actually submitted it anyway (this is job 191147 and it actually starts running on the remote, it will write this to the prepend.jobid)
  2. The engine got an exception on submit and so it tries again
  3. The second submit succeeds and the engine records the job id 191160 that is returned by SLURM
  4. The second job starts running and writes 191160 to prepend.jobid and then calls the actual code which fails because the original input files have already been touched by 191147. This is the error you see in the stderr, i.e. forrtl: severe (24): end-of-file during read, unit 57, file /zfs_data/zeleznyj/aiida/43/9b/417d-5312-44d2-9e88-aca2a05edd20/+wancoeff. The job crashes there. This also explains why append.jobid still shows 191147 because the second job crashes before it gets the chance to override it.
  5. The second job 191160 now no longer exists but the original one 191147 is happily running and will eventually finish.
  6. However, the engine thinks the actual job id is 191160 and when it checks for the status it is no longer there, so it assumes it is done and starts retrieving the files.

To confirm, could you try and figure out the "submission time" of job 191147? My bet is that this will correspond very closely to 2019-11-25 11:51:43.678288+00:00 which is the timestamp that the engine got the exception from the SLURM submit call. Normally you can get the submission time of a job from scontrol show jobid -dd 191147.

So the question now is why SLURM returns an exception on the submit call while in fact it did manage to submit it. I don't have an idea how to test this further, but maybe you can already add an additional

Zeleznyj commented 4 years ago

The scenario you are describing is precisely what I had in mind. If I call sbatch _aiidasubmit.sh twice, it gets submitted successfully both times. The first job will continue running and the second one will crash immediately with the same error in stderr as I posted.

I don't know how to get the submission time. The scontrol command works, I think, only for running jobs, and this job is now finished. As far as I know, the command that should be used for this is sacct, but that doesn't work for me on the remote; either something is misconfigured or root access is needed.

Unfortunately, this doesn't explain all the cases of the issue for me so there are probably two separate issues. I also see that aiida sometimes retrieves calculations that are still running in cases when there is no submission error. I also see this behavior on a computer with pbspro. I will continue testing.

zhubonan commented 4 years ago

Hi, just to add that I am having a similar issue with the SLURM scheduler.

In my case, the job id was recorded correctly. I suspect a dropped network connection may be a factor. I looked at the system log, and aiida decided the calculation was finished just after I unplugged my laptop and started to move with it to a different room. In fact, I had two calculations running at the same time and both were marked as finished. I checked slurm.py, but it seems that the return code of the squeue command is checked during parsing, so a lost network connection should be handled...

In general, I think schedulers just may not be that reliable. I had problems with PBSPro on another computer in the past. Would it be possible for aiida to do a confirmation check on a missing calculation, rather than deciding it is finished immediately?
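
A minimal sketch of the kind of confirmation check being suggested here (purely illustrative, not AiiDA's actual logic): only treat a job as finished once it has been absent from the scheduler's job list for two consecutive polls.

# Track how many consecutive polls each job id has been missing from.
missing_counts = {}

def job_presumed_done(job_id, joblist):
    """Return True only after `job_id` has been missing from two polls in a row."""
    if job_id in joblist:
        missing_counts[job_id] = 0
        return False
    missing_counts[job_id] = missing_counts.get(job_id, 0) + 1
    return missing_counts[job_id] >= 2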

BTW I did not notice any problem with the 1.0.0b versions (or maybe I was lucky).

Zeleznyj commented 4 years ago

I have encountered another case of the issue, which is definitely not caused by the submission error and the job being submitted twice. In this case the job was in the queue for several hours and then it was retrieved by aiida, even though it is still in the queue. In this case the jobid is correct. This is running on a different computer that uses pbspro, but we see a similar behavior also with Slurm, or at least we definitely did in the past.

This is something we see quite frequently. It usually happens for several jobs at the same time, but not for all. I think it happens not only for queued jobs but also for running jobs; however, I'm not completely sure. I will verify.

I'm attaching the attributes output:

PK: 142009
{
    "account": "OPEN-16-36",
    "append_text": "",
    "custom_scheduler_commands": "",
    "environment_variables": {},
    "exit_message": "Calculation did not produce all expected output files.",
    "exit_status": 100,
    "import_sys_environment": true,
    "job_accounting": {},
    "job_id": "9671041.isrv5",
    "last_jobinfo": "{\"job_id\": \"9671041.isrv5\", \"title\": \"aiida-142009\", \"num_machines\": 2, \"num_cpus\": 48, \"queue_name\": \"qprod\", \"num_mpiprocs\": 48, \"RequestedWallclockTime\": 12839, \"raw_data\": {\"comment\": \"Not Running: Host set host=r26u17n698 has too few free resources\", \"keep_files\": \"n\", \"rerunable\": \"False\", \"eligible_time\": \"04:03:35\", \"resource_list.ncpus\": \"48\", \"qtime\": \"Thu Nov 28 11:17:19 2019\", \"resource_list.walltime\": \"03:33:59\", \"resource_list.select\": \"2:mpiprocs=24:ncpus=24\", \"resource_list.nice\": \"15\", \"join_path\": \"n\", \"mtime\": \"Thu Nov 28 15:21:08 2019\", \"account_name\": \"OPEN-16-36\", \"substate\": \"10\", \"priority\": \"0\", \"resource_list.mpiprocs\": \"48\", \"resource_list.fairshare\": \"979489\", \"job_state\": \"Q\", \"etime\": \"Thu Nov 28 11:17:19 2019\", \"checkpoint\": \"u\", \"resource_list.nodect\": \"2\", \"job_owner\": \"zeleznyj@login4.head.smc.salomon.it4i.cz\", \"job_name\": \"aiida-142009\", \"variable_list\": \"LMOD_SETTARG_CMD=:,LMOD_DIR=/apps/all/Lmod/7.7.7/lmod/lmod/libexec,CLUSTERNAME=SALOMON,PBS_O_HOME=/home/zeleznyj,LMOD_SETTARG_FULL_SUPPORT=no,LESSOPEN=||/usr/bin/lesspipe.sh %s,PBS_O_LOGNAME=zeleznyj,SSH_CLIENT=147.231.126.109 53028 22,LMOD_PKG=/apps/all/Lmod/7.7.7/lmod/lmod,PBS_O_LANG=en_US.UTF-8,USER=zeleznyj,MODULEPATH_ROOT=/apps/all/Lmod/7.7.7/modulefiles,PATH=/usr/local/bin:/usr/bin:/opt/pbs/default/bin:/opt/puppetlabs/bin:/home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-devel/python_scripts:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin:/home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-devel/python_scripts:/home/zeleznyj/bin,PS1=\\\\\\\\[\\\\\\\\e[38;5;35m\\\\\\\\]salomon:\\\\\\\\[\\\\\\\\033[0;30m\\\\\\\\]$(  _dir_chomp \\\\\\\"$(pwd)\\\\\\\" 20)\\\\\\\\[\\\\\\\\033[0;30m\\\\\\\\]\\\\\\\\$ \\\\\\\\[\\\\\\\\033[00m\\\\\\\\],LANG=en_US.UTF-8,SHELL=/bin/bash,XAUTHORITY=/home/zeleznyj/.Xauthority,SHLVL=2,LMOD_PREPEND_BLOCK=normal,HISTSIZE=,LMOD_sys=Linux,PBS_O_WORKDIR=/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b8e7-627bb3c1c808,PBS_O_PATH=/usr/local/bin:/usr/bin:/opt/pbs/default/bin:/opt/puppetlabs/bin:/home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-devel/python_scripts:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin:/home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-devel/python_scripts:/home/zeleznyj/bin,MANPATH=/apps/all/Lmod/7.7.7/lmod/lmod/share/man::/opt/pbs/default/share/man:/opt/puppetlabs/puppet/share/man:/opt/pbs/default/share/man,HOME=/home/zeleznyj,MODULESHOME=/apps/all/Lmod/7.7.7/lmod/lmod,PROMPT_DIRTRIM=3,PBS_O_SYSTEM=Linux,SHOST=login4,XDG_RUNTIME_DIR=/run/user/5297,LMOD_arch=x86_64,LMOD_VERSION=7.7.7,SELINUX_ROLE_REQUESTED=,MAIL=/var/spool/mail/zeleznyj,BASH_FUNC_module()=() {  eval $($LMOD_CMD bash \\\\\\\"$@\\\\\\\") && eval $(${LMOD_SETTARG_CMD:-:} -s 
sh)\\n},BASH_ENV=/apps/all/Lmod/7.7.7/lmod/lmod/init/bash,XDG_SESSION_ID=47238,PBS_O_MAIL=/var/spool/mail/zeleznyj,_=/opt/pbs/default/bin/qsub,MODULEPATH=/apps/modules/bio:/apps/modules/cae:/apps/modules/chem:/apps/modules/compiler:/apps/modules/data:/apps/modules/debugger:/apps/modules/devel:/apps/modules/geo:/apps/modules/lang:/apps/modules/lib:/apps/modules/math:/apps/modules/mpi:/apps/modules/numlib:/apps/modules/OS:/apps/modules/perf:/apps/modules/phys:/apps/modules/python:/apps/modules/system:/apps/modules/toolchain:/apps/modules/tools:/apps/modules/vis:/apps/modules/vizserv,LMOD_FULL_SETTARG_SUPPORT=no,PBS_O_SHELL=/bin/bash,LMOD_CMD=/apps/all/Lmod/7.7.7/lmod/lmod/libexec/lmod,OLDPWD=/home/zeleznyj,HOSTNAME=login4,SELINUX_LEVEL_REQUESTED=,HISTCONTROL=ignoredups,XDG_DATA_DIRS=/home/zeleznyj/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share,PWD=/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b8e7-627bb3c1c808,SELINUX_USE_CURRENT_RANGE=,BASH_FUNC_ml()=() {  eval $($LMOD_DIR/ml_cmd \\\\\\\"$@\\\\\\\")\\n},LMOD_PACKAGE_PATH=/apps/all/Lmod/7.7.7/lmod/etc,LOGNAME=zeleznyj,SSH_CONNECTION=147.231.126.109 53028 195.113.250.134 22,LMOD_COLORIZE=yes,PBS_O_QUEUE=qprod,PBS_O_HOST=login4.head.smc.salomon.it4i.cz\", \"project\": \"_pbs_project_default\", \"resource_list.place\": \"free\", \"submit_arguments\": \"_aiidasubmit.sh\", \"mail_points\": \"n\", \"hold_types\": \"n\", \"error_path\": \"login4.head.smc.salomon.it4i.cz:/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b8e7-627bb3c1c808/_scheduler-stderr.txt\", \"ctime\": \"Thu Nov 28 11:17:19 2019\", \"warning_fields_with_newlines\": [\"variable_list\"], \"server\": \"isrv5\", \"queue\": \"qprod\", \"output_path\": \"login4.head.smc.salomon.it4i.cz:/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b8e7-627bb3c1c808/_scheduler-stdout.txt\"}, \"job_owner\": \"zeleznyj\", \"annotation\": \"Not Running: Host set host=r26u17n698 has too few free resources\", \"job_state\": \"queued\", \"submission_time\": {\"date\": \"2019-11-28T11:17:19.000000\", \"timezone\": null}, \"job_substate\": \"10\"}",
    "max_wallclock_seconds": 12839,
    "mpirun_extra_params": [],
    "output_filename": "linres_out",
    "parser_name": "linres",
    "prepend_text": "",
    "process_label": "LinresCalculation",
    "process_state": "finished",
    "queue_name": "qprod",
    "remote_workdir": "/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b8e7-627bb3c1c808",
    "resources": {
        "default_mpiprocs_per_machine": 24,
        "num_cores_per_mpiproc": 1,
        "num_machines": 2,
        "num_mpiprocs_per_machine": 24
    },
    "retrieve_list": [
        "linres_out",
        "_scheduler-stdout.txt",
        "_scheduler-stderr.txt"
    ],
    "retrieve_temporary_list": [],
    "scheduler_lastchecktime": "2019-11-28T14:39:09.311833+00:00",
    "scheduler_state": "done",
    "scheduler_stderr": "_scheduler-stderr.txt",
    "scheduler_stdout": "_scheduler-stdout.txt",
    "sealed": true,
    "version": {
        "core": "1.0.0",
        "plugin": "0.7.3"
    },
    "withmpi": true
}

And the process report:

*** 142009: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 70 LOG MESSAGES:
+-> ERROR at 2019-11-28 14:41:52.279161+00:00
 | parser returned exit code<100>: Calculation did not produce all expected output files.
+-> ERROR at 2019-11-28 14:41:52.234269+00:00
 | Found files '[]', expected to find '[u'linres_out']'
+-> WARNING at 2019-11-28 14:39:09.286449+00:00
 | Job info not present, assuming it was completed
+-> WARNING at 2019-11-28 14:21:21.694051+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 14:18:22.113238+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 14:13:48.955143+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 14:10:47.889369+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 14:07:27.741239+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 14:03:35.906431+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:58:43.536306+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:55:42.732699+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:47:23.647461+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:44:22.760937+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:38:22.125500+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:35:20.974957+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:32:16.716929+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:23:13.826815+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:15:07.595462+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:09:41.679115+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 13:01:03.077900+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:58:00.946383+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:54:08.722780+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:50:09.428446+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:47:09.008112+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:44:38.716389+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:40:20.369477+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:35:10.483505+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:32:10.424486+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:27:18.533608+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:23:02.347553+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:19:30.582474+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:16:09.534070+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:11:56.147400+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:07:49.717746+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:03:23.134636+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 12:00:30.148316+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:55:43.782090+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:52:42.929186+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:48:30.993009+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:45:31.058787+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:42:27.902743+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:39:28.111814+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:36:24.927292+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:33:24.477290+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:31:18.665683+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:28:30.005357+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:25:29.197887+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:22:27.647306+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:20:25.268243+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:17:26.398864+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:14:23.658582+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:11:22.875130+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:08:22.055275+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:05:20.686598+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 11:02:18.087418+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:59:46.093265+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:57:43.753862+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:54:42.955224+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:51:42.141018+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:48:41.395709+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:45:38.962980+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:42:38.059482+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:39:37.340826+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:36:36.479974+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:33:35.674439+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:30:37.240164+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:27:33.937199+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:24:22.154178+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:21:21.392419+00:00
 | Job state: JobState.QUEUED
+-> WARNING at 2019-11-28 10:18:20.608976+00:00
 | Job state: JobState.QUEUED

and the qstat output from the remote:

Job Id: 9671041.isrv5
    Job_Name = aiida-142009
    Job_Owner = zeleznyj@login4.head.smc.salomon.it4i.cz
    job_state = Q
    queue = qprod
    server = isrv5
    Account_Name = OPEN-16-36
    Checkpoint = u
    ctime = Thu Nov 28 11:17:19 2019
    Error_Path = login4.head.smc.salomon.it4i.cz:/scratch/work/user/zeleznyj/ai
    ida/21/92/7473-256f-4c89-b8e7-627bb3c1c808/_scheduler-stderr.txt
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Thu Nov 28 15:46:38 2019
    Output_Path = login4.head.smc.salomon.it4i.cz:/scratch/work/user/zeleznyj/a
    iida/21/92/7473-256f-4c89-b8e7-627bb3c1c808/_scheduler-stdout.txt
    Priority = 0
    qtime = Thu Nov 28 11:17:19 2019
    Rerunable = False
    Resource_List.fairshare = 979524
    Resource_List.mpiprocs = 48
    Resource_List.ncpus = 48
    Resource_List.nice = 15
    Resource_List.nodect = 2
    Resource_List.place = free
    Resource_List.select = 2:mpiprocs=24:ncpus=24
    Resource_List.walltime = 03:33:59
    substate = 10
    Variable_List = LMOD_SETTARG_CMD=:,
    LMOD_DIR=/apps/all/Lmod/7.7.7/lmod/lmod/libexec,CLUSTERNAME=SALOMON,
    PBS_O_HOME=/home/zeleznyj,LMOD_SETTARG_FULL_SUPPORT=no,
    LESSOPEN=||/usr/bin/lesspipe.sh %s,PBS_O_LOGNAME=zeleznyj,
    SSH_CLIENT=147.231.126.109 53028 22,
    LMOD_PKG=/apps/all/Lmod/7.7.7/lmod/lmod,PBS_O_LANG=en_US.UTF-8,
    USER=zeleznyj,MODULEPATH_ROOT=/apps/all/Lmod/7.7.7/modulefiles,
    PATH=/usr/local/bin:/usr/bin:/opt/pbs/default/bin:/opt/puppetlabs/bin:
    /home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-de
    vel/python_scripts:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin:/home
    /zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-devel/p
    ython_scripts:/home/zeleznyj/bin,
    PS1=\\[\\e[38;5;35m\\]salomon:\\[\\033[0;30m\\]$(  _dir_chomp \"$(pwd)
    \" 20)\\[\\033[0;30m\\]\\$ \\[\\033[00m\\],LANG=en_US.UTF-8,
    SHELL=/bin/bash,XAUTHORITY=/home/zeleznyj/.Xauthority,SHLVL=2,
    LMOD_PREPEND_BLOCK=normal,HISTSIZE=,LMOD_sys=Linux,
    PBS_O_WORKDIR=/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b
    8e7-627bb3c1c808,
    PBS_O_PATH=/usr/local/bin:/usr/bin:/opt/pbs/default/bin:/opt/puppetlab
    s/bin:/home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-respo
    nse-devel/python_scripts:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin
    :/home/zeleznyj/.local/bin:/home/zeleznyj/bin/wannier-linear-response-d
    evel/python_scripts:/home/zeleznyj/bin,
    MANPATH=/apps/all/Lmod/7.7.7/lmod/lmod/share/man::/opt/pbs/default/sha
    re/man:/opt/puppetlabs/puppet/share/man:/opt/pbs/default/share/man,
    HOME=/home/zeleznyj,MODULESHOME=/apps/all/Lmod/7.7.7/lmod/lmod,
    PROMPT_DIRTRIM=3,PBS_O_SYSTEM=Linux,SHOST=login4,
    XDG_RUNTIME_DIR=/run/user/5297,LMOD_arch=x86_64,LMOD_VERSION=7.7.7,
    SELINUX_ROLE_REQUESTED=,MAIL=/var/spool/mail/zeleznyj,
    BASH_FUNC_module()=() {  eval $($LMOD_CMD bash \"$@\") && eval $(${LMO
    D_SETTARG_CMD:-:} -s sh)
},
    BASH_ENV=/apps/all/Lmod/7.7.7/lmod/lmod/init/bash,
    XDG_SESSION_ID=47238,PBS_O_MAIL=/var/spool/mail/zeleznyj,
    _=/opt/pbs/default/bin/qsub,
    MODULEPATH=/apps/modules/bio:/apps/modules/cae:/apps/modules/chem:/app
    s/modules/compiler:/apps/modules/data:/apps/modules/debugger:/apps/modu
    les/devel:/apps/modules/geo:/apps/modules/lang:/apps/modules/lib:/apps/
    modules/math:/apps/modules/mpi:/apps/modules/numlib:/apps/modules/OS:/a
    pps/modules/perf:/apps/modules/phys:/apps/modules/python:/apps/modules/
    system:/apps/modules/toolchain:/apps/modules/tools:/apps/modules/vis:/a
    pps/modules/vizserv,LMOD_FULL_SETTARG_SUPPORT=no,PBS_O_SHELL=/bin/bash,
    LMOD_CMD=/apps/all/Lmod/7.7.7/lmod/lmod/libexec/lmod,
    OLDPWD=/home/zeleznyj,HOSTNAME=login4,SELINUX_LEVEL_REQUESTED=,
    HISTCONTROL=ignoredups,
    XDG_DATA_DIRS=/home/zeleznyj/.local/share/flatpak/exports/share:/var/l
    ib/flatpak/exports/share:/usr/local/share:/usr/share,
    PWD=/scratch/work/user/zeleznyj/aiida/21/92/7473-256f-4c89-b8e7-627bb3
    c1c808,SELINUX_USE_CURRENT_RANGE=,
    BASH_FUNC_ml()=() {  eval $($LMOD_DIR/ml_cmd \"$@\")
},
    LMOD_PACKAGE_PATH=/apps/all/Lmod/7.7.7/lmod/etc,LOGNAME=zeleznyj,
    SSH_CONNECTION=147.231.126.109 53028 195.113.250.134 22,
    LMOD_COLORIZE=yes,PBS_O_QUEUE=qprod,
    PBS_O_HOST=login4.head.smc.salomon.it4i.cz
    comment = Not Running: Host set host=r26u17n698 has too few free resources
    etime = Thu Nov 28 11:17:19 2019
    eligible_time = 04:29:16
    Submit_arguments = _aiidasubmit.sh
    estimated.exec_vnode = (r2i4n3:ncpus=24)+(r3i6n2:ncpus=24)
    estimated.start_time = Thu Nov 28 16:06:13 2019
    project = _pbs_project_default
zhubonan commented 4 years ago

+-> ERROR at 2019-11-28 14:41:52.234269+00:00
 | Found files '[]', expected to find '[u'linres_out']'
+-> WARNING at 2019-11-28 14:39:09.286449+00:00
 | Job info not present, assuming it was completed

Here it appears that aiida thinks the job is finished because it is not found when the list of running jobs is requested from the scheduler.

sphuber commented 4 years ago

Here it appears that aiida thinks the job is finished because it is not found when the list of running jobs is requested from the scheduler.

This is something that I put in and it is expected. Certain schedulers, like SLURM and PBS (for some configurations), do not include finished jobs in the response of their job status commands. For SLURM, completed jobs will never appear in the list. So the only way for AiiDA to determine that a job is done is that it is no longer present in the list. As far as I know, there is also no way to confirm that the job has actually completed.
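
For illustration, a minimal sketch (not aiida-core's actual code) of what this "absent from the list means done" logic amounts to when parsing a plain-text job list; the function name and the assumed output format are made up:

```python
# Minimal sketch, assuming a job list whose first column is the job id: any tracked
# job id that no longer appears in the output is presumed to have completed.
def presumably_done(tracked_job_ids, joblist_stdout):
    reported = {line.split()[0] for line in joblist_stdout.splitlines() if line.strip()}
    return [job_id for job_id in tracked_job_ids if job_id not in reported]


# presumably_done(['176740', '176741'], '176741 R aiida-5942\n') -> ['176740']
```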

Zeleznyj commented 4 years ago

Do you think it is a problem with aiida or with the scheduler itself? I guess it is possible that the scheduler sometimes returns an empty result even if the job is still running, but it has never happened to me. It does sometimes happen, however, that the response is very slow or possibly even hangs completely.

By the way, in the pbspro version that I'm using, the qstat -f command for a finished job returns the following: qstat: 9671041.isrv5 Job has finished, use -x or -H to obtain historical job information

sphuber commented 4 years ago

but we see a similar behavior also with Slurm, or at least we definitely did in the past.

How do you know this for sure? This was before we put in the extra debug statements and the append_text and prepend_text, correct?

I cannot exclude that it happens in AiiDA, but the mechanism for determining whether something is done is very simple. We call the scheduler command to retrieve the status of all job ids. If the command returns any kind of error or its return value is non-zero, the command is rescheduled and performed again. So only if the command was marked as successful do we continue. Then we just parse the output for the state of the jobs. As mentioned before, for certain schedulers such as SLURM the only way to determine whether a job is done is by it being absent. Unless sacct is enabled, I don't even think a command exists to "confirm" completion. Maybe there is an error in the parsing. We should just add a debug line to always print the stdout returned by the scheduler. If we then see a job assumed to be done because it is missing, we can look up the corresponding output in the daemon log and see whether the job was actually in there but in a format that the parser misinterpreted.
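
As an illustration of the proposed debug line, a minimal sketch with plain Python logging; the logger name and the function are placeholders, not the actual aiida-core scheduler code:

```python
import logging

logger = logging.getLogger('scheduler_debug')  # placeholder logger name

def handle_joblist_response(retval, stdout, stderr):
    # Always record the raw scheduler response, so a later "job assumed done because
    # missing" can be cross-checked against what the scheduler really returned.
    logger.debug('joblist retval=%s\nstdout:\n%s\nstderr:\n%s', retval, stdout, stderr)
    if retval != 0:
        raise RuntimeError(f'scheduler joblist command failed: {stderr}')
    return stdout
```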

Zeleznyj commented 4 years ago

Yes, we saw this before the append_text and prepend_text. With slurm, we saw this behavior in two cases. One where verdi process report contains a submission error, in which case the calculation would be retrieved very soon after submission; this is the case we now understand. In the second case, there is no submission error and the calculation does not get retrieved immediately, which seems very similar to the case I have now reported with pbspro.

I've been testing on my laptop, where I could so far reproduce only the issue with the submission error. I have a different computer where we run production calculations; there we use both pbspro and slurm for different calculations. On that machine I have now also pulled the fix_2431_scheduler_parsing branch and reproduced the issue with pbspro. I will now test whether we also still see the issue with slurm.

Zeleznyj commented 4 years ago

It seems that, at least for the pbspro version that I'm using, a simple fix might be to use the -x parameter for the qstat -f command. This will show even finished jobs and, from looking at the source, it seems the parsing should then correctly identify the job as done.
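
A hedged sketch of what such a change could look like in a custom scheduler plugin, assuming the PBSPro plugin builds its query string in a `_get_joblist_command` method (verify the import path and method signature against your aiida-core version before relying on this):

```python
# Sketch only: subclass the PBSPro scheduler so the job-list query also returns recently
# finished jobs via `qstat -x`. Import path and method name are assumptions based on
# aiida-core 1.x and are not guaranteed to match your installed version.
from aiida.schedulers.plugins.pbspro import PbsproScheduler

class PbsproWithHistoryScheduler(PbsproScheduler):
    """PBSPro scheduler that asks qstat to include finished (historical) jobs."""

    def _get_joblist_command(self, jobs=None, user=None):
        command = super()._get_joblist_command(jobs=jobs, user=user)
        # Naively splice in the -x flag; a robust version would rebuild the command.
        return command.replace('qstat', 'qstat -x', 1)
```

Such a subclass would still have to be registered as a scheduler entry point before a Computer can be configured to use it.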

sphuber commented 4 years ago

I have added a commit to fix_2431_scheduler_parsing to also add debug logging of the full scheduler queue response for the PBSPro scheduler. This way, once you encounter a calculation with the message Job info not present, assuming it was completed, we can check the corresponding output in the daemon log and verify that the job indeed is not there.

It seems that, at least for the pbspro version that I'm using, a simple fix might be to use the -x parameter for the qstat -f command. This will show even finished jobs and, from looking at the source, it seems the parsing should then correctly identify the job as done.

What version of PBS are you using? Checking the documentation of the current version, 6.12, I see no -x option for qstat. In fact, there it suggests that completed jobs are included by default and that the -c option is there to omit them.

Zeleznyj commented 4 years ago

I will try this out.

The version of pbspro we are using is 19.2.4, which seems to be quite recent. Version 6.12 is, I think, Torque, which is separate software from PBS Professional.

One thing that I forgot to mention is that we are using a slightly modified version of the pbspro plugin. I don't think it's related though, since we only slightly modified the resources format in _get_resource_lines and haven't modified the PbsBaseClass at all. In our version of pbspro the number of cores is specified by ncpus and not by ppn, and this is the only thing we changed.

sphuber commented 4 years ago

The version of pbspro we are using is 19.2.4, which seems to be quite recent. Version 6.12 is, I think, Torque, which is separate software from PBS Professional.

Ah of course, it's confusing that they seem to use the same commands mostly. Probably one split off of the other at some point.

One thing that I forgot to mention is that we are using a slightly modified version of the pbspro plugin. I don't think it's related though, since we only slightly modified the resources format in _get_resource_lines and haven't modified the PbsBaseClass at all. In our version of pbspro the number of cores is specified by ncpus and not by ppn, and this is the only thing we changed.

That change should indeed not affect this problem. Looking at the documentation of qstat for PBSPro, although the -x option might help if the problem is really jobs going missing from the normal output for incorrect reasons, it might introduce entirely new problems. I did not see how one can control how far back in time the command will return completed jobs, so at some point the amount of data returned by each qstat -xf call will be huge. On the other hand, even if we were able to specify how many days in the past, how do we determine this value? Set it too high and you are transferring too much data. Too low and you might be back to square one, omitting still running jobs and so falsely thinking they are finished.

greschd commented 4 years ago

On the other hand, even if we were able to specify how many days in the past, how do we determine this value? Set it too high and you are transferring too much data. Too low and you might be back to square one, omitting still running jobs and so falsely thinking they are finished.

In principle we know the age of the oldest still-running calculation that AiiDA is tracking, no?

Zeleznyj commented 4 years ago

I was thinking of simply using -x when querying for specific job ids. That way we know that, if nothing is returned for some job id, it is probably a problem with pbspro, since that should not happen.

Querying with -x without specifying the job ids might also be possible; it doesn't return all historical jobs, only those from some recent period. I'm not sure what the reason for this is, though.

Zeleznyj commented 4 years ago

I've done some more testing. You can find the relevant part of the daemon log file here. The process for which the issue occurred is 206691, which has a jobid 9700408.isrv5.

I've removed messages containing [DEBUG] Broadcast message 'None' received with communicator to make the log file cleaner since I assume those are not related and are quite frequent.

The last _parse_joblist_output relevant to this job is at line 467021. In this output the jobid is not present and consequently the job was retrieved. Strangely, the _parse_joblist_output appears twice in the log file and the first occurrence is incomplete; I'm not sure if this is related. The second one contains all the jobs apart from 9700408.isrv5 and 9700463.isrv5, but it is cut off at some point: the output for the last job is incomplete.

I know that these jobs were actually still running on the remote computer even after they were retrieved and marked as finished in aiida, since I have a script that checks for such jobs. This script runs qstat -f (without job id specifiers, so it returns all the jobs on the cluster), picks out the aiida jobs, compares them with the list of running calculations within aiida, and kills those that are finished in aiida but still running on the cluster.
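
For reference, a heavily simplified sketch of such a watchdog; the qstat -f parsing follows its usual per-job block format, while the comparison against AiiDA's own list of running calculations is only indicated in a comment, since the actual script is not shown here:

```python
# Collect all jobs on the cluster whose name looks like an AiiDA job (job id -> job name).
import subprocess

def aiida_jobs_on_cluster():
    out = subprocess.run(['qstat', '-f'], capture_output=True, text=True, check=True).stdout
    jobs, current_id = {}, None
    for line in out.splitlines():
        stripped = line.strip()
        if stripped.startswith('Job Id:'):
            current_id = stripped.split(':', 1)[1].strip()
        elif current_id and stripped.startswith('Job_Name') and 'aiida-' in stripped:
            jobs[current_id] = stripped.split('=', 1)[1].strip()
    return jobs

# The ids returned here would then be compared against the calculations AiiDA still
# considers to be running, and any job alive on the cluster but already finished in
# AiiDA would be killed (e.g. with qdel).
```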

This might be a problem within pbspro, but it seems a bit strange to me. It might be connected to this server often being overloaded and responding slowly, though. I could try to do some testing on the server if I ever see this behavior directly.

broeder-j commented 4 years ago

This is a really nasty issue. AiiDA is too fast ;-)

If the command returns any kind of error or its return value is non-zero, the command is rescheduled and performed again. So only if the command was marked as successful do we continue. Then we just parse the output for the state of the jobs

@sphuber: maybe two ideas worth noting:

  1. How about adding another state, maybe 'submission failed': if the scheduler returns an error or a non-zero exit code, AiiDA only concludes that the submission really failed, and does the retry, after it has checked again after a 'short time frame' (the next time the joblist command is executed) that the job is not in the job list or was cancelled. Otherwise, if the job is there, it moves to queued or running. This won't solve the case where the submission was fine but the scheduler was too 'slow' to put the job in the list, so that AiiDA already finds it missing from the list and considers it done (but for that there is the connection cooldown time).

So the only way for AiiDA to determine that a job is done is that it is no longer present in the list. As far as I know, there is also no way to confirm that the job has actually completed.

  2. Or AiiDA always creates a file in .aiida on the remote where the job runs, as the last line appended to the job script, or adds something to the calcinfo file, which, if present, would tell AiiDA (when retrieving) that the job really completed. If it is not present, you still don't know whether the job is still running or crashed. (A rough sketch of this idea follows below.)
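
A rough, hedged sketch of the sentinel-file variant of idea 2, using the append_text option that CalcJobs already expose; the marker file name and the check are made up for illustration:

```python
# Sketch: the job script touches a marker file as its very last line, and the retrieved
# folder is only trusted as "really finished" if that marker is present. The marker name
# is arbitrary; only the strings involved are shown, not a full CalcJob setup.
SENTINEL = 'JOB_REALLY_DONE'

# Would go into builder.metadata.options.append_text of the calculation:
APPEND_TEXT = f'touch {SENTINEL}'

def job_really_completed(retrieved_filenames):
    """True if the sentinel file is among the retrieved file names."""
    return SENTINEL in retrieved_filenames
```

The marker file would of course also have to be added to the calculation's retrieve list so that it ends up among the retrieved files.
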
ltalirz commented 4 years ago

Concerning the question of how to determine whether a job has finished: In my view this is really the scheduler's job and, if possible, we should try to avoid implementing our own solutions to determine this.

Certain schedulers, like SLURM and PBS (for some configurations), do not include finished jobs in the response of their job status commands. For SLURM, completed jobs will never appear in the list. So the only way for AiiDA to determine that a job is done is that it is no longer present in the list. As far as I know, there is also no way to confirm that the job has actually completed.

To me, there are two open questions here:

  1. Is it true that we need to support versions of slurm/pbs that do not support showing finished jobs in the job list? (maybe yes)
  2. Even if a finished job is no longer shown in the job list, it should be possible to get information about this specific job, including its status, right? Before deciding that a job is finished, could AiiDA not query the scheduler for the specific status of this job?
greschd commented 4 years ago

On the second point: At least in SLURM (in the configurations I have access to), it is also possible to query jobs by their name -- i.e., aiida-<pk>. That could also circumvent the case where a job was actually submitted, but the call to the scheduler did not return properly.
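
To illustrate, a small sketch of querying SLURM by job name from Python; the squeue flags are standard SLURM options and the aiida-<pk> naming follows this thread, but this helper is not part of aiida-core:

```python
# Return the states squeue reports for all jobs with the given name; an empty list means
# the scheduler currently knows of no pending or running job with that name.
import subprocess

def slurm_states_for_name(job_name):
    result = subprocess.run(
        ['squeue', '--noheader', '--name', job_name, '--format', '%T'],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()

# e.g. slurm_states_for_name('aiida-5941') might return ['RUNNING'] while the job is queued or running.
```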

Zeleznyj commented 4 years ago

In the version of PBS Pro that we are using it is possible to list finished jobs as well, but this is not foolproof, since it only lists jobs up to some time and this time is probably dependent on the server configuration. It would still be a great help to include this as a possibility, since for me at least it would eliminate the problem in most cases. It should simply be enough to use qstat -f -x instead of qstat -f.

For me, in most cases the issue seems to occur because the output from PBS is incomplete: it is abruptly cut off at some point. It would be possible to check whether the output is cut off by checking whether all expected fields are there. This is again not foolproof and probably also server dependent, but it would help a lot, at least in my case.
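
A minimal sketch of such a completeness check, assuming the qstat -f output has already been split into per-job blocks; the set of required fields is a guess and would need tuning for a given server:

```python
# Reject a qstat -f job block that is missing fields every complete block normally contains.
REQUIRED_FIELDS = ('Job_Name', 'job_state', 'queue', 'ctime')

def block_looks_complete(job_block):
    return all(field in job_block for field in REQUIRED_FIELDS)
```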

We have now finished with the calculations, so I cannot test right now, but we will start high-throughput calculations on the same remote in a few months, and then I can try to find a solution and test things out.

This is a big problem for us, as it happens very frequently. We had to kill more than 10,000 jobs in a span of a few months because of this issue.