aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Calculation job can be submitted twice if first submission succeeds but communication of result from scheduler to AiiDA times out #3404

Open Zeleznyj opened 5 years ago

Zeleznyj commented 5 years ago

I've encountered an issue where sometimes a calculation will show as finished in aiida, but the actual calculation on the remote computer is still running. Aiida will retrieve the files and run the parser without showing any error. This happened with the ssh transport and the slurm scheduler. I'm not sure if the problem is necessarily related to slurm though, since we are not using other schedulers much at the moment. The calculations are using our own FPLO calculation plugins. It is possible that the issue is somehow related to some problem in the plugins, but to me it seems like a problem with aiida, since everything on our side is working fine: the calculation is submitted correctly and finishes correctly, the only problem is that the results are retrieved before the remote calculation is finished. This thus looks like a problem with parsing the queue status. The problem happens randomly; when we resubmit a calculation, it will usually finish fine.

I've noticed the problem after checking out the develop branch a couple of days ago, but most likely the problem existed also before, when I was using the 1.0.0b6 version.

I can try to include more details, but I'm not sure where to start about debugging this.

ltalirz commented 4 years ago

> It would still be a great help to include this as a possibility, since for me at least this would eliminate the problem in most cases. It should be simple enough to use qstat -f -x instead of qstat -f.

Hi @Zeleznyj : just in case this does not land in aiida-core any time soon, it should actually be quite straightforward for you to write a scheduler plugin that inherits from the PbsPro (or PbsBase) scheduler and overrides the method

https://github.com/aiidateam/aiida-core/blob/b59769602ff8ea3a150e992d77a5451d20860558/aiida/schedulers/plugins/pbsbaseclasses.py#L148

and then makes it available in the aiida.schedulers entry point group. This way you can have your custom method for treating your scheduler without worrying about overwriting changes when you update AiiDA.

I just checked on the aiida registry and we don't seem to have an example of a scheduler plugin... that could make for a good first example.
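A minimal sketch of what such a plugin could look like, assuming the method linked above is the one that builds the qstat command line and that it returns the command as a string (the class, module and entry point names below are hypothetical):

# my_package/scheduler.py -- hypothetical module in your own plugin package
from aiida.schedulers.plugins.pbspro import PbsproScheduler


class HistoryPbsproScheduler(PbsproScheduler):
    """PBSPro scheduler that adds -x so that finished jobs are still reported by qstat."""

    def _get_joblist_command(self, jobs=None, user=None):
        # Reuse the parent implementation (assumed to return the full command as a
        # string) and insert the -x flag right after `qstat -f`.
        command = super()._get_joblist_command(jobs=jobs, user=user)
        return command.replace('qstat -f', 'qstat -f -x', 1)

The class would then be registered in the plugin package's setup.json under the aiida.schedulers entry point group (e.g. "mypackage.pbspro_history = my_package.scheduler:HistoryPbsproScheduler"), so that it can be selected when running verdi computer setup.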

sphuber commented 4 years ago

@ltalirz which could tie in neatly to the recently opened issue #3853. I was going to forward it to the users mailing list as I think it is a better fit there, but if we add a section to the documentation then we can close that issue.

Zeleznyj commented 4 years ago

Yes, I can definitely try to do this myself, but as I said, it will take several months.

ltalirz commented 4 years ago

Right, let's see whether we can get this started. Pinging also @pzarabadip for info, since he had a use case as well (have you already made a scheduler plugin out of it, or not yet?).

ezpzbz commented 4 years ago

Yes, I had a situation requiring a custom scheduler plugin, related to issue #2977. At the time, I made a copy of the pbspro plugin and added a separate entry point in aiida-core itself (https://github.com/pzarabadip/aiida-core/commit/3ace86e1fd702661aaf2ab8153f048c3f5344c4f), which has not been a convenient way of doing it. I still have not moved it to my own plugins based on @ltalirz's nice suggestion. I will let you know once I do (which should be soon). I can then help update the documentation about implementing a custom scheduler plugin in a plugin package, which can somehow address issue #3853.

broeder-j commented 4 years ago

In case we are not able to solve the 'submit several times' part due to submission failures:

One possibility to make the calculations at least work fine is to ensure a different running directory for every submitted calcjob, i.e. if aiida tries to resubmit something, the directory where the job is actually executed is changed, ensuring that no two running jobs ever use the same running directory and, of course, that aiida knows from which one to retrieve the files. For this one could introduce a subdirectory structure on the remote machine like <aiida-run_dir>/<uuid>/<submission_try_number>. I do not know how easy that would be to do: currently upload and submit are two separate tasks, and the file upload and preparation of the files have already happened before.
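A sketch of that layout (hypothetical; AiiDA does not currently create per-attempt subdirectories):

import posixpath

def workdir_for_attempt(aiida_run_dir, calc_uuid, attempt):
    """Build a per-attempt working directory so that retries never share a folder."""
    return posixpath.join(aiida_run_dir, calc_uuid, 'try_{:02d}'.format(attempt))

# e.g. workdir_for_attempt('/scratch/user/aiida_run', 'd411b7ad-c962-4997', 2)
# -> '/scratch/user/aiida_run/d411b7ad-c962-4997/try_02'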

ltalirz commented 4 years ago

Coming back to the original issue:

sphuber commented 4 years ago

> We seem to be experiencing similar issues with the slurm scheduler on fidis.epfl.ch (only when submitting large numbers of jobs). This does make me question whether the issue is really on the scheduler side - in any case it means we need to find a workaround

I wouldn't be so sure of this. I spent a lot of time debugging this and once I could really trace what happened, the problem was really clear. We can verify this if you add the following to your inputs:

inputs = {
....
    'metadata': {
        'options': {
            ...
            'prepend_text': 'echo $SLURM_JOB_ID > prepend.jobid',
            'append_text': 'echo $SLURM_JOB_ID > append.jobid',
        }
    }
}

If my suspicion is correct, you will see different jobids in the prepend.jobid and append.jobid files. One of those will correspond to the jobid that is stored in the attributes. Please give that a go first.

> @Zeleznyj mentions [here](https://github.com/aiidateam/aiida-core/issues/3404#issuecomment-570805521) that the DEBUG output of the `_parse_joblist` command is in the log _twice_: do you mean AiiDA prints things twice or there is some duplication in the pbspro output?
> It would be great to have a look at the log, but the dropbox link you posted no longer works. Could you perhaps attach the log file (or at least a substantial section around the relevant part) directly to the comment?

This is almost certainly unrelated and just due to a bug that I fixed in PR #3889, which was released in v1.2.0.

> @sphuber About the changes in [your branch](https://github.com/sphuber/aiida_core/commits/fix_2431_scheduler_parsing) - do you think there is an issue with merging them into `develop` until this issue is resolved?
> It might make it easier for us to collect input on this issue. I've opened a PR of your rebased branch here #3942

In principle not, but these were really added just because we didn't even know where to look in the beginning. I don't feel like merging those kinds of ad-hoc changes. If you really do feel it is important, then I would first clean it up and apply it to all schedulers, so that at least the changes are consistent across the board.

ltalirz commented 4 years ago

> If my suspicion is correct, you will see different jobids in the prepend.jobid and append.jobid files. One of those will correspond to the jobid that is stored in the attributes. Please give that a go first.

Ah ok, so you think it's really not due to parsing of the scheduler status but the return code when submitting the job. Happy to give that a try as well.

I just did some simple tests on fidis, submitting many jobs in short succession [1,2]. The sbatch command frequently hangs for a couple of seconds - e.g. submitting 500 jobs took about 87s; submitting 1000 jobs took 125s. However, my script checks the exit code of sbatch and I have not been able to get a single non-zero exit code from sbatch.

Is perhaps instead the network connection the issue? Note also @zhubonan's comment, who pointed out he ran into this type of issue when the internet connection dropped. @sphuber: If there is a connection issue while AiiDA checks for the job status, is there a mechanism in place for it to recover?

The only other idea that comes to my mind would be that sbatch somehow has a problem if the same user has multiple ssh connections open and submits from them at the same time (?). It's certainly something that regular (non-AiiDA) users would basically never do. That's also something I can try rather easily.

In order to continue these tests, I guess the best would be to now move to testing from my machine, and to set up machinery that is more and more similar to what happens inside AiiDA. Happy for pointers on how to best accomplish this.

P.S. Perhaps unrelated, but there are python libraries that seem to be performance-oriented replacements for paramiko (see https://github.com/aiidateam/aiida-core/issues/3929), which may be worth investigating.

[1] Fidis has a limit of 5000 maximum submitted jobs in running/pending state per user (MaxSubmitPU in sacctmgr list qos).

[2] job script

#!/bin/bash -l
#SBATCH --job-name=aiida-submission-debug
#SBATCH --nodes=1             #max 2 for debug
#SBATCH --ntasks=1            #28 cpus per node on fidis
#SBATCH --ntasks-per-node=1  #28 cpus per node on fidis
#SBATCH --time=0:00:10        #max 1h for debug
##SBATCH --partition=debug    #debug partition
#SBATCH --partition=parallel

echo "SLURM ID $SLURM_JOB_ID; TEST_ID $TEST_ID"

test script

#!/bin/bash
set -e
#set -x

date +"%T.%3N"
for i in {1..500}; do
  export TEST_ID=$i
  sbatch test.slurm
  echo "TEST $TEST_ID : exit code $?"

done
date +"%T.%3N"

sphuber commented 4 years ago

> Ah ok, so you think it's really not due to parsing of the scheduler status but the return code when submitting the job.

Well, this is the case we fully confirmed with the OP of this thread. To summarize, the following seems to be happening:

  1. AiiDA calls submit

  2. Scheduler actually receives it and launches the job (could of course be with some delay) with jobid = 1

  3. However, the scheduler fails to send (or AiiDA fails to receive) the response, so the exponential backoff mechanism is triggered and AiiDA will try to resubmit. This can be seen from the logs:

    | SchedulerError: Error during submission, retval=1 | stdout= | stderr=sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation | sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

  4. AiiDA submits again, in the same working directory, and this time it receives the response from the scheduler with jobid = 2.

  5. Job 2 now starts but crashes immediately because the starting files have already been modified by job 1 (this behavior is of course code dependent), while job 1 continues to run.

  6. AiiDA, which polls for jobid = 2, now sees the job is done and starts to retrieve. Since job 2 crashed at the line of the main code in the submit script, the append text is never executed, and the append.jobid file will contain jobid = 1 once written by job 1. However, the prepend text will have been run by job 2, so prepend.jobid will contain jobid = 2. The same goes for the jobid in the attributes, as jobid = 2 is the only value that AiiDA has successfully received. This corresponds perfectly with the results posted by @Zeleznyj.

The problem clearly originates in point 3. The question is, whose fault is it? Is the failure to communicate because SLURM is overloaded and times out before responding? Or are there connection issues and the SSH connection is killed before AiiDA can receive the message in full? Here I am not fully sure, it could be both, but given that they say this happens under heavy load of the scheduler (which is on a relatively small machine), it might rather be SLURM that is overloaded.
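A quick way to check a suspect calculation against this scenario (a sketch; it assumes the prepend.jobid and append.jobid files were retrieved along with the other outputs, e.g. via the plugin's retrieve list, and the pk is hypothetical):

from aiida.orm import load_node

node = load_node(1234)  # hypothetical pk of the suspect CalcJobNode
print('job_id attribute:', node.get_attribute('job_id'))

retrieved = node.outputs.retrieved
for name in ('prepend.jobid', 'append.jobid'):
    if name in retrieved.list_object_names():
        print(name, ':', retrieved.get_object_content(name).strip())

# In the scenario above, the attribute and prepend.jobid show the second job id,
# while append.jobid (written by the first job) shows a different one.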

> Is perhaps instead the network connection the issue? Note also @zhubonan's comment, who pointed out he ran into this type of issue when the internet connection dropped. @sphuber: If there is a connection issue while AiiDA checks for the job status, is there a mechanism in place for it to recover?

Yes, like any other transport task, if this fails, it will hit the exponential backoff mechanism (EBM) and try again. Since this is a transient problem for checking the status, this should not be a problem.

> In order to continue these tests, I guess the best would be to now move to testing from my machine, and to set up machinery that is more and more similar to what happens inside AiiDA. Happy for pointers on how to best accomplish this.

Honestly, the best way is to actually use AiiDA. You can even use a dummy script and class, such as the ArithmeticAddCalculation, add the prepend_text and append_text to the options as I described, and see if the results match my analysis above.

ltalirz commented 4 years ago

Just as a brief update - I've submitted 1000 calculations in one go on fidis using the script below (using 3 daemon workers and 500 slots per worker) and all finished fine.

Note: Since the "calculation" I'm running would not crash even if the previous one had started running, I'm simply appending job IDs to prepend.jobid and append.jobid. In no case was the length of these files different from 1.

from aiida import orm
from aiida.plugins import DataFactory, CalculationFactory
from aiida import engine
import os

diff_code = orm.Code.get_from_string('diff@fidis')
DiffParameters = DataFactory('diff')
parameters = DiffParameters({'ignore-case': True})

SinglefileData = DataFactory('singlefile')
file1 = SinglefileData(
    file=os.path.abspath('f1'))
file2 = SinglefileData(
    file=os.path.abspath('f2'))

SFC = CalculationFactory('diff')

for i in range(1000):
    builder = SFC.get_builder()
    builder.code = diff_code
    builder.parameters = parameters
    builder.file1 = file1
    builder.file2 = file2
    builder.metadata.description = "job {:03d}".format(i)
    builder.metadata.options.prepend_text =  'echo $SLURM_JOB_ID >> prepend.jobid'
    builder.metadata.options.append_text =  'echo $SLURM_JOB_ID >> append.jobid'
    builder.metadata.options.max_wallclock_seconds = 10
    builder.metadata.options.resources = {
            'num_machines': 1,
            'num_mpiprocs_per_machine': 1,
    }

    result = engine.submit(SFC, **builder)
    print(result)

Zeleznyj commented 4 years ago

The scenario described by @sphuber is indeed something I saw happening and I think we clearly confirmed that, but this is not the only case where I see this issue. We also see that calculations are sometimes incorrectly retrieved even when there is no submission error. This happens sometimes after a long time, and sometimes even while the job is still in the queue, so it really doesn't seem to be the first issue. For me, this second issue is much more common than the first. I saw the second issue mainly with a remote that's using PBSPRO, and this is where I tested it, whereas the first issue I saw mainly with a remote that's using SLURM.

In this case the problem seems to be that aiida receives an incomplete output from qstat. I'm attaching the _parse_joblist_output from the log file: 206691.short.LOG. The jobs that should be present are: '9700431.isrv5' '9700454.isrv5' '9700388.isrv5' '9700387.isrv5' '9700408.isrv5' '9700463.isrv5'. I know all of these were in fact still present, since I have a script separate from aiida that checks for running jobs. You can see in the output that the last two are missing and that the output is suddenly cut off during the entry for job '9700387.isrv5'. So the last two jobs are retrieved even though they are still running.

I think the problem is somehow related to this PBSPRO server often being overloaded and very slowly responsive. I'm not running any calculations there right now, so I cannot test, but I should be able to test in a month or so.

giovannipizzi commented 4 years ago

Thanks @Zeleznyj for the report!

~I see 2 possibilities:~

EDIT: I think the problem is that the scheduler fails with the following error code: retval: 141

I.e., the return value (error code) is not zero, but 141.

Is your scheduler PBSPRO, and which version? I couldn't find this error here, even when checking numbers where ERRNO%256==141, but I also don't know if this is the right place to look (it seems these are the errors raised while submitting, not while calling qstat).

Probably the cause is this: https://github.com/aiidateam/aiida-core/blob/e8e7e46a02d6777d0c92e6090a8061f1f26e3cf2/aiida/schedulers/plugins/pbsbaseclasses.py#L362-L369

However, as mentioned in the comment, this was done because if AiiDA passes an explicit list of jobs, as it does, you would get a non-zero error code (and this is very common: it happens exactly when you ask for a finished job). Or at least, this was the case some years ago when I wrote this. So, one cannot just uncomment those lines.

I don't have access to PBSpro anymore. If you want to help, with PBSPro you could check which error code you get when you run a qstat -f command specifying: two jobs that exist (hopefully this is zero), one job that exists and one that doesn't, and two jobs that do not exist (you can get the code with something like qstat -f JOBID1 JOBID2 ; echo $?, if I am not mistaken).

Also, you could try to run a "working" qstat many times in a loop, and check if you randomly get a 141 error (or some other error).

Finally the best would be if we find some documentation of what 141 exactly means.

I see a few options:

A final option is that 141 does not come from qstat, but e.g. from bash being interrupted or something like this. We might need to double check, depending on the results of the tests above.
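For what it is worth, if the 141 comes from the shell rather than from qstat itself, it may simply follow the usual convention that exit statuses above 128 mean "terminated by signal (status - 128)": 141 - 128 = 13 is SIGPIPE, which would fit the last option. A small sketch (the job ids are placeholders; run on the PBSPro login node) that checks this and runs the suggested qstat loop, recording any non-zero return codes:

import signal
import subprocess

# 141 - 128 = 13, i.e. SIGPIPE on Linux: consistent with the shell reporting a
# command killed by a broken pipe rather than a qstat-specific error number.
assert signal.SIGPIPE == 141 - 128

# Run the same qstat -f command repeatedly and record any non-zero return codes.
cmd = ['qstat', '-f', '9700431.isrv5', '9700454.isrv5']  # placeholder job ids
for attempt in range(100):
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(attempt, result.returncode, result.stderr.strip())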

giovannipizzi commented 4 years ago

By the way, it might be comforting to see that there is a distinguishable error message and that this is either 15001 or 15001%256=153: see this source code line

Zeleznyj commented 4 years ago

The PBSPRO version is 19.2.4. I've tested and when a job is missing the error is indeed 153. One possibility is that the 141 is related to some time-out of the request. It seems that in this particular case the response took a very long time. From the log file it seems that the qstat command was issued at 12:09 and the response was received only at 12:25.

12/14/2019 12:09:04 AM <2489> aiida.transport.SshTransport: [DEBUG] Command to be executed: cd '/home/zeleznyj' && qstat -f '9700431.isrv5' '9700454.isrv5' '9700388.isrv5' '9700387.isrv5' '9700408.isrv5' '9700463.isrv5'
12/14/2019 12:25:05 AM <2489> aiida.schedulers.plugins.pbsbaseclasses: [DEBUG] Results of `_parse_joblist_output`:

This slow responsiveness happens quite often with this server.

I will now try to see if I can reproduce the 141 error. It may be difficult to reproduce though, since I now strongly suspect that the overloading of the server is in fact caused by aiida itself, and I'm not running any calculations now. I've increased the minimum_job_poll_interval to 180, but I still see that when I run calculations with aiida on this server, the responsiveness drops dramatically. The problem might be that I was normally running 5 daemon workers and I suppose each one is sending requests independently.

ezpzbz commented 4 years ago

Hi @Zeleznyj, you are using IT4Innovations, am I right?

Zeleznyj commented 4 years ago

Hi @pzarabadip, yes this was at IT4Innovations.

sphuber commented 4 years ago

> The problem might be that I was normally running 5 daemon workers and I suppose each one is sending requests independently.

Yes, this is correct, the connection and polling limits are guaranteed per daemon worker. So if you have a minimum polling interval of 180 seconds with 5 active workers, you can expect up to 5 poll requests per 180 seconds. The same goes for the SSH transport minimum interval. What have you set that to? You can see it with verdi computer show; it should be the safe_interval property.

ezpzbz commented 4 years ago

> Hi @pzarabadip, yes this was at IT4Innovations.

Great. I'd like to share a recent experience there which may help with debugging this issue, or at least rule out one of the possible sources.

I am running loads of CP2K calculations there using AiiDA and never had any issue till this week. I started getting slow or no responses at random times, to the point that it was even triggering the exponential backoff mechanism. Again randomly, sometimes it would resolve itself, and sometimes it would persist for five consecutive attempts, resulting in all jobs being paused.

I've investigated this from my side (Masaryk University) and from the remote side (IT4I) and found the source of the issue. The IT4I login address (for example, salomon.it4i.cz) is a round-robin DNS record over the actual login nodes, which does the load balancing. In my case, after tracing back the issue, I realized that one of the actual login nodes was blocked from our side (a false positive security alert), and therefore, when I was using the general address to connect, it was hanging randomly whenever the blocked login node was picked. You may ping the login nodes to see if they are all accessible. Cheers,

Zeleznyj commented 4 years ago

I cannot see the safe_interval in verdi computer show output. I have only found the Connection cooldown time in configure ssh, which is set to 30s. The slow response of the server is probably not just due to aiida. It seems that qstat returns output fast, but qstat -f is slow. I have now tried to run qstat -f with 50 job ids twice at the same time and it was very slow: the first one took 10 minutes and the second 35. This obviously becomes worse when I have 5 aiida daemon workers running, but I would say that this is mainly an issue with the server. I have tried the same thing on a different PBSPRO server and there the response is very fast. Interestingly, it is usually much faster to run qstat -f without any job ids, which returns information for all the running jobs.

I have done some testing now with PBSPRO. The error code of qstat -f when an already finished job is included is 35 (it's this one: #define PBSE_HISTJOBID 15139 /* History job ID */). So far the only error codes I saw were 35 (when a job has already finished) and 153 (when I use a nonexistent job id), but I continue testing. qstat will also print out an error message when a job has finished or does not exist (for example qstat: 9872770.isrv5 Job has finished, use -x or -H to obtain historical job information).

@pzarabadip I don't think I have this issue, for me the connection to IT4I has been very stable. It's interesting that you haven't seen the same issue as I have though (with jobs being retrieved while still running).

ltalirz commented 4 years ago

> I cannot see the safe_interval in verdi computer show output.

Yeah, this is in verdi computer configure show ... (because it was set with verdi computer configure). I also find this a bit confusing sometimes

sphuber commented 4 years ago

I just ran into this issue on Piz Daint, which uses a SLURM scheduler. The parsing of the output file failed and after some investigation it is clear that it is because two processes were writing to it at the same time. Sure enough, looking in the process report, I saw that the submit task failed once. The scheduler stdout also contained the output of two SLURM jobs. So this is a clear example where the scheduler submits the first request but then the communication of the result to AiiDA times out and so AiiDA submits again, resulting in two processes running in the same folder.

I note that this happened at a particularly busy moment on the cluster. Yesterday was the end of the allocation period and the queue was extraordinarily busy. Calling squeue on the cluster directly also took a long time to respond. It therefore seems very likely that the scheduler just had trouble keeping up and responding to AiiDA in time, and it timed out.

I don't think it will be easy or even possible to detect this and fully prevent the second submission, but at least we can adapt the submission scripts to write a lock file when executed and, if one is already present, abort. This can then be picked up by the scheduler output parsing that I have implemented (still not merged but in an open PR), which can at least fail the calculation with a well-defined exit code, so we won't have to search as long.
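A minimal sketch of such a guard (not what aiida-core implements; the lock file name and the exit code 111 are arbitrary placeholders), passed through the prepend_text option so that a second submission into the same working directory aborts immediately with a recognizable code:

# Bash snippet injected via the prepend_text option (built here as a Python string).
LOCK_GUARD = '\n'.join([
    'if [ -f aiida.lock ]; then',
    '    echo "working directory already claimed by job $(cat aiida.lock)" >&2',
    '    exit 111',
    'fi',
    'echo "${SLURM_JOB_ID}" > aiida.lock',
])

# e.g. builder.metadata.options.prepend_text = LOCK_GUARD
# The scheduler output parsing could then map exit code 111 to a dedicated
# exit code on the calculation node.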

ltalirz commented 4 years ago

@sphuber We ran into this issue on helvetios/fidis.

Now, following up on your suggestion to investigate the failure mode, I have looked at the historical job records of both clusters during that time using sacct. In particular, one can query for the job ids and work directories of past jobs as follows:

# get work dirs of jobs since 2020-07-01 
sacct -S 2020-07-01 -u ongari --format "jobid,jobname%20,workdir%70" > sacct-list
cat sacct-list | grep scratch | grep aiida_run | awk '{print $3}' > directories
cat directories | wc -l  # prints 4705
cat directories | sort -u | wc -l  # prints 4705

It turned out that the list of work directories is unique. If the reason for the failure was that AiiDA submitted the same job twice (causing one or both of them to fail), then we should see the corresponding work directories appear twice in the list. I.e. in this particular case, I believe the issue occurs at the point where AiiDA queries the scheduler for the job status and gets an incomplete/incorrect answer.

As mentioned during our discussion, one solution to this issue could be to add a fail-safe mechanism for when AiiDA gets the information that a job has completed. This could be any of

and only accepting that the job is no longer running if the output agrees.
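One hypothetical form of such a cross-check (a sketch, not aiida-core code): before trusting that a job has disappeared from squeue, query sacct for its state and only proceed with retrieval if sacct agrees the job is no longer active.

import subprocess

ACTIVE_STATES = {'PENDING', 'CONFIGURING', 'RUNNING', 'COMPLETING', 'SUSPENDED'}

def job_still_active(job_id):
    """Return True if sacct still reports the job in an active state."""
    output = subprocess.run(
        ['sacct', '-j', job_id, '--noheader', '--parsable2', '--format=JobID,State'],
        capture_output=True, text=True, check=True,
    ).stdout
    states = {line.split('|')[1].split()[0] for line in output.splitlines() if line.strip()}
    return bool(states & ACTIVE_STATES)

# e.g. only accept the 'done' verdict once job_still_active('1219409') is False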

Mentioning @danieleongari for info

ltalirz commented 4 years ago

Following a brief discussion with @giovannipizzi and @sphuber , here follows some inspection of one of the failed calculation nodes:

$ verdi process report 896549
*** 896549 [CELL_OPT]: None
*** (empty scheduler output file)
*** (empty scheduler errors file)
*** 1 LOG MESSAGES:
+-> REPORT at 2020-08-10 21:16:13.126240+00:00
 | [896549|Cp2kCalculation|on_except]: Traceback (most recent call last):
 |   File "/home/daniele/anaconda3/envs/aiida1/lib/python3.6/site-packages/plumpy/process_states.py", line 225, in execute
 |     result = self.run_fn(*self.args, **self.kwargs)
 |   File "/home/daniele/aiida1/aiida_core/aiida/engine/processes/calcjobs/calcjob.py", line 286, in parse
 |     exit_code = execmanager.parse_results(self, retrieved_temporary_folder)
 |   File "/home/daniele/aiida1/aiida_core/aiida/engine/daemon/execmanager.py", line 439, in parse_results
 |     exit_code = parser.parse(**parse_kwargs)
 |   File "/home/daniele/aiida1/aiida-cp2k/aiida_cp2k/parsers/__init__.py", line 34, in parse
 |     exit_code = self._parse_stdout()
 |   File "/home/daniele/aiida1/aiida-lsmo/aiida_lsmo/parsers/__init__.py", line 62, in _parse_stdout
 |     raise OutputParsingError("CP2K did not finish properly.")
 | aiida.common.exceptions.OutputParsingError: CP2K did not finish properly.

And its attributes:

In [1]: c = load_node(896549)

In [2]: c.attributes
Out[2]:
{'job_id': '1219409',
 'sealed': True,
 'version': {'core': '1.3.0', 'plugin': '1.1.0'},
 'withmpi': True,
 'exception': 'aiida.common.exceptions.OutputParsingError: CP2K did not finish properly.\n',
 'resources': {'num_machines': 2, 'num_mpiprocs_per_machine': 36},
 'append_text': '',
 'parser_name': 'lsmo.cp2k_advanced_parser',
 'prepend_text': '',
 'last_job_info': {'title': 'aiida-896549',
  'job_id': '1219409',
  'raw_data': ['1219409',
   'R',
   'None',
   'h078',
   'ongari',
   '2',
   '72',
   'h[078,080]',
   'parallel',
   '3-00:00:00',
   '5:03:06',
   '2020-08-10T18:06:28',
   'aiida-896549',
   '2020-08-10T16:17:45'],
  'job_owner': 'ongari',
  'job_state': 'running',
  'annotation': 'None',
  'queue_name': 'parallel',
  'num_machines': 2,
  'num_mpiprocs': 72,
  'dispatch_time': {'date': '2020-08-10T18:06:28.000000', 'timezone': None},
  'submission_time': {'date': '2020-08-10T16:17:45.000000', 'timezone': None},
  'allocated_machines_raw': 'h[078,080]',
  'wallclock_time_seconds': 18186,
  'requested_wallclock_time_seconds': 259200},
 'process_label': 'Cp2kCalculation',
 'process_state': 'excepted',
 'retrieve_list': ['aiida.out',
  'aiida-1.restart',
  'aiida-pos-1.dcd',
  '_scheduler-stdout.txt',
  '_scheduler-stderr.txt'],
 'input_filename': 'aiida.inp',
 'remote_workdir': '/scratch/ongari/aiida_run/d4/11/b7ad-c962-4997-bab6-0788b3fdec49',
 'output_filename': 'aiida.out',
 'scheduler_state': 'done',
 'scheduler_stderr': '_scheduler-stderr.txt',
 'scheduler_stdout': '_scheduler-stdout.txt',
 'detailed_job_info': {'retval': 0,
  'stderr': '',
  'stdout': 'AllocCPUS|Account|AssocID|AveCPU|AvePages|AveRSS|AveVMSize|Cluster|Comment|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobName|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOSRAW|ReqCPUS|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|\n72|lsmo|882|||||helvetios||15-10:15:36|1332936|0:0|05:08:33|2020-08-10T16:17:45|Unknown|0:0|10697|lsmo|1219409|aiida-896549||||||||||72|2|h[078,080]||5838|parallel|1|72|01:48:43|5-10:27:36|469656|2020-08-10T18:06:28|RUNNING|2020-08-10T16:17:45|00:00:00|00:00:00|3-00:00:00|00:00:00|162182|ongari|00:00:00|\n36|lsmo|882|||||helvetios||7-17:07:48|666468||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.batch|batch||||||||||36|1|h078|1||||36||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|\n72|lsmo|882|||||helvetios||15-10:15:36|1332936||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.extern|extern||||||||||72|2|h[078,080]|2||||72||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|\n72|lsmo|882|||||helvetios||15-10:13:12|1332792||05:08:31|2020-08-10T18:06:30|Unknown|0:0|||1219409.0|cp2k.popt||||||||||72|2|h[078,080]|72||||72||||2020-08-10T18:06:30|RUNNING|2020-08-10T18:06:30|00:00:00|00:00:00||00:00:00|||00:00:00|\n'},
 'mpirun_extra_params': [],
 'environment_variables': {},
 'max_wallclock_seconds': 259200,
 'import_sys_environment': True,
 'submit_script_filename': '_aiidasubmit.sh',
 'retrieve_temporary_list': [],
 'scheduler_lastchecktime': '2020-08-10T21:12:31.029024+00:00',
 'custom_scheduler_commands': ''}

The "smoking gun" would seem to be in the detailed job info, which clearly specifies that, according to slurm, the job is still in state RUNNING:

AllocCPUS|Account|AssocID|AveCPU|AvePages|AveRSS|AveVMSize|Cluster|Comment|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobName|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOSRAW|ReqCPUS|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|
72|lsmo|882|||||helvetios||15-10:15:36|1332936|0:0|05:08:33|2020-08-10T16:17:45|Unknown|0:0|10697|lsmo|1219409|aiida-896549||||||||||72|2|h[078,080]||5838|parallel|1|72|01:48:43|5-10:27:36|469656|2020-08-10T18:06:28|RUNNING|2020-08-10T16:17:45|00:00:00|00:00:00|3-00:00:00|00:00:00|162182|ongari|00:00:00|
36|lsmo|882|||||helvetios||7-17:07:48|666468||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.batch|batch||||||||||36|1|h078|1||||36||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|
72|lsmo|882|||||helvetios||15-10:15:36|1332936||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.extern|extern||||||||||72|2|h[078,080]|2||||72||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|
72|lsmo|882|||||helvetios||15-10:13:12|1332792||05:08:31|2020-08-10T18:06:30|Unknown|0:0|||1219409.0|cp2k.popt||||||||||72|2|h[078,080]|72||||72||||2020-08-10T18:06:30|RUNNING|2020-08-10T18:06:30|00:00:00|00:00:00||00:00:00|||00:00:00|

For context: the detailed job info is recorded inside task_retrieve_job right before the files of the calculation are retrieved: https://github.com/aiidateam/aiida-core/blob/855ae82e42b8a50e6c507fe9083187a22fe2cfea/aiida/engine/processes/calcjobs/tasks.py#L258 and actually uses sacct for the slurm scheduler https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L233

When looking at the content of the folder today, despite the failed parsing step, the cp2k calculation on the cluster actually finished without issues.

This is compatible with my suspicion that this issue occurs when AiiDA is checking squeue (either on the aiida side or, very possibly, on the slurm side).

ltalirz commented 4 years ago

From our discussion, possible explanations include:

1) for whatever reason, the output of squeue is too long for the ssh buffer (~4 MB according to @giovannipizzi). This is possible but unlikely, since squeue tries to query the jobs per user: https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L206-L207

2) perhaps more likely: the output from squeue is incomplete and still parsed by aiida.

Re 2. It turns out that AiiDA gets the return value of the joblist command and forwards it to the _parse_joblist_output function of the scheduler plugin, but the function for slurm ignores a nonzero exit status if stderr is empty https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L485-L499

This seems a dangerous practice to me and would probably be quite straightforward to improve.

P.S. Unfortunately, we don't have the complete daemon logs from the time of the failed calculation anymore to check exactly what was printed there. Searching through the recent log history, however, it does seem that this part of the code does catch socket timeouts on squeue from time to time, e.g.:

08/21/2020 09:12:45 AM <26803> aiida.scheduler.slurm: [WARNING] Warning in _parse_joblist_output, non-empty stderr='slurm_load_jobs error: Socket timed out on send/recv operation'
08/21/2020 09:12:45 AM <26803> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/daniele/aiida1/aiida_core/aiida/engine/transports.py", line 103, in request_transport
    yield transport_request.future
  File "/home/daniele/aiida1/aiida_core/aiida/engine/processes/calcjobs/manager.py", line 106, in _get_jobs_from_scheduler
    scheduler_response = scheduler.get_jobs(**kwargs)
  File "/home/daniele/aiida1/aiida_core/aiida/schedulers/scheduler.py", line 340, in get_jobs
    joblist = self._parse_joblist_output(retval, stdout, stderr)
  File "/home/daniele/aiida1/aiida_core/aiida/schedulers/plugins/slurm.py", line 499, in _parse_joblist_output
    raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')
aiida.schedulers.scheduler.SchedulerError: Error during squeue parsing (_parse_joblist_output function)

In these cases, stderr was non-empty and the scheduler error was raised as designed - but perhaps under high load / special circumstances, stderr can remain empty?

ltalirz commented 4 years ago

@giovannipizzi Can the logic be changed to something like:

if 'invalid job id specified' not in stderr:
    if stderr.strip():
        self.logger.warning("Warning in _parse_joblist_output, non-empty stderr='{}'".format(stderr.strip()))
    if retval != 0:
        raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')

From your comments in the code it is not entirely clear to me whether the error raised by squeue when specifying job ids of jobs that are no longer known to squeue also results in some identifiable trace in stderr. In my experiments on fidis (slurm 19.05.3-2), I cannot even get a non-zero exit code when providing non-existing job ids to squeue:

$ squeue -u ongari
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5117691    serial aiida-95   ongari  R    2:10:49      1 f004
           5113933    serial aiida-95   ongari  R    3:46:09      1 f004
$ squeue -u ongari --jobs=5117691,5113933
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5117691    serial aiida-95   ongari  R    2:11:04      1 f004
           5113933    serial aiida-95   ongari  R    3:46:24      1 f004
$ squeue -u ongari --jobs=51133333,5117691
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5117691    serial aiida-95   ongari  R    2:11:42      1 f004
$ echo $?
0

Same with slurm 20.02.2 at CSCS

sphuber commented 4 years ago

Thanks for the writeup @ltalirz. I agree with the analysis that the detailed_job_info containing RUNNING for the status means the job status in the update task was incorrect. It indeed seems that the error is in the SLURM plugin that incorrectly parses the output. @ltalirz would you be willing to split your analysis off into a separate issue, and we keep this one for the original issue reported, where the same job is submitted twice in the same working directory?

ltalirz commented 4 years ago

Sure, opening https://github.com/aiidateam/aiida-core/issues/4326 and moving my comments there (I'll hide them here to not make this thread longer than it needs to be).