cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

run-db IntegrityError #920

Closed hjoliver closed 10 years ago

hjoliver commented 10 years ago

From Xiao (Y.Xiao@bom.gov.au)

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/apps/python/2.7.5/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/rundb.py", line 138, in run
    raise e
IntegrityError: columns name, cycle are not unique
Traceback (most recent call last):
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/run.py", line 76, in main
    server.run()
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/scheduler.py", line 1092, in run
    raise self.db.c.exception
IntegrityError: columns name, cycle are not unique
ERROR CAUGHT: cleaning up before exit
2014/03/12 11:56:08 WARNING - some active tasks will be orphaned

@arjclark - could you cast your eye over this?

hjoliver commented 10 years ago

Playing with sqlite3, I get "Error: columns name, cycle are not unique" if I try to insert a new row into task_states when name+cycle already exist in the table; or if I try to update an existing row by name only when the same name is already recorded with several cycles.
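
For reference, a minimal Python sketch that reproduces this behaviour against a simplified table (the UNIQUE(name, cycle) constraint here is an assumption inferred from the error text, not the actual cylc task_states schema):

import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed simplified schema for illustration only.
conn.execute("CREATE TABLE task_states(name TEXT, cycle TEXT, status TEXT, "
             "UNIQUE(name, cycle))")
conn.execute("INSERT INTO task_states VALUES (?, ?, ?)",
             ("Ops_ass", "2014031002", "submitted"))
try:
    # Inserting the same (name, cycle) pair again violates the constraint.
    conn.execute("INSERT INTO task_states VALUES (?, ?, ?)",
                 ("Ops_ass", "2014031002", "running"))
except sqlite3.IntegrityError as exc:
    # Older sqlite versions report "columns name, cycle are not unique";
    # newer ones say "UNIQUE constraint failed: ...".
    print(exc)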

hjoliver commented 10 years ago

Bottom of Xiao's suite log:

2014/03/12 11:56:04 INFO - [ConfigLS.2014031002] -Task host: ngamai01.bom.gov.au
2014/03/12 11:56:04 INFO - [Ops_radwind.2014031002] -Task host: ngamai01.bom.gov.au
2014/03/12 11:56:04 INFO - [Ops_amv.2014031002] -Task host: ngamai01.bom.gov.au
2014/03/12 11:56:05 INFO - [Ops_ass.2014031002] -(current:ready)> Ops_ass.2014031002 submitting now
2014/03/12 11:56:05 INFO - [Ops_atovs.2014031002] -(current:ready)> Ops_atovs.2014031002 submitting now
2014/03/12 11:56:05 INFO - [ConfigLS.2014031002] -(current:ready)> ConfigLS.2014031002 submitting now
2014/03/12 11:56:05 INFO - [Ops_radwind.2014031002] -(current:ready)> Ops_radwind.2014031002 submitting now
2014/03/12 11:56:05 INFO - [Ops_amv.2014031002] -(current:ready)> Ops_amv.2014031002 submitting now
2014/03/12 11:56:06 INFO - [a_pa.2014031001] -(current:running)> for a_pa, arc2mars finished for 2014031001 at 2014-03-12T11:56:05
2014/03/12 11:56:06 INFO - [Ops_ascat.2014031002] -(current:ready)> Ops_ascat.2014031002 submission succeeded
2014/03/12 11:56:06 INFO - [Ops_ascat.2014031002] -(current:submitted)> Ops_ascat.2014031002 submit_method_id=5658920
2014/03/12 11:56:06 INFO - [Ops_atovs.2014031002] -(current:ready)> Ops_atovs.2014031002 submission succeeded
2014/03/12 11:56:06 INFO - [Ops_atovs.2014031002] -(current:submitted)> Ops_atovs.2014031002 submit_method_id=5658917
2014/03/12 11:56:06 INFO - [Ops_radwind.2014031002] -(current:ready)> Ops_radwind.2014031002 submission succeeded
2014/03/12 11:56:06 INFO - [Ops_radwind.2014031002] -(current:submitted)> Ops_radwind.2014031002 submit_method_id=5658921
2014/03/12 11:56:07 INFO - [Ops_ass.2014031002] -(current:ready)> Ops_ass.2014031002 submission succeeded
2014/03/12 11:56:07 INFO - [Ops_ass.2014031002] -(current:submitted)> Ops_ass.2014031002 submit_method_id=5658918
2014/03/12 11:56:07 INFO - [Ops_amv.2014031002] -(current:ready)> Ops_amv.2014031002 submission succeeded
2014/03/12 11:56:07 INFO - [Ops_amv.2014031002] -(current:submitted)> Ops_amv.2014031002 submit_method_id=5658922
2014/03/12 11:56:08 INFO - [ConfigLS.2014031002] -(current:ready)> ConfigLS.2014031002 submission succeeded
2014/03/12 11:56:08 INFO - [ConfigLS.2014031002] -(current:submitted)> ConfigLS.2014031002 submit_method_id=5658919
2014/03/12 11:56:08 INFO - Suite shutting down at 2014-03-12 11:56:08.609640 (ERROR: columns name, cycle are not unique)
2014/03/12 11:56:08 WARNING - some active tasks will be orphaned
2014/03/12 11:56:08 INFO - Thread-4 exit (Job Submission)
2014/03/12 11:56:09 INFO - Thread-2 exit (Event Handlers)
2014/03/12 11:56:09 INFO - Thread-3 exit (Poll & Kill Commands)
2014/03/12 11:56:10 INFO - Thread-5 exit (Request Handling)

hjoliver commented 10 years ago

Bottom of stdout

SUBMIT No.1(1,1): ssh -o BatchMode=yes ngamai01.bom.gov.au 'test -f /etc/profile && . /etc/profile 1>/dev/null 2>&1; test -f $HOME/.profile && . $HOME/.profile 1>/dev/null 2>&1;  mkdir -p $(dirname $HOM
SUBMIT No.1(1,1): ssh -o BatchMode=yes ngamai01.bom.gov.au 'test -f /etc/profile && . /etc/profile 1>/dev/null 2>&1; test -f $HOME/.profile && . $HOME/.profile 1>/dev/null 2>&1;  mkdir -p $(dirname $HOM
SUBMIT No.1(1,1): ssh -o BatchMode=yes ngamai01.bom.gov.au 'test -f /etc/profile && . /etc/profile 1>/dev/null 2>&1; test -f $HOME/.profile && . $HOME/.profile 1>/dev/null 2>&1;  mkdir -p $(dirname $HOM
[Ops_ascat.2014031002 job submission] Your job 5658920 ("Ops_ascat.2014031002") has been submitted
[Ops_atovs.2014031002 job submission] Your job 5658917 ("Ops_atovs.2014031002") has been submitted
[Ops_radwind.2014031002 job submission] Your job 5658921 ("Ops_radwind.2014031002") has been submitted
[Ops_ass.2014031002 job submission] Your job 5658918 ("Ops_ass.2014031002") has been submitted
[Ops_amv.2014031002 job submission] Your job 5658922 ("Ops_amv.2014031002") has been submitted
[ConfigLS.2014031002 job submission] Your job 5658919 ("ConfigLS.2014031002") has been submitted
Suite shutting down at 2014-03-12 11:56:08.609640 (ERROR: columns name, cycle are not unique)
Thread-4 exit (Job Submission)
Thread-2 exit (Event Handlers)
Thread-3 exit (Poll & Kill Commands)
DONE

hjoliver commented 10 years ago

(I don't think the log and out files have any useful information in them).

matthewrmshin commented 10 years ago

Error: columns name, cycle are not unique

Need to do insert or replace.
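
A small sketch of the difference, against the same assumed simplified schema as above (illustration only, not the real cylc code):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_states(name TEXT, cycle TEXT, status TEXT, "
             "UNIQUE(name, cycle))")  # assumed schema, for illustration only
conn.execute("INSERT INTO task_states VALUES ('Ops_ass', '2014031002', 'submitted')")
# INSERT OR REPLACE overwrites the conflicting row instead of raising
# IntegrityError, so the suite would carry on rather than shutting down.
conn.execute("INSERT OR REPLACE INTO task_states "
             "VALUES ('Ops_ass', '2014031002', 'running')")
print(conn.execute("SELECT * FROM task_states").fetchall())  # one row, status 'running'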

arjclark commented 10 years ago

@hjoliver - that's what would cause it, but I'm not sure how Xiao's suite ended up in this situation - do you know if some combination of warm starts and/or reloads has occurred? "insert or replace" would prevent the error being thrown, but a task shouldn't be able to add itself to the database as a new entry if it already exists, so I'd like to get to the bottom of the cause.

arjclark commented 10 years ago

Also, I suppose it's possible that manual editing of task names in the database while it was running could cause this.

arjclark commented 10 years ago

Any inserts and/or purges?

yccxx commented 10 years ago

I did not modify the suite while it was running, so no reload.

The suite was restarted with a "warm" start on 11/03, after "err.2:2014/03/11 21:32:26 CRITICAL - [Errno 9] Bad file descriptor", which caused the suite to freeze. The suite had been running smoothly since then until a self shutdown on 13/03. No other rerun of any kind between 11 and 13.

It is odd that, when I looked at the suite err file since the restart, "Bad file descriptor" happened twice:

err:2014/03/13 14:10:03 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/13 21:39:29 CRITICAL - [Errno 9] Bad file descriptor

but the suite ran OK yesterday. Why does "Bad file descriptor" happen at a similar time of day (11/03 and 13/03)?

I am not sure if "Bad file descriptor" has anything to do with the self shutdown problem.

matthewrmshin commented 10 years ago

We need to figure out what causes the [Errno 9] Bad file descriptor error.

arjclark commented 10 years ago

Agreed with @matthewrmshin - we need to get to the bottom of the "Bad file descriptor" problem and find out what is going on there regarding which files are concerned. I'll see if I can break anything with a warm start on my end.

arjclark commented 10 years ago

@hjoliver - investigating warm starts I haven't been able to reproduce the problem. It looks like there's a serious file system problem on Xiao's end which is affecting the integrity of sqlite.

The only place I can see the sqlite error arising is in the insert into task_states part of rundb.py, which should only be invoked if a task does not already exist in the database (lines 241 - 244 of task.py). I'm not keen on using insert or replace as a "fix" as this only masks a more serious underlying problem.

Talking to @matthewrmshin we think tracking down the Bad file descriptor error cause is the priority and will hopefully solve whatever circumstance led to the sqlite error observed.

yccxx commented 10 years ago

I went to look at the suite log and err files, and it seems that I get "Bad file descriptor" very often. E.g.:

[yix@twister suite]$ grep -i bad *
err:2014/03/13 14:10:03 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/13 21:39:29 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/14 10:50:00 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/14 23:40:41 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/15 18:56:51 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/16 00:11:48 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/16 07:00:15 CRITICAL - [Errno 9] Bad file descriptor
err:2014/03/16 18:47:32 CRITICAL - [Errno 9] Bad file descriptor
log:2014/03/14 23:40:41 CRITICAL - [Errno 9] Bad file descriptor
log:2014/03/15 18:56:51 CRITICAL - [Errno 9] Bad file descriptor
log:2014/03/16 00:11:48 CRITICAL - [Errno 9] Bad file descriptor
log:2014/03/16 07:00:15 CRITICAL - [Errno 9] Bad file descriptor
log:2014/03/16 18:47:32 CRITICAL - [Errno 9] Bad file descriptor

Looking at what happened then, the log file shows a "job submission failed". E.g.:

2014/03/16 07:00:15 INFO - [Analysis_G0.2014031501] -(current:ready)> Analysis_G0.2014031501 submitting now
2014/03/16 07:00:15 CRITICAL - [Errno 9] Bad file descriptor
2014/03/16 07:00:15 WARNING - Analysis_G0.2014031501 job submission failed
2014/03/16 07:00:16 CRITICAL - [Analysis_G0.2014031501] -(current:ready)> Analysis_G0.2014031501 submission failed
2014/03/16 07:00:16 INFO - [Analysis_G0.2014031501] -job submission failed, retrying in 1.0 minutes
2014/03/16 07:01:17 INFO - [Analysis_G0.2014031501] -Task host: ngamai01.bom.gov.au
2014/03/16 07:01:18 INFO - [Analysis_G0.2014031501] -(current:ready)> Analysis_G0.2014031501 submitting now
2014/03/16 07:01:19 INFO - [Analysis_G0.2014031501] -(current:ready)> Analysis_G0.2014031501 submission succeeded
2014/03/16 07:01:19 INFO - [Analysis_G0.2014031501] -(current:submitted)> Analysis_G0.2014031501 submit_method_id=5819178

I am not sure which one caused which: "Bad file descriptor" -> submission failed, or submission failed -> "Bad file descriptor"? If anyone is interested in looking through the log/err files, I can email them.

hjoliver commented 10 years ago

Xiao - "job submission failed" is good - that's what's supposed to happen now if you get a "Bad file descriptor" error during job submission. We don't understand the root cause of "Bad file descriptor" but it is nevertheless a job submission failure so it should put the task in "submit-failed" state (or trigger retries as in your suite). When you first encountered "Bad file descriptor" it caused the cylc job submission thread to die and prevented all subsequent job submissions - so at least we've fixed that!

matthewrmshin commented 10 years ago

In which case, I guess the [Errno 9] Bad file descriptor problem is not directly related to the database integrity problem. (Or they can be symptoms of related problems?)

@yccxx please feel free to send your logs to us via the usual email address, i.e. fcm-team@....

yccxx commented 10 years ago

Unfortunately the "self shutdown" happened again this morning. The log shows:

2014/03/18 23:07:58 INFO - Suite shutting down at 2014-03-18 23:07:58.854750 (ERROR: columns name, cycle are not unique)
2014/03/18 23:07:58 WARNING - some active tasks will be orphaned
2014/03/18 23:07:58 INFO - Thread-4 exit (Job Submission)
2014/03/18 23:07:58 INFO - Thread-2 exit (Event Handlers)
2014/03/18 23:07:59 INFO - Thread-3 exit (Poll & Kill Commands)
2014/03/18 23:07:59 INFO - Thread-5 exit (Request Handling)

and err shows:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/apps/python/2.7.5/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/rundb.py", line 138, in run
    raise e
IntegrityError: columns name, cycle are not unique

Traceback (most recent call last):
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/run.py", line 76, in main
    server.run()
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/scheduler.py", line 1092, in run
    raise self.db.c.exception
IntegrityError: columns name, cycle are not unique
ERROR CAUGHT: cleaning up before exit
2014/03/18 23:07:58 WARNING - some active tasks will be orphaned
THE ERROR WAS:
columns name, cycle are not unique
use --debug to turn on exception tracebacks)

I am packing up the err/log/out files to send to the fcm team and Hilary. There was disk activity on the task host today, which may have affected $HOME (a switch from the backup home to $HOME). The suite has stopped, and it could not be started with a "restart". I will need to warm start it.

yccxx commented 10 years ago

I warm started the suite, and it shut itself down again very quickly.

yccxx commented 10 years ago

Looking at the 2nd self shutdown, the log file has this message:

2014/03/18 23:55:48 INFO - Suite shutting down at 2014-03-18 23:55:48.261980 (ERROR: [Errno 9] Bad file descriptor)

But the first time, the log shows:

2014/03/18 23:07:58 INFO - Suite shutting down at 2014-03-18 23:07:58.854750 (ERROR: columns name, cycle are not unique)

so they appear to be for different reasons.

yccxx commented 10 years ago

cylc self shutdown again! The suite err shows:

Traceback (most recent call last):
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/run.py", line 76, in main
    server.run()
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/scheduler.py", line 1029, in run
    self.state_dumper.dump()
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/lib/cylc/suite_state_dumping.py", line 89, in dump
    os.fsync(handle.fileno())
OSError: [Errno 9] Bad file descriptor
ERROR CAUGHT: cleaning up before exit
2014/03/18 23:55:48 WARNING - some active tasks will be orphaned
THE ERROR WAS:
[Errno 9] Bad file descriptor
use --debug to turn on exception tracebacks)
close failed in file object destructor:
IOError: [Errno 9] Bad file descriptor
Traceback (most recent call last):
  File "/cw/keep/cawcr_share/access/apps/cylc/5.4.10/bin/cylc-run", line 220, in <module>
    main("run", start)
SystemError: error return without exception set

Any suggestions on what can be done to avoid this self shutdown?

hjoliver commented 10 years ago

@yccxx - previously you had "Bad file descriptor" errors during job submission; this time it is in the code that writes out state dump files, which simply opens a file, writes some lines to it, calls os.fsync() on it, and then closes it. I don't see much room for any bugs in there. Perhaps you are having system disk problems, or have run out of disk quota again? (Any "no space left on device" errors, or similar, on the suite host?) Any thoughts @matthewrmshin?

yccxx commented 10 years ago

Hi Hilary, yes - the self shutdowns that happened (twice) early this week most likely have something to do with the disk, because at the time the HPC was doing some failover tests and $HOME was switched between $HOME and $HOME_backup. But today it is Errno 9 again. I will send you the outputs. Thanks

matthewrmshin commented 10 years ago

@hjoliver My feeling is that there is very little we can do when there is a disk error (when the suite is trying to update the runtime database file or the state dumps), apart from shutting down.

However, we have 2 issues here:

1. When the disk misbehaves, the suite has little choice but to shut down.
2. The resulting error message does not say which file was being written when the failure occurred.

We can improve the latter by catching IOError in strategic places and setting its .filename attribute.
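
A rough sketch of what that could look like (illustrative only; the function and file handling here are not the actual suite_state_dumping.py code):

import os

def dump_state(path, lines):
    """Write a state dump, naming the file in any disk error that escapes."""
    try:
        handle = open(path, "w")
        try:
            handle.write("\n".join(lines) + "\n")
            os.fsync(handle.fileno())  # the call that raised [Errno 9] above
        finally:
            handle.close()
    except (IOError, OSError) as exc:
        # Attach the file name so the shutdown message says which file failed.
        if not getattr(exc, "filename", None):
            exc.filename = path
        raise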

hjoliver commented 10 years ago

@arjclark - we suffered the "database integrity error" here over the weekend. It happened on Saturday afternoon, more than 24 hours after the last intervention in the suite, which was a reload + manual task trigger. End of suite log:

2014/04/05 02:45:02 INFO - [csv_mos.2014040421] -(current:ready)> csv_mos.2014040421 submitting now
2014/04/05 02:45:02 INFO - [csv_mos.2014040421] -(current:ready)> csv_mos.2014040421 started at 2014-04-05T02:44:55
2014/04/05 02:45:02 INFO - [csv_mos.2014040421] -(current:running)> csv_mos.2014040421 submission succeeded
2014/04/05 02:45:02 INFO - [csv_mos.2014040421] -(current:running)> csv_mos.2014040421 submit_method_id=fc-1.64328
2014/04/05 02:45:09 INFO - [ncl_eps.2014040421] -(current:submitted)> ncl_eps.2014040421 started at 2014-04-05T02:45:00
2014/04/05 02:45:09 INFO - [csv_mos.2014040421] -(current:running)> csv_mos.2014040421 succeeded at 2014-04-05T02:45:04
2014/04/05 02:45:17 INFO - [ingest_site_mos.2014040421] -(current:ready)> ingest_site_mos.2014040421 submitting now
2014/04/05 02:45:17 INFO - [ingest_site_mos.2014040421] -(current:ready)> ingest_site_mos.2014040421 submission succeeded
2014/04/05 02:45:17 INFO - [ingest_site_mos.2014040421] -(current:submitted)> ingest_site_mos.2014040421 submit_method_id=fc-2.61194
2014/04/05 02:45:17 INFO - Suite shutting down at 2014-04-05 02:45:17.127095 (ERROR: columns name, cycle are not unique)
2014/04/05 02:45:17 WARNING - some active tasks will be orphaned
2014/04/05 02:45:17 INFO - Thread-4 exit (Job Submission)
2014/04/05 02:45:28 INFO - Thread-2 exit (Event Handlers)
2014/04/05 02:45:31 INFO - Thread-3 exit (Poll & Kill Commands)
2014/04/05 02:45:32 INFO - Thread-5 exit (Request Handling)

End of suite err:

2014/04/04 20:29:40 WARNING - [archiving.2014040415] -Assuming non-reported outputs were completed:
archiving.2014040415 submitted
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/gpfs_oper/filesets/oper/admin/um_fcm/cylc/cylc-5.4.12/lib/cylc/rundb.py", line 138, in run
    raise e
IntegrityError: columns name, cycle are not unique

Traceback (most recent call last):
  File "/gpfs_oper/filesets/oper/admin/um_fcm/cylc/cylc-5.4.12/lib/cylc/run.py", line 76, in main
    server.run()
  File "/gpfs_oper/filesets/oper/admin/um_fcm/cylc/cylc-5.4.12/lib/cylc/scheduler.py", line 1078, in run
    raise self.db.c.exception
IntegrityError: columns name, cycle are not unique
ERROR CAUGHT: cleaning up before exit
2014/04/05 02:45:17 WARNING - some active tasks will be orphaned
THE ERROR WAS:
columns name, cycle are not unique

No filesystem problems at the time, as far as we can tell.

yccxx commented 10 years ago

Hello, Hilary

So it is not just at BOM now.

I have disabled "rose task-hook" in the suite - so no log files are copied back to the suite host. Since I did that, I have seen the suite self-shutdown only once:

Suite shutting down at 2014-04-03 07:48:13.247548 (ERROR: database or disk is full)
Thread-4 exit (Job Submission)
Thread-2 exit (Event Handlers)
Thread-3 exit (Poll & Kill Commands)
DONE

My disk usage is not high on the suite host.

Xiao

hjoliver commented 10 years ago

@arjclark - this has to be coming from the call to task.py:record_db_state(), no? If so, can we do a live check on whether or not the task already exists in the task state table, before doing the record? (or "insert or replace" as Matt suggested above). Warning messages about possible reasons (a bug in cylc?) and possible consequences could be emitted. At the least, could we add more information to the exception string to say exactly what was being attempted when the error occurred?
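
A possible shape for that check, as a sketch only (the helper name, warning text and table layout are illustrative, not the cylc rundb/task.py API):

import logging

def record_task_state(conn, name, cycle, status):
    """Record a 'new' task, warning instead of dying if it is already there."""
    exists = conn.execute(
        "SELECT 1 FROM task_states WHERE name = ? AND cycle = ?",
        (name, cycle)).fetchone()
    if exists:
        # This "shouldn't happen"; flag the likely cylc bug but keep running.
        logging.warning("%s.%s is already in task_states; "
                        "INSERT converted to INSERT OR REPLACE", name, cycle)
        conn.execute("INSERT OR REPLACE INTO task_states VALUES (?, ?, ?)",
                     (name, cycle, status))
    else:
        conn.execute("INSERT INTO task_states VALUES (?, ?, ?)",
                     (name, cycle, status))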

arjclark commented 10 years ago

@hjoliver - agreed, will put up a pull request for this. I'm still concerned that this is masking something going seriously wrong behind the scenes in cylc, as this shouldn't be able to happen in my original implementation of the database code.

@yccxx - that looks like a different problem:

... (ERROR: database or disk is full)

as opposed to:

... (ERROR: columns name, cycle are not unique)

Can we get some more info from the error logs for that?

matthewrmshin commented 10 years ago

@arjclark and @hjoliver One thing we are missing is any information about which SQL statement causes the failure. We should print that out in our diagnostics on catching the exception.
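
Something along these lines would do it (a sketch, with an illustrative wrapper name):

import sqlite3

def execute_logged(conn, stmt, args=()):
    """Execute a statement, reporting the exact SQL and arguments on failure."""
    try:
        return conn.execute(stmt, args)
    except sqlite3.Error as exc:
        print("suite db error: %s" % exc)
        print("while executing: %s with args %r" % (stmt, args))
        raise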

arjclark commented 10 years ago

@hjoliver - I don't suppose you still have the state dump from that point of failure, do you? If so, could you check it for any duplicate (task, cycle) pairings?

arjclark commented 10 years ago

I've raised pull request #935 to address the symptoms of the integrity error and provide some more helpful warning and error messages when this occurs. Once that's in, sites experiencing this problem should keep an eye out for the new integrity warning, with a view to identifying what sequence of events has led to this happening.

hjoliver commented 10 years ago

I don't suppose you still have the state dump from that point of failure, do you?

I saved the logs and the run db before starting the suite again. I couldn't see anything untoward in the db (but then I'm not very adept at sqlite queries). Will take a look at the state file when I get in tomorrow.

hjoliver commented 10 years ago

@arjclark - do you want to update this ticket in light of the problem recently occurring at the Met Office too (as discussed in Hamburg)?

matthewrmshin commented 10 years ago

I can give some detail on the symptom.

arjclark commented 10 years ago

@matthewrmshin has summarised pretty much all we know.

hjoliver commented 10 years ago

At time t2, suite decides to submit job 2 of the same task for some reason

Do you have conclusive evidence of this (from the suite log?) - i.e. that the suite actually submitted the same task (with same try number?) twice? If so, that's encouraging in a way - it implies a bug in cylc that we ought to be able to solve, not some mysterious misbehaviour of sqlite.

matthewrmshin commented 10 years ago

Do you have conclusive evidence of this (from the suite log?)

Yes, the evidence is from the suite log.

arjclark commented 10 years ago

@hjoliver - I'm closing this as #528 should fix Xiao's restart bug, and #935 makes use of INSERT OR REPLACE, which will stop this error occurring on the database side. I believe the restructuring of the task pool at cylc 6 should prevent the secondary task instance from happening.

If we're still concerned about the secondary task issue (which you'd see evidence of in the logs via an "INSERT converted to INSERT OR REPLACE" entry), then we can raise a follow-up issue, as the database IntegrityError itself is fixed.