dmwm / PHEDEX

CMS data-placement suite

Race condition between task expiration in FilePump and task result upload in FileDownload #781

Open ericvaandering opened 11 years ago

ericvaandering commented 11 years ago

Original Savannah ticket 84169 reported by None on Mon Jul 11 12:48:55 2011.

The following alert is relatively common in the central FilePump agent logs (about twice per day):

+verbatim+ 2011-07-11 01:58:06: FilePump[13778]: alert: database error: DBD::Oracle::st execute failed: ORA-00001: unique constraint (CMS_TRANSFERMGMT.PK_XFER_TASK_DONE) violated (DBD ERROR: OCIStmtExecute) [for Statement "merge into t_xfer_task_done xtd using (select id from t_xfer_task where :now >= time_expire) xt on (xtd.task = xt.id) when not matched then insert (task, report_code, xfer_code, time_xfer, time_update) values (xt.id, -1, -2, -1, :now)" with ParamValues: :now=1310349480.75057] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322. -verbatim-

The alert is raised when the central FilePump agent tries to mark tasks as expired that a site's FileDownload agent has already marked as expired, or that FileDownload marked as completed just before they expired.
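The ordering behind the ORA-00001 alert can be reproduced in miniature. The sketch below is only an illustration: it uses Python's built-in sqlite3 module instead of Oracle, a reduced version of the two tables, and splits the FilePump merge into a separate "check" and "insert" step so that the gap between them is visible. The helper names (`make_db`, `expired_not_done`, `mark_expired`, `race`) are hypothetical, and `insert or ignore` is SQLite syntax used here as a stand-in for a conflict-tolerant insert, not something the PhEDEx agents are known to do.

```python
import sqlite3

def make_db():
    # Reduced stand-in for the two tables named in the alert.
    db = sqlite3.connect(":memory:")
    db.execute("create table t_xfer_task (id integer primary key, time_expire real)")
    db.execute(
        "create table t_xfer_task_done ("
        " task integer primary key, report_code integer, xfer_code integer,"
        " time_xfer real, time_update real)"
    )
    db.execute("insert into t_xfer_task values (1, 100.0)")
    return db

def expired_not_done(db, now):
    # "Check" half of the merge: expired tasks with no done row visible yet.
    rows = db.execute(
        "select id from t_xfer_task where ? >= time_expire"
        " and id not in (select task from t_xfer_task_done)",
        (now,),
    )
    return [r[0] for r in rows]

def mark_expired(db, tasks, now, tolerant=False):
    # "Insert" half of the merge, with the same constant codes as in the log.
    verb = "insert or ignore" if tolerant else "insert"
    db.executemany(
        verb + " into t_xfer_task_done"
        " (task, report_code, xfer_code, time_xfer, time_update)"
        " values (?, -1, -2, -1, ?)",
        [(t, now) for t in tasks],
    )

def race(tolerant):
    db = make_db()
    now = 200.0
    pending = expired_not_done(db, now)  # FilePump sees task 1 as not done
    # FileDownload reports the task as completed inside the gap.
    db.execute("insert into t_xfer_task_done values (1, 0, 0, ?, ?)", (now, now))
    try:
        mark_expired(db, pending, now, tolerant)
        outcome = "ok"
    except sqlite3.IntegrityError:
        outcome = "unique constraint violated"
    report = db.execute(
        "select report_code from t_xfer_task_done where task = 1"
    ).fetchone()[0]
    return outcome, report
```

In this sketch `race(False)` hits the primary-key violation, while `race(True)` skips the conflicting row and leaves the FileDownload result in place. Whether a tolerant insert is the right fix for the real agents is a separate question that the ticket leaves open.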

In addition to raising the alert, the central FilePump agent can get stuck waiting for the FileDownload agent to commit the transaction that inserts the transfer logfiles into t_xfer_error. That upload can be large because it contains the full transfer logs of the failed transfers, and the problem affects links with ~100% error/expiration rate.

Both FilePump and FileDownload will eventually recover automatically as the transfers expire, but it would be good to protect the central agent from getting stuck or skipping cycles.

ericvaandering commented 11 years ago

Comment by magini on Wed Jul 13 05:05:47 2011

Note: on the site agent side, this race condition generates the following warnings/alerts when the local agent tries to mark as done tasks that the central agents have already marked as expired:

FileDownload: +verbatim+ DBD::Oracle::st execute_array warning: ORA-24381: error(s) in array DML (DBD SUCCESS_WITH_INFO: OCIStmtExecute) [for Statement "insert into t_xfer_task_done (task, report_code, xfer_code, time_xfer, time_update) values (?, ?, ?, ?, ?)"] at /srv/localstage/phedex/sw/slc5_amd64_gcc434/cms/PHEDEX/PHEDEX_4_0_0/perl_lib/PHEDEX/Core/DB.pm line 322. -verbatim-

FileMSSMigrate: +verbatim+ 2011-07-12 16:26:09: FileDownload[29606]: alert: database error: DBD::Oracle::st execute failed: ORA-02291: integrity constraint (CMS_TRANSFERMGMT.FK_XFER_TASK_DONE_TASK) violated - parent key not found (DBD ERROR: OCIStmtExecute) [for Statement "insert into t_xfer_task_done (task, report_code, xfer_code, time_xfer, time_update) values (:task, 0, 0, :now, :now)" with ParamValues: :now=1310487966.7025, :task='146939456'] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322. -verbatim-

On the site agent side, this problem is relatively harmless: the agent will not get stuck, and the transfer will be marked as successful when the task is recreated by the central agents.
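The ORA-02291 case can be sketched the same way. This is a minimal illustration under stated assumptions: SQLite stands in for Oracle, the tables are reduced to the columns needed, the foreign key mirrors what the constraint name FK_XFER_TASK_DONE_TASK suggests, and `report_done` and `demo` are hypothetical helpers. The site agent reports a task as done after the central agents have already deleted the parent row; the insert fails, and the agent simply carries on.

```python
import sqlite3

def report_done(db, task, now):
    """Try to record a successful transfer; return False if the task is gone."""
    try:
        db.execute(
            "insert into t_xfer_task_done"
            " (task, report_code, xfer_code, time_xfer, time_update)"
            " values (?, 0, 0, ?, ?)",
            (task, now, now),
        )
        return True
    except sqlite3.IntegrityError:
        # Parent key not found: the task was removed before the report arrived.
        return False

def demo():
    db = sqlite3.connect(":memory:")
    db.execute("pragma foreign_keys = on")  # SQLite enforces FKs only when enabled
    db.execute("create table t_xfer_task (id integer primary key)")
    db.execute(
        "create table t_xfer_task_done ("
        " task integer primary key references t_xfer_task (id),"
        " report_code integer, xfer_code integer,"
        " time_xfer real, time_update real)"
    )
    db.execute("insert into t_xfer_task values (146939456)")
    db.execute("insert into t_xfer_task values (2)")
    # Central agents remove the expired task before the site agent reports it.
    db.execute("delete from t_xfer_task where id = 146939456")
    db.commit()
    now = 1310487966.7025
    return report_done(db, 146939456, now), report_done(db, 2, now)
```

Here the report for the deleted task is rejected and the report for the surviving task goes through, which matches the description of the site-side problem as harmless.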

ericvaandering commented 11 years ago

Comment by magini on Thu Mar 22 10:20:29 2012

Additional note: there is also a race condition when FilePump tries to expire a task while the FileRouter slow flush simultaneously tries to extend its expiration time. This results in a deadlock between the agents:

+verbatim+ 2012-03-15 19:19:46: FilePump[20143]: alert: database error: DBD::Oracle::st execute failed: ORA-00060: deadlock detected while waiting for resource (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_xfer_task xt where id in (select task from t_xfer_task_harvest)"] at /data/ProdNodes/PHEDEX/perl_lib/PHEDEX/Core/DB.pm line 322. -verbatim-
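A deadlock of this kind means that each session is waiting for a lock the other one holds. The sketch below is a simplified model, not a description of Oracle internals: it represents "who waits for whom" as a dictionary and looks for a cycle. The ticket does not say which rows each agent holds, so the wait-for edges in the usage example are assumptions, and `find_deadlock` is a hypothetical helper.

```python
def find_deadlock(waits_for):
    """Return the sessions forming a wait cycle as a list, or None if there is none.

    waits_for maps each blocked session to the session it is waiting on.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        current = waits_for.get(start)
        while current is not None:
            if current == start:
                return path  # walked back to the starting session: a cycle
            if current in seen:
                break  # joined a cycle that does not include the start
            seen.add(current)
            path.append(current)
            current = waits_for.get(current)
    return None

# Assumed scenario: FilePump waits on a row locked by the FileRouter slow
# flush, which in turn waits on a row locked by FilePump.
cycle = find_deadlock({"FilePump": "FileRouter", "FileRouter": "FilePump"})
no_cycle = find_deadlock({"FilePump": "FileRouter"})
```

With mutual waits the helper reports both agents as part of the cycle; with a one-way wait it reports nothing, since that is ordinary blocking that ends when the holder commits.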

N.