dmwm / PHEDEX

CMS data-placement suite

phedex should detect and refuse to operate multiple download agents on the same link #231

Open ericvaandering opened 11 years ago

ericvaandering commented 11 years ago

Original Savannah ticket 24217 reported by None on Tue Feb 27 11:04:55 2007.

When multiple download agents operate on the same link, all sorts of errors result. It would be nice if phedex detected this configuration conflict and refused to operate in a broken state.

Thanks, --Dan

ericvaandering commented 11 years ago

Comment by egeland on Tue Nov 20 08:28:20 2007

It is currently possible for two FileDownload agents to run for the same link. This causes all sorts of chaos, which confuses people and is difficult to track down. It could be prevented if each agent held a lock (stored in TMDB) for its link: an agent that tries to acquire a link that is already locked should print an error message and die.
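The lock proposed above could be sketched as follows. This is a minimal illustration, not PhEDEx code: the `t_link_lock` table, its columns, the helper names, and the node names in the usage note are all hypothetical, and an in-memory SQLite database stands in for TMDB. The primary key on the link makes a second agent's insert fail, at which point that agent reports the conflict and exits.

```python
import sqlite3
import sys


class LinkLockedError(Exception):
    """Raised when another agent already holds the lock for a link."""


def create_lock_table(db):
    # Hypothetical table: one row per link currently owned by an agent.
    db.execute(
        """
        create table if not exists t_link_lock (
            from_node text not null,
            to_node   text not null,
            owner     text not null,
            primary key (from_node, to_node)
        )
        """
    )


def acquire_link_locks(db, links, owner):
    """Lock every (from_node, to_node) link in one transaction, or none."""
    current = None
    try:
        with db:  # commits on success, rolls back if any insert fails
            for current in links:
                db.execute(
                    "insert into t_link_lock (from_node, to_node, owner) "
                    "values (?, ?, ?)",
                    (current[0], current[1], owner),
                )
    except sqlite3.IntegrityError:
        holder = db.execute(
            "select owner from t_link_lock where from_node = ? and to_node = ?",
            (current[0], current[1]),
        ).fetchone()
        holder_name = holder[0] if holder else owner
        raise LinkLockedError(
            f"link {current[0]} -> {current[1]} is already locked by {holder_name}"
        ) from None


def release_link_locks(db, owner):
    """Drop all locks held by this agent, e.g. on clean shutdown."""
    with db:
        db.execute("delete from t_link_lock where owner = ?", (owner,))


def start_agent(db, links, owner):
    """Acquire the locks, or print an error message and die."""
    try:
        acquire_link_locks(db, links, owner)
    except LinkLockedError as err:
        print(f"{owner}: {err}", file=sys.stderr)
        sys.exit(1)
```

With this sketch, `start_agent(db, [("NODE_A", "NODE_B")], "agent-1")` succeeds, and a second call for the same link under a different owner exits with an error. A real implementation would also need a way to clear stale locks left behind by a crashed agent, which this sketch does not address.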

ericvaandering commented 11 years ago

Comment by egeland on Mon May 3 08:31:20 2010

A recent instance of this blocked FilePump for 3 hours:

https://savannah.cern.ch/support/?113760

Nicolo obtained some debug information:

Increasing the Severity of this ancient bug, due to the global disruption of service demonstrated.

+verbatim+ IT-DB sent us the attached ASH report as a follow-up to this incident: https://savannah.cern.ch/support/?113760 (FilePump stuck for lock contention with Vanderbilt Download agents on Easter).

If I read the ASH report correctly, it seems that the statements in contention were:

From FilePump:

    merge into t_xfer_task_done xtd
    using (select id from t_xfer_task where :now >= time_expire) xt
    on (xtd.task = xt.id)
    when not matched then
      insert (task, report_code, xfer_code, time_xfer, time_update)
      values (xt.id, -1, -2, -1, :now)

From the Vanderbilt agent:

    insert into t_xfer_task_done
      (task, report_code, xfer_code, time_xfer, time_update)
    values (:p1, -2, -2, -1, :p2)

If I remember correctly, (report_code, xfer_code) = (-2, -2) means that the agent lost the transfer task, which is not surprising, since Vanderbilt admins accidentally started duplicate agents for the same set of links. So it seems that FilePump was trying to expire some tasks while the duplicate download agent was trying to report them as lost at the same time. -verbatim-
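The collision described in the report can be illustrated with a small sketch. It makes several assumptions: SQLite stands in for the Oracle-backed TMDB, the two tables are cut down to the columns that appear in the quoted statements, and `t_xfer_task_done` is assumed to allow only one row per task. SQLite has no `merge`, so the FilePump statement is rewritten as an equivalent `insert ... select ... where not exists`. A single SQLite connection cannot reproduce the blocking itself; the sketch only shows that both statements try to write a row for the same task. In Oracle, given such a key, the second session would typically wait on the first session's uncommitted row rather than fail immediately.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        create table t_xfer_task (id integer primary key, time_expire real);
        create table t_xfer_task_done (
            task integer primary key,  -- assumption: one done row per task
            report_code integer,
            xfer_code integer,
            time_xfer real,
            time_update real
        );
        insert into t_xfer_task (id, time_expire) values (1, 100.0);
        """
    )
    return db


def expire_tasks(db, now):
    """FilePump side: mark expired tasks that have no done row yet as (-1, -2)."""
    with db:
        db.execute(
            """
            insert into t_xfer_task_done
                (task, report_code, xfer_code, time_xfer, time_update)
            select xt.id, -1, -2, -1, :now
            from t_xfer_task xt
            where :now >= xt.time_expire
              and not exists (
                  select 1 from t_xfer_task_done xtd where xtd.task = xt.id
              )
            """,
            {"now": now},
        )


def report_lost(db, task, now):
    """Download agent side: report a task as lost, (-2, -2)."""
    with db:
        db.execute(
            """
            insert into t_xfer_task_done
                (task, report_code, xfer_code, time_xfer, time_update)
            values (:p1, -2, -2, -1, :p2)
            """,
            {"p1": task, "p2": now},
        )
```

Calling `expire_tasks(db, 200.0)` and then `report_lost(db, 1, 200.0)` on the same database raises `sqlite3.IntegrityError`, because both statements target the done row for task 1. In the opposite order, `expire_tasks` finds the row already present and inserts nothing.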