dmwm / PHEDEX

CMS data-placement suite

Download agent improperly abandoning jobs #914

Open ericvaandering opened 10 years ago

ericvaandering commented 10 years ago

Original Savannah ticket 98318 reported by None on Fri Oct 19 06:12:39 2012.

Investigating an issue with FTS, I just found out that the download agent does not respect the intended timeout value for abandoning transfer jobs in certain situations.

Intended behaviour: if the job state poll command (e.g. glite-transfer-status) fails consistently for more than "timeout" (default 1h), abandon the job. If the state poll succeeds at least once per hour, don't abandon the job.

Actual behaviour: if the job state poll command fails even once, and the job state hasn't changed in the last "timeout" interval (default 1h), abandon the job immediately.

This is a lucky "feature" for us, since there is currently an issue with FTS that causes jobs to be stuck in Ready state forever, and without this bug they would never be marked as abandoned and queued for resubmission. However, it also causes the download agent to improperly abandon jobs that have been in Active state for more than one hour...
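
To make the difference between the intended and the actual behaviour concrete, here is a minimal Python sketch of the two abandonment checks as described above. It is not the download agent's code: the function names, the argument names (`last_successful_poll`, `last_state_change`, `poll_failed`) and the use of plain epoch seconds are all assumptions made for illustration.

```python
# Illustrative sketch only; all names here are hypothetical, not PhEDEx identifiers.
DEFAULT_TIMEOUT = 3600  # seconds; the issue gives the default timeout as 1h


def should_abandon_intended(now, last_successful_poll, timeout=DEFAULT_TIMEOUT):
    """Intended rule: abandon only if no state poll has succeeded
    for more than `timeout` seconds."""
    return now - last_successful_poll > timeout


def should_abandon_actual(now, poll_failed, last_state_change, timeout=DEFAULT_TIMEOUT):
    """Actual rule as reported: a single failed poll is enough to abandon
    the job if its state has not changed for more than `timeout` seconds."""
    return poll_failed and now - last_state_change > timeout


# A job that has been Active for two hours, polled successfully one minute ago,
# and whose current poll happens to fail.
now = 10_000
active_intended = should_abandon_intended(now, last_successful_poll=now - 60)
active_actual = should_abandon_actual(now, poll_failed=True, last_state_change=now - 7200)

# A job stuck in Ready for two hours whose polls keep succeeding.
stuck_intended = should_abandon_intended(now, last_successful_poll=now)
stuck_actual = should_abandon_actual(now, poll_failed=False, last_state_change=now - 7200)
```

Under these assumptions, the long-running Active job is kept by the intended rule but abandoned by the actual one after a single failed poll. The stuck Ready job is never abandoned by the intended rule while polls succeed, and the actual rule abandons it only once a poll happens to fail.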

Proposing a fix and a new feature: