dmwm / PHEDEX

CMS data-placement suite

Errors when approving two FNAL requests (possible duplicate subscriptions) #954

Open DAMason opened 10 years ago

DAMason commented 10 years ago

Greetings,

When I try to approve transfer requests 407431 and 407424 I get an error like:

""" Apologies, looks like we have an internal server error, details of which below. If the problem persists, please submit a bug report.

Error time=2013-12-14 17:26:36 UTC id=306eb01962ae825b712c8ab74db0a4fe

"""

Other requests that have come before and after these were fine. These seem to have been manually created by Julian -- in the comments I see:

This subscription need to be manually created due to failures in WMAgent. They belong to the following workflows:

pdmvserv_EXO-Fall13-00106_00026_v0131206_200618_2283
pdmvserv_EXO-Fall13-00120_00026_v0__131206_200622_2530
pdmvserv_EXO-Fall13-00130_00026_v0131206_200822_6140

Julian later reported that after he made these requests the agent recovered and made the subscriptions itself. These then became duplicates.

Thanks,

--Dave

TonyWildish commented 10 years ago

Hi Dave,

I'll take a look.

Cheers, Tony.

DAMason commented 10 years ago

Thanks -- will leave the requests alone for now -- though would be nice to clean them up at some point when you no longer need them for debugging...

DAMason commented 10 years ago

FWIW we have another request like this:

Request #410835

The error I got this last time trying to approve:

Apologies, looks like we have an internal server error, details of which below. If the problem persists, please submit a bug report.

Error time=2014-03-25 03:57:07 UTC id=ed1af18271b345447c087fc949602b6b

This and the other two referenced here are kinda left hanging -- what should be done with them?

Thanks!

--Dave

TonyWildish commented 10 years ago

Hi Dave,

sorry for the delay on this, I've had no time at all to look into it. I hope to get to it by the end of this week.

Cheers, Tony.

DAMason commented 10 years ago

OK -- seems we have another one -- in fact now about 4 of these guys stacked up at FNAL, the latest I just tried to approve again to give you a recent timestamp:

""" Apologies, looks like we have an internal server error, details of which below. If the problem persists, please submit a bug report.

Error time=2014-04-12 14:41:07 UTC id=ed1af18271b345447c087fc949602b6b

This is from request 412473 """

Apparently what's going on is that ops see that the agent has no record of a subscription being made for some datasets, so they then manually make the custodial subscription themselves. Currently the (FNAL) subscription requests I have in this state are the following:

407424 407431 410835 412473

Would be nice to at least know what can be done with them -- easiest is to just disapprove, but am leaving them around so that you might know what's going wonky here :)

Thanks!

TonyWildish commented 10 years ago

Hi Dave,

so, these are all indeed duplicate requests:

407424

cannot request replica transfer: /MuMinus_Pt-1to150_PositiveEndcap-gun/Fall13-POSTLS162_V1-v4/GEN-SIM already subscribed to T1_US_FNAL_MSS as move

407431

cannot request replica transfer: /WprimeToENu_M_3800_Tune4C_13TeV_pythia8/Fall13-POSTLS162_V1-v1/GEN-SIM already subscribed to T1_US_FNAL_MSS as move

410835

cannot request replica transfer: /QCD_Pt-120to170_MuEnrichedPt5_Tune4C_13TeV_pythia8/Fall13dr-tsg_PU20bx25_POSTLS162_V2-v1/AODSIM already subscribed to T1_US_FNAL_MSS as move

412473

/TZJetsTo3LNuB_FCNC_zeta_zut_8TeV_madgraph/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM already subscribed to T1_US_FNAL_MSS with different custodiality

you should go ahead and disapprove them.
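
The messages above show two kinds of conflict: an existing subscription for the same dataset and node that is a move, or one with a different custodiality. A minimal sketch of such a check, assuming a hypothetical in-memory `Subscription` record and a hypothetical `find_conflict` helper (neither is PhEDEx's actual schema or code), might look like this:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Subscription:
    """Hypothetical record; field names are illustrative, not PhEDEx's schema."""
    dataset: str
    node: str
    is_move: bool
    is_custodial: bool


def find_conflict(new: Subscription, existing: Iterable[Subscription]) -> Optional[str]:
    """Return a readable reason if `new` clashes with an existing subscription."""
    for sub in existing:
        # Only subscriptions for the same dataset at the same node can clash.
        if (sub.dataset, sub.node) != (new.dataset, new.node):
            continue
        if sub.is_move and not new.is_move:
            return (f"cannot request replica transfer: {new.dataset} "
                    f"already subscribed to {new.node} as move")
        if sub.is_custodial != new.is_custodial:
            return (f"{new.dataset} already subscribed to {new.node} "
                    f"with different custodiality")
    return None


# Placeholder dataset name; the node name is the one from this thread.
existing = [Subscription("/Primary/Processed/TIER", "T1_US_FNAL_MSS", True, True)]
manual = Subscription("/Primary/Processed/TIER", "T1_US_FNAL_MSS", False, True)
print(find_conflict(manual, existing))
```

Running a check like this before creating a manual subscription would flag the duplicate up front instead of at approval time.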

From my side, I need to examine the UpdateRequests API which is giving this error message. The API traps all errors and reports this generic error instead of the details, because it doesn't fully trust that the errors won't leak sensitive information. I can filter the useful error messages and just pass them on to the user.
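
One way to picture that filtering: the handler still traps every error, but lets through messages that match a short allowlist of shapes known to be safe. A minimal Python sketch, assuming hypothetical names (`SAFE_PATTERNS`, `user_facing_error`) and using only the message shapes seen in this thread as patterns; this is not the actual UpdateRequests code:

```python
import re

# Generic text shown when the real error is not known to be safe.
GENERIC_ERROR = ("Apologies, looks like we have an internal server error, "
                 "details of which below.")

# Assumed allowlist: only the message shapes quoted in this thread.
SAFE_PATTERNS = [
    re.compile(r"already subscribed to \S+ as move$"),
    re.compile(r"already subscribed to \S+ with different custodiality$"),
]


def user_facing_error(raw_message: str) -> str:
    """Pass allowlisted messages through; replace anything else with the generic text."""
    message = raw_message.strip()
    if any(pattern.search(message) for pattern in SAFE_PATTERNS):
        return message
    return GENERIC_ERROR
```

An allowlist (rather than a blocklist of sensitive strings) keeps the default conservative: any error that has not been explicitly reviewed still produces the generic text.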

So I've updated the title of this issue and will leave it open until it's fixed, hopefully in the first release after Easter.

Cheers, Tony.

DAMason commented 10 years ago

Hi Tony,

Thanks -- yes, passing a more instructive error message to the requestor would be the best thing here.

Thanks!

—Dave
