dmwm / PHEDEX

CMS data-placement suite

BlockAllocator holding locks on subscriptions for a long time #1032

Closed by nikmagini 7 years ago

nikmagini commented 8 years ago

Reported by Jean-Roch: the 'updatesubscription' API occasionally takes a long time, up to several minutes, sometimes hitting the timeout in the frontends:

 ERROR - phedexApi.updateSubscription failed for site: T1_IT_CNAF_Disk
ERROR - self.phedexCall with response:  Error - urllib2.HTTPError <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/auth/complete/phedex/datasvc/json/prod/updatesubscription">POST&nbsp;/auth/complete/phedex/datasvc/json/prod/updatesubscription</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

  URL: https://cmsweb.cern.ch/phedex/datasvc/json/prod/updatesubscription
  VALUES: {'node': u'T1_IT_CNAF_Disk', 'group': 'AnalysisOps', 'dataset': '/BprimeBToHB_M-700_TuneCUETP8M1_13TeV-madgraph-pythia8/RunIISpring16MiniAODv2-PUSpring16RAWAODSIM_reHLT_80X_mcRun2_asymptotic_v14-v1/MINIAODSIM'}
Thu Jun 30 09:17:02 2016

 succeeded on second trial. 
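
Since the call above succeeded on the second attempt, a client-side retry with backoff is one possible workaround while the lock contention persists. The following is only a minimal sketch: `call_with_retry` and its parameters are hypothetical names, not part of the PhEDEx client, and it assumes that repeating the update request is harmless.

```python
import time


def call_with_retry(call, attempts=3, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Run call() until it succeeds or the attempts are used up.

    Hypothetical helper: `call` is any zero-argument function that
    performs the request and raises on failure (e.g. on a 502).
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With the agent cycle at 5-10 minutes, a short backoff may not outlast the lock, so this only reduces the failure rate and does not address the cause.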

This seems to be caused by locks that the BlockAllocator agent holds on the T_DPS_SUBS_DATASET, T_DPS_SUBS_BLOCK and T_DPS_SUBS_PARAM tables while updating block destinations, which is consistent with the agent's current cycle time of 5-10 minutes. In the session manager, I see the updatesubscription API blocked while trying to perform the following update:

update t_dps_subs_dataset set param= :param where destination = :destination and dataset = (select id from t_dps_dataset where name = :dataset)

While BlockAllocator is busy in the following query:

select bd.destination, n.name destination_name,
       b.dataset dataset, b.id block, b.name block_name, b.is_open,
       sp.priority subs_priority, s.is_move subs_move,
       s.time_create subs_create, s.time_complete subs_complete,
       s.time_done subs_done, s.time_suspend_until subs_suspend,
       bd.priority bd_priority, bd.state bd_state,
       bd.time_subscription bd_subscrption, bd.time_create bd_create,
       bd.time_active bd_active, bd.time_complete bd_complete,
       bd.time_suspend_until bd_suspend,
       nvl(br.node_files,0) node_files, nvl(br.src_files,0) src_files,
       b.files exist_files
  from t_dps_block_dest bd
  join t_adm_node n on n.id = bd.destination
  join t_dps_block b on b.id = bd.block
  join t_dps_subs_block s on s.destination = bd.destination and s.block = bd.block
  join t_dps_subs_param sp on sp.id = s.param
  left join t_dps_block_replica br on br.node = bd.destination and br.block = bd.block
 where (b.is_open = 'n' and br.node_files >= b.files and bd.state != 3)
    or (nvl(br.node_files,0)<b.files and bd.state = 3)
    or (bd.priority != sp.priority)
    or ((bd.state = 2 or bd.state = 4) and (bd.time_suspend_until is null or bd.time_suspend_until <= :now))
    or (bd.state <= 2 and nvl(trunc(bd.time_suspend_until),-1) != nvl(trunc(s.time_suspend_until),-1))
    or (bd.state < 2 and bd.time_suspend_until is not null and bd.time_suspend_until > :now)

nikmagini commented 7 years ago

Fix released in 4.2.0

The BlockAllocator agent cycle is now split into two SQL transactions: one to update the t_dps_subs_* tables and another for the t_dps_block_dest table, so the website can update subscriptions while the agent is working on the block destinations.
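
The idea of committing between the two phases can be sketched as follows. This is not the agent's code: the real agent runs against Oracle, whereas this illustration uses Python's `sqlite3` with two simplified, hypothetical tables (`subs`, `block_dest`) and hypothetical function names. It only shows that work done in the first transaction is committed, and its locks released, before the second phase starts.

```python
import sqlite3


def run_cycle(conn, update_subscriptions, update_block_dests):
    """Run one agent cycle as two separate transactions (sketch)."""
    # Transaction 1: subscription tables. Leaving the `with` block
    # commits, so locks taken here are released before phase 2.
    with conn:
        update_subscriptions(conn)
    # Transaction 2: block destinations. A failure here rolls back
    # only this phase; the subscription changes stay committed.
    with conn:
        update_block_dests(conn)


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified stand-ins for the real schema (assumption).
    conn.execute("create table subs (id integer primary key, priority integer)")
    conn.execute("create table block_dest (id integer primary key, priority integer)")
    conn.execute("insert into subs values (1, 0)")
    conn.execute("insert into block_dest values (1, 0)")
    conn.commit()

    def update_subscriptions(c):
        c.execute("update subs set priority = 1")

    def failing_dest_update(c):
        c.execute("update block_dest set priority = 1")
        raise RuntimeError("simulated failure in second phase")

    try:
        run_cycle(conn, update_subscriptions, failing_dest_update)
    except RuntimeError:
        pass

    subs = conn.execute("select priority from subs").fetchone()[0]
    dest = conn.execute("select priority from block_dest").fetchone()[0]
    return subs, dest
```

In this sketch `demo()` returns `(1, 0)`: the subscription update survives because it was committed on its own, while the failed destination update is rolled back. The trade-off of the split is that the two phases are no longer atomic with respect to each other.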