dmwm / CRABServer

15 stars 38 forks source link

hardcode ASO worker in ASO Publisher query #5830

Closed belforte closed 5 years ago

belforte commented 5 years ago

Introducing a second ASO publishers for asoworker=asoless resulted in Oracle not being able to optimize the execution plan fo the query and when asoworker=asoprody query time explode. See discussion upon report from CERN DBA in: https://cern.service-now.com/service-portal/view-incident.do?n=INC1930071&view=itil&width=default

Here's relevant text

Mail from Frank:
From: Franck Pachot
Sent: 28 February 2019 14:44
To: Todor Trendafilov Ivanov
Subject: Long queries and contention on CMSR database, schema cms_analysis_reqmgr
Hi Todor,
This morning we have seen big contention on CMSR database - schema cms_analysis_reqmgr_r for wich you are registered as the application owner.
There was several concurrent executions of the following query:
SELECT f.tm_publication_state, f.tm_transfer_state, f.tm_aso_worker, f.tm_taskname, f.tm_username, f.tm_destination, f.tm_source_lfn, f.tm_destination_lfn, f.tm_last_update, t.tm_user_role, t.tm_user_group, t.tm_input_dataset, t.tm_cache_url,t.tm_dbs_url
FROM filetransfersdb f LEFT OUTER JOIN tasks t ON t.tm_taskname = f.tm_taskname
WHERE f.tm_username = :username AND f.tm_publication_state = :state AND f.tm_transfer_state = :transfer_state AND f.tm_aso_worker = :asoworker AND rownum < :limit ORDER BY rownum
which all failed after hours exhausting all cluster resources:
It seems that a different execution plan was used because the execution that parsed the query was running with :ASOWORKER='asoless' which has few rows and the other executions running with with :ASOWORKER='asoprod1'
For columns that have very few different values but very different data distribution, it should be passed as a literal value in the query and not as a bind variable.

When asked more specifically as to what to change, Frank offered the following:

>> QUESTION: how exactly should the SQL be changed ?
This should be tested of course, but probably passing the ASOWORKER as a literal (the characer string in the SQL statement) rqather than as a bind variable would be better. The the query will be optimized for each value rather than having one execution plan that cover both (and maybe the plan optimized for one value os not optimal for the other value)
belforte commented 5 years ago

DB status became OK after Diego turned off the DBSPublisher for I discussed this with @lecriste @dciangot and @todor-ivanov and we decided to leave current query to ASOLESS since it works and to create a new, ad hoc, one to be used for asoprod1 and to deploy the modified REST only on a dedicated REST instance to be used only for asoprod1 for a week or so until it is drained. @mmascher if you have the time and strength to read, understand, and in case advice... thanks !

belforte commented 5 years ago

this is only being done in ad-hoc branch https://github.com/dmwm/CRABServer/tree/ForAsoprod1

If we ever want to use multiple asoworker names we may have to do something, but not clear what.

belforte commented 5 years ago

this was solved by creating the ad-hoc branch and using it in a server dedicated to remaining ASOv1 work. ASOv1 is now off, forever.