dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

Oracle deadlock in JobStatusLite #5213

Closed amaltaro closed 6 years ago

amaltaro commented 10 years ago

Creating this issue so we do not lose track of it.

Here is part the traceback got back from the JobStatusLite logs

2014-07-09 11:29:16,067:ERROR:StatusPoller:Unhandled error in statusPoller(DatabaseError) ORA-00060: deadlock detected while waiting for resource
 'UPDATE wmbs_job\n          SET location = COALESCE((SELECT id FROM wmbs_location WHERE cms_name = :location), location)\n          WHERE id = :jobID' [{'location': 'T2_ES_IFCA', '
jobID': 1586437}, {'location': 'T2_ES_IFCA', 'jobID': 1586439}, {'location': 'T2_ES_IFCA', 'jobID': 1587964}, {'location': 'T2_ES_IFCA', 'jobID': 1587965}, {'location': 'T2_ES_IFCA'
, 'jobID': 1588495}, {'location': 'T2_ES_IFCA', 'jobID': 1589503}, {'location': 'T2_ES_IFCA', 'jobID': 1616563}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616564}, {'location': 'T2_ES
_IFCA', 'jobID': 1616565}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616566}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616567}, {'location': 'T2_IT_Pisa', 'jobID': 1616918}, {'location
': 'T2_US_Nebraska', 'jobID': 1618186}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1618817}]
Traceback (most recent call last):
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 65, in algorithm
    self.checkStatus()
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 91, in checkStatus
    runningJobs = self.bossAir.track()
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 564, in track
    self._updateJobs(jobs = jobsToChange)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 307, in _updateJobs
    self.stateMachine.recordLocationChange(jobsWithLocation)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/JobStateMachine/ChangeState.py", line 538, in recordLocationChange
    transaction = self.existingTransaction())
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/WMBS/MySQL/Jobs/UpdateLocation.py", line 34, in execute
    transaction = transaction)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 167, in processData
    returnCursor=returnCursor))
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 114, in executemanybinds
    result = connection.execute(s, b)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text
    return self.__execute_context(context)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 894, in __execute_context
    self._cursor_executemany(context.cursor, context.statement, context.parameters, context=context)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 960, in _cursor_executemany
    self._handle_dbapi_exception(e, statement, parameters, cursor, context)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_excepti
on
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
DatabaseError: (DatabaseError) ORA-00060: deadlock detected while waiting for resource
 'UPDATE wmbs_job\n          SET location = COALESCE((SELECT id FROM wmbs_location WHERE cms_name = :location), location)\n          WHERE id = :jobID' [{'location': 'T2_ES_IFCA', '
jobID': 1586437}, {'location': 'T2_ES_IFCA', 'jobID': 1586439}, {'location': 'T2_ES_IFCA', 'jobID': 1587964}, {'location': 'T2_ES_IFCA', 'jobID': 1587965}, {'location': 'T2_ES_IFCA'
, 'jobID': 1588495}, {'location': 'T2_ES_IFCA', 'jobID': 1589503}, {'location': 'T2_ES_IFCA', 'jobID': 1616563}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616564}, {'location': 'T2_ES_IFCA', 'jobID': 1616565}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616566}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616567}, {'location': 'T2_IT_Pisa', 'jobID': 1616918}, {'location': 'T2_US_Nebraska', 'jobID': 1618186}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1618817}]
2014-07-09 11:29:16,223:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMCore.BossAir.StatusPoller.StatusPoller instance at 0x1272ab8> StatusPollerException
Message: Unhandled error in statusPoller(DatabaseError) ORA-00060: deadlock detected while waiting for resource
 'UPDATE wmbs_job\n          SET location = COALESCE((SELECT id FROM wmbs_location WHERE cms_name = :location), location)\n          WHERE id = :jobID' [{'location': 'T2_ES_IFCA', 'jobID': 1586437}, {'location': 'T2_ES_IFCA', 'jobID': 1586439}, {'location': 'T2_ES_IFCA', 'jobID': 1587964}, {'location': 'T2_ES_IFCA', 'jobID': 1587965}, {'location': 'T2_ES_IFCA', 'jobID': 1588495}, {'location': 'T2_ES_IFCA', 'jobID': 1589503}, {'location': 'T2_ES_IFCA', 'jobID': 1616563}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616564}, {'location': 'T2_ES_IFCA', 'jobID': 1616565}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616566}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1616567}, {'location': 'T2_IT_Pisa', 'jobID': 1616918}, {'location': 'T2_US_Nebraska', 'jobID': 1618186}, {'location': 'T2_FR_CCIN2P3', 'jobID': 1618817}]
        ModuleName : WMCore.BossAir.StatusPoller
        MethodName : algorithm
        ClassInstance : None
        FileName : /data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py
        ClassName : None
        LineNumber : 78
        ErrorNr : 0
Traceback: 
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 65, in algorithm
    self.checkStatus()

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 91, in checkStatus
    runningJobs = self.bossAir.track()

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 564, in track
    self._updateJobs(jobs = jobsToChange)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 307, in _updateJobs
    self.stateMachine.recordLocationChange(jobsWithLocation)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/JobStateMachine/ChangeState.py", line 538, in recordLocationChange
    transaction = self.existingTransaction())

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/WMBS/MySQL/Jobs/UpdateLocation.py", line 34, in execute
    transaction = transaction)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 167, in processData
    returnCursor=returnCursor))

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 114, in executemanybinds
    result = connection.execute(s, b)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute
    return Connection.executors[c](self, object, multiparams, params)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text
    return self.__execute_context(context)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 894, in __execute_context
    self._cursor_executemany(context.cursor, context.statement, context.parameters, context=context)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 960, in _cursor_executemany
    self._handle_dbapi_exception(e, statement, parameters, cursor, context)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)

  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 163, in __call__
    self.algorithm(parameters)
  File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 78, in algorithm
amaltaro commented 10 years ago

@ticoann , I would raise it to high prio. It happens quite often and it brings the component down (so wmstats does not get updated). We should try this one for the next stable release.

ticoann commented 10 years ago

It is not exactly the same place but there is deadlock issue found in ErrorHandler as well

2014-10-10 00:08:15,040:ERROR:BaseWorkerThread:Error in worker algorithm (1): Backtrace: <WMComponent.ErrorHandler.ErrorHandlerPoller.ErrorHandlerPoller instance at 0x19d9b00> ErrorHandlerException Message: Caught exception in ErrorHandler (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'DELETE FROM wmbs_sub_files_acquired\n WHERE subscription = %s AND fileid = %s' (5823L, 591602L)Traceback (most recent call last): File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 376, in algorithm self.handleErrors() File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 310, in handleErrors self.handleRetryDoneJobs(jobList) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 268, in handleRetryDoneJobs self.exhaustJobs(jobList) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 131, in exhaustJobs job.failInputFiles() File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/WMBS/Job.py", line 377, in failInputFiles transaction = self.existingTransaction) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/WMBS/MySQL/Jobs/FailInput.py", line 48, in execute transaction = transaction) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 179, in processData returnCursor=returnCursor) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/Database/MySQLCore.py", line 127, in executebinds return DBInterface.executebinds(self, s, b, connection, returnCursor) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 65, in executebinds resultProxy = connection.execute(s, b) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute return Connection.executors[c](self, object, multiparams, params) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text return self.__execute_context(context) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 896, in __execute_context self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 950, in _cursor_execute self._handle_dbapi_exception(e, statement, parameters, cursor, context) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect) OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'DELETE FROM wmbs_sub_files_acquired\n WHERE subscription = %s AND fileid = %s' (5823L, 591602L)

amaltaro commented 10 years ago

Yuyi, Julian just created an elog reporting another deadlock with the same table (but with another component - JobTracker). It happened in cmssrv98 (local mysql database, may be easier for you to debug issues in a local db).

There is a very big error message in the component, but part of it is (hope that helps a bit):

2014-10-20 07:14:52,772:INFO:JobTrackerPoller:18 jobs are complete in BossAir
2014-10-20 07:16:33,140:ERROR:JobTrackerPoller:Unknown exception in JobTracker!
(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE wmbs_job SET fwjr_path = %s WHERE id = %s' [('/storage/local/data1/cmsdataops/srv/wmagent/v0.9.95b/install/wmagent/JobCreator/JobCache/pdmvserv_t
ask_SMP-Summer12WMLHE-00009__v1_T_141019_144522_1077/SMP-Summer12WMLHE-00009_0/JobCollection_85220_0/job_805359/Report.0.pkl', 805359L), 
....
....
        ModuleName : WMComponent.JobTracker.JobTrackerPoller
        MethodName : algorithm
        ClassInstance : None
        FileName : /data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py
        ClassName : None
        LineNumber : 110
        ErrorNr : 0

Traceback: 
  File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 97, in algorithm
    self.trackJobs()

  File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 151, in trackJobs
    self.passJobs(passedJobs)
...
...
yuyiguo commented 10 years ago

HI, Alan:

Could you give me access to cmssrv98? Thanks, Yuyi

From: Alan Malta Rodrigues [mailto:notifications@github.com] Sent: Monday, October 20, 2014 8:31 AM To: dmwm/WMCore Cc: Yuyi Guo Subject: Re: [WMCore] Oracle deadlock in JobStatusLite (#5213)

Yuyi, Julian just created an elog reporting another deadlock with the same table (but with another component). It happened in cmssrv98 (local mysql database, may be easier for you to debug issues in a local db).

There is a very big error message in the component, but part of it is (hope that helps a bit):

2014-10-20 07:14:52,772:INFO:JobTrackerPoller:18 jobs are complete in BossAir

2014-10-20 07:16:33,140:ERROR:JobTrackerPoller:Unknown exception in JobTracker!

(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE wmbs_job SET fwjr_path = %s WHERE id = %s' [('/storage/local/data1/cmsdataops/srv/wmagent/v0.9.95b/install/wmagent/JobCreator/JobCache/pdmvserv_t

ask_SMP-Summer12WMLHE-00009__v1_T_141019_144522_1077/SMP-Summer12WMLHE-00009_0/JobCollection_85220_0/job_805359/Report.0.pkl', 805359L),

....

....

    ModuleName : WMComponent.JobTracker.JobTrackerPoller

    MethodName : algorithm

    ClassInstance : None

    FileName : /data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py

    ClassName : None

    LineNumber : 110

    ErrorNr : 0

Traceback:

File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 97, in algorithm

self.trackJobs()

File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 151, in trackJobs

self.passJobs(passedJobs)

...

...

— Reply to this email directly or view it on GitHubhttps://github.com/dmwm/WMCore/issues/5213#issuecomment-59753617.

amaltaro commented 10 years ago

Sent you a private mail.

yuyiguo commented 10 years ago

Thanks Alan.

I have contacted Kate for checking the trace file of Oracle DB. It is very hard for her to trace down the problem because we are using the load balancing. Can we change the DB connection to a fixed node?

Here is the tns name that we can connect to: CMSR_NOLB=( DESCRIPTION= (ADDRESS= (PROTOCOL=TCP) (HOST=cmsr2-s.cern.ch) (PORT=10121) ) (LOAD_BALANCE=on) (ENABLE=BROKEN) (CONNECT_DATA= (SERVER=DEDICATED) (SERVICE_NAME=cmsr_nolb.cern.ch) ) )

amaltaro commented 10 years ago

I guess it's the same problem hitting oracle, mysql and mariadb here, isn't it? If so, don't you prefer to "debug" it in a local DB (FNAL agents)?

If you really want to debug it on the Oracle side, then I would prefer to setup a testbed agent with a preprod oracle instance (though we basically have no resources at CERN at this moment).

yuyiguo commented 10 years ago

Starting from version 5.6.2 mysql can be configured to log all the deadlocks (http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_print_all_deadlocks). We have mysql version 5.1.58. According to the mysql manuel, that SHOW ENGINE INNODB STATUS can show the last deadlock, but I did not see any deadlock records while running this command on cmssrv98. Maybe the reported deadlock was on a different mysql or our version was too old to record any deadlock?

Anyway, it does not look good to me for using mysql to debug the problem unless we move to 5.6.2 or higher version.

If we can change the connection to cms_wmbs_prod5 that is the oracle DB has most data to a fixed oracle connection, then we can ask the DBA to send us Oracle trace files to have a closed look at the deadlocks. She had said that with the current connection configuration, she could not do anything about the deadlock debugging. Changing the connection will not affect any DB operation and we will continue to write to the same DB, but we need to drop all the current connections and recreate the connections.

Create a test stand to write to the test db may not help us to get the deadlock unless there are enough sessions to update the WMBS_JOB table simultaneously. Any other ideas?

yuyiguo commented 10 years ago

The MariaDB we have been testing is equivalent to mysql 5.5 as Seangchan pointed to me. And that is the current stable version.

ticoann commented 10 years ago

Sorry Yuyi, I gave you wrong information, It seems there are stable version for 10.0.14. https://downloads.mariadb.org/ And our testing version is 10.0.12 https://github.com/cms-sw/cmsdist/blob/comp_gcc481/mariadb.spec

However, no production agent is using MariaDB yet.

hufnagel commented 10 years ago

The MariaDB 10 series is not entirely based on MySQL 5.5. It's a mix of 5.5 and 5.6 and not directly comparable anymore. The innodb_print_all_deadlocks parameter was introduced in MariaDB 5.5.30 btw, so it should definitely be in 10.

yuyiguo commented 10 years ago

Thanks Dirk.

Can someone let me know on which node we are testing/using MariaDB? I will take a look what we have right now.

yuyiguo commented 10 years ago

@amaltaro Alan: the commandline switch is --innodb-print-all-deadlocks=1. See https://mariadb.com/kb/en/mariadb/documentation/storage-engines/xtradb-and-innodb/xtradbinnodb-server-system-variables/#innodb_print_all_deadlocks for details.

amaltaro commented 10 years ago

Thanks Yuyi! I applied this change innodb_print_all_deadlocks=1 to cmsgwms-submit1 and cmssrv219 mysql conf. Components and services were restarted.

yuyiguo commented 9 years ago

InnoDB error log show two deadlocks recorded after Alan set innodb_print_all_deadlocks=1 for the server. However, the log still not enough for debugging the problem. We have to turn on a several more switches in order to get more for the deadlock log. The current log just recorded part of the deadlock, we cannot build the whole picture. The wmagent log only logged one of the DMLs that caused the deadlock. So it did not give us the other half either.

Here is a list of things we need to turn on to get more info when deadlock happens:

  1. Enable InnoDB Lock Monitor (http://dev.mysql.com/doc/refman/5.5/en/innodb-enabling-monitors.html) CREATE TABLE innodb_lock_monitor (a INT) ENGINE=INNODB; to disable: DROP TABLE innodb_lock_monitor;
  2. Run "mysqladmin debug" (http://dev.mysql.com/doc/refman/5.5/en/mysqladmin.html)
  3. Using the InnoDB Plugin (http://dev.mysql.com/doc/refman/5.5/en/innodb-i_s-tables.html) to monitor the innoDB : 3a information on install the plugin: http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-plugin-installation-dynamic-posix.html 3b Run sql: SELECT * FROM INFORMATION_SCHEMA.INNODB_LOCKS, INNODB_LOCK_WAITS, and INNODB_TRX tables

After we are done the above steps, we may get all the necessary logs for the deadlock. However, I am not sure how long the certain info will stay in the information_schema tables or not overwrite by new coming.

yuyiguo commented 9 years ago

Copied the log files here so we have all the reference in the issue.

Mysql error log:

2014-11-11 17:25:49 7f7e78681700InnoDB: transactions deadlock detected, dumping detailed information. 2014-11-11 17:25:49 7f7e78681700 * (1) TRANSACTION: TRANSACTION 10821930, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 3 lock struct(s), heap size 1184, 9 row lock(s), undo log entries 8 MySQL thread id 290167, OS thread handle 0x7f7e781e9700, query id 291873984 localhost cmsdataops updating UPDATE bl_runjob SET status = '0' WHERE id = 969501 * (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 51 page no 2699 n bits 352 index PRIMARY of table wmagent.bl_runjob trx id 10821930 lock_mode X locks rec but not gap waiting * (2) TRANSACTION: TRANSACTION 10821928, ACTIVE 1 sec starting index read mysql tables in use 1, locked 1 4 lock struct(s), heap size 1184, 99 row lock(s), undo log entries 98 MySQL thread id 290188, OS thread handle 0x7f7e78681700, query id 291874030 localhost cmsdataops updating UPDATE bl_runjob SET status = '0' WHERE id = 969487 * (2) HOLDS THE LOCK(S): RECORD LOCKS space id 51 page no 2699 n bits 352 index PRIMARY of table wmagent.bl_runjob trx id 10821928 lock_mode X locks rec but not gap * (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 51 page no 2699 n bits 352 index PRIMARY of table wmagent.bl_runjob trx id 10821928 lock_mode X locks rec but not gap waiting * WE ROLL BACK TRANSACTION (1)

ComponentLog

2014-11-11 17:25:50,771:INFO:Harness:>>>Terminating worker threads 2014-11-11 17:25:50,837:ERROR:BaseWorkerThread:Error in event loop (2): <WMCore.BossAir.StatusPoller.StatusPoller instance at 0x2f4b908> StatusPollerException Message: Unhandled error in statusPoller(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') "UPDATE bl_runjob SET status = '0' WHERE id = %s" [(969487L,), (969488L,), (969497L,), (969489L,), (969490L,), (969493L,), (969494L,), (969496L,) ... displaying 10 of 95 total bound parameter sets ... (969480L,), (969481L,)] ModuleName : WMCore.BossAir.StatusPoller MethodName : algorithm ClassInstance : None FileName : /data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py ClassName : None LineNumber : 78 ErrorNr : 0

Traceback: File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 65, in algorithm self.checkStatus()

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 91, in checkStatus runningJobs = self.bossAir.track()

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 566, in track self._complete(jobs = jobsToComplete)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 618, in _complete transaction = self.existingTransaction())

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/MySQL/CompleteJob.py", line 41, in execute transaction = transaction)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 167, in processData returnCursor=returnCursor))

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/MySQLCore.py", line 140, in executemanybinds returnCursor)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 114, in executemanybinds result = connection.execute(s, b) File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 719, in execute return self._execute_text(object, multiparams, params)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 868, in _execute_text statement, parameters

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context context)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception exc_info

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause reraise(type(exception), exception, tb=exc_tb)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 924, in _execute_context context)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/connectors/mysqldb.py", line 67, in do_executemany rowcount = cursor.executemany(statement, parameters)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-mysqldb/1.2.4b4/lib/python2.6/site-packages/MySQLdb/cursors.py", line 234, in executemany r = r + self.execute(query, a)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-mysqldb/1.2.4b4/lib/python2.6/site-packages/MySQLdb/cursors.py", line 202, in execute self.errorhandler(self, exc, value)

File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-mysqldb/1.2.4b4/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue

Backtrace: File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 198, in call raise ex

2014-11-11 17:25:50,838:INFO:BaseWorkerThread:Worker thread <WMCore.BossAir.StatusPoller.StatusPoller instance at 0x2f4b908> terminated

Analysis/my understood

  1. The mysql deadlock log did not give us the whole picture what exactly caused the deadlock, but it pointed out two update statements:
    • _UPDATE blrunjob SET status = '0' WHERE id = 969501
    • _UPDATE blrunjob SET status = '0' WHERE id = 969487
  2. The ComponentLog show below sql caused the deadlock: *_UPDATE blrunjob SET status = '0' WHERE id = %s" [(969487L,), (969488L,), (969497L,), (969489L,), (969490L,), (969493L,), (969494L,), (969496L,) ... displaying 10 of 95 total bound parameter sets ... (969480L,), (969481L,)] This is a error message from sqlalchemy.
  3. The SQL code that updates the status is:
    • _UPDATE blrunjob SET status = '0' WHERE id = :id
  4. Even through the code did bind 95 ids to run executemany API to update the status in a one shot as one hoped, the actual DML has to execute one id at a time in mysql. This means that the agent had to request 95 or more lockers one at a time. This is a very high possibility to cause a dead lock.
  5. To improve it, we can either continue to use the same sql, but update, then commit one row or a few rows at a time instead of 95 or more rows. Or we can rewrite the sql to be something like:
    • _UPDATE blrunjob SET status = '0' WHERE id in (n1,n2...); The the list (n1,n2...) can be generated on fly.
  6. This will reduce the deadlock. the really problem with the deadlock is more in the schema that we may not want to touch it?
  7. There are more than one kind of deadlocks. Here is another one in the same error log:

2014-10-28 23:13:17 7f7e783a2700InnoDB: transactions deadlock detected, dumping detailed information. 2014-10-28 23:13:17 7f7e783a2700 * (1) TRANSACTION: TRANSACTION 2908822, ACTIVE 37 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 16 lock struct(s), heap size 2936, 23 row lock(s), undo log entries 15 MySQL thread id 215830, OS thread handle 0x7f7e781e9700, query id 71363435 localhost cmsdataops updating UPDATE wmbs_fileset SET open = 1, last_update = 1414555997 WHERE name = '/dmason_HCA-Spring14dr-Backfill-00002_00187_v1__141022_060213_3297/StepOneProc/unmerged-logArchive' * (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 4 page no 5 n bits 168 index PRIMARY of table wmagent.wmbs_fileset trx id 2908822 lock_mode X locks rec but not gap waiting * (2) TRANSACTION: TRANSACTION 2908821, ACTIVE 38 sec inserting mysql tables in use 2, locked 1 17 lock struct(s), heap size 2936, 108 row lock(s), undo log entries 495 MySQL thread id 217627, OS thread handle 0x7f7e783a2700, query id 71363437 localhost cmsdataops Sending data INSERT INTO wmbs_fileset_files (fileid, fileset, insert_time) SELECT id, 13, 1414555997 FROM wmbs_file_details WHERE lfn = '/store/unmerged/Spring14drBackfill/QCD_Pt-5to3000_Tune4C_Flat_13TeV_pythia8/GEN-SIM-RECO/BACKFILL-v6/10076/7ADA4756-1D5F-E411-92E2-7845C4FC3B18.root' * (2) HOLDS THE LOCK(S): RECORD LOCKS space id 4 page no 5 n bits 168 index PRIMARY of table wmagent.wmbs_fileset trx id 2908821 lock mode S locks rec but not gap * (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 4 page no 5 n bits 168 index PRIMARY of table wmagent.wmbs_fileset trx id 2908821 lock mode S locks rec but not gap waiting * WE ROLL BACK TRANSACTION (1)

yuyiguo commented 9 years ago

@amaltaro

After carefully looking into the mysql on cmsgwms-submit1.fnal.gov, it seemed that the new mysql plugin was already in the current mysql server because the three tables that the plugin brings in is already in the DB, however, there is no data in the table. Reading online(http://www.chriscalender.com/debugging-innodb-locks-using-the-new-innodb-plugin%E2%80%99s-information-schema-tables/), we need to change the config file and restart mysql.

Alan, Could please do below?

  1. ) Config the InnoDB plugin (just added the following 2 lines to mysql config file and restarted mysqld): ignore_builtin_innodb plugin-load=innodb=ha_innodb_plugin.dll;innodb_trx=ha_innodb_plugin.dll; innodb_locks=ha_innodb_plugin.dll;innodb_lock_waits=ha_innodb_plugin.dll; innodb_cmp=ha_innodb_plugin.dll;innodb_cmp_reset=ha_innodb_plugin.dll; innodb_cmpmem=ha_innodb_plugin.dll;innodb_cmpmem_reset=ha_innodb_plugin.dll

2.) Enabling the InnoDB Lock Monitor: CREATE TABLE innodb_lock_monitor (a INT) ENGINE=INNODB;

3.)Run "mysqladmin debug" Tell the server to write debug in the error log.

Then we will wait for the deadlock. When we get the deadlock, please let me know and I will need to get the lock info from the tables before we can restart the agents.

ericvaandering commented 9 years ago

If we get useful info out of this, I assume we can put this in the spec file for the future?

amaltaro commented 9 years ago

Yuyi, I do not know where you see 2 lines only :-) I added the lines you provided in 1) to this file: /data/srv/wmagent/current/config/mysql/my.cnf but then I'm not able to start mysql, it gets stuck with this output

Starting mysql...
starting mysqld_safe...
Checking MySQL Socket file exists...

BTW, don't we use dll libraries for windows only? Maybe there are different setups for Linux box? BTW 2, since cmsgwms-submit1 is now being used for real production, I'm trying to replicate these configurations in our testing box (cmssrv113).

Just in case, I also paste here a diff between the original file and the new one:

[cmsdataops@cmssrv113 current]$ diff -rup config/mysql/my.cnf{.ori,}
--- config/mysql/my.cnf.ori 2014-12-10 10:00:45.786283947 -0600
+++ config/mysql/my.cnf 2014-12-10 10:04:09.009188281 -0600
@@ -34,6 +34,18 @@ innodb_checksums=0
 innodb_doublewrite=0
 innodb_locks_unsafe_for_binlog = true

+# write deadlock information to the error log
+innodb_print_all_deadlocks=1
+ignore_builtin_innodb 
+plugin-load=innodb=ha_innodb_plugin.dll
+innodb_trx=ha_innodb_plugin.dll
+innodb_locks=ha_innodb_plugin.dll
+innodb_lock_waits=ha_innodb_plugin.dll
+innodb_cmp=ha_innodb_plugin.dll
+innodb_cmp_reset=ha_innodb_plugin.dll
+innodb_cmpmem=ha_innodb_plugin.dll
+innodb_cmpmem_reset=ha_innodb_plugin.dll
yuyiguo commented 9 years ago

Alan, sorry for the trouble. Anything is current running on cmssrv113? If not, could you give me access to the machine? I will figure out it using the testbed before I ask you try it on cmsgwms-submit1.

amaltaro commented 9 years ago

Julian is running a test there, if possible we should try not to kill that, but if we do... then it's not the end of the world :-) cmssrv113 is at FNAL, so it uses the usual cmsdataops user. If that does not work for you, talk to Krista plz.

yuyiguo commented 9 years ago

NP. I can wait. Julian, please let me know when the test is done.

amaltaro commented 9 years ago

It seems the machine is free now. Go ahead.

yuyiguo commented 9 years ago

I setup mariadb on cmssrv113 with the new configureation, disable build-in innodb and enabled plugin innodb. These changes include: removing some of the deprecated gloabl values and changing to new set of parameters and turning off autocommit. By default, mariadb is autocommit on. I don't know how the wmagent API control the transactions. If the autocommit is on, all the DMLs comitted as soon as thay are executed? This will increase the chance of deadlock.

I also ran mysqladmin by hand to send debug info into error log. Due to all the logs and monitoring, the db server will run slower than normal.

Julian, could you please run some tests on this node? In case you need to resart the db server, I have to run some commands manully before you run the tests.

Let me know if you see any differece during the test regarding the speed of db access or any other issues? I will examin the logs during the tests.

amaltaro commented 9 years ago

@yuyiguo thanks for setting up cmssrv113 to deadlock debug mode. Just a reminder:

I've just managed to replicate the/a deadlock in vocms0304, JobStatusLite component is dead at the moment and I'll wait for you to look at it. I'm sending you more details in a private mail.

yuyiguo commented 9 years ago

Oracle deadlock was reproduced and trace file showed that the problem was caused by missing indexes on foreign keys. So I have added below indexes. CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_WMBS ON CMS_WMBS_PROD9.BL_RUNJOB(WMBS_ID); CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_LOCATION ON CMS_WMBS_PROD9.BL_RUNJOB(LOCATION); CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_STATUS ON CMS_WMBS_PROD9.BL_RUNJOB(SCHED_STATUS); CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_USERS ON CMS_WMBS_PROD9.BL_RUNJOB(USER_ID);

Alan will test the change by reproduce the deadlock.

amaltaro commented 7 years ago

I haven't seen this happening again in Oracle. Since we have mostly used CERN agents for the last half year, I haven't seen this in MySQL as well. I actually don't remember if we managed to fix anything on the mysql backend.