Closed amaltaro closed 6 years ago
@ticoann , I would raise it to high prio. It happens quite often and it brings the component down (so wmstats does not get updated). We should try this one for the next stable release.
It is not exactly the same place but there is deadlock issue found in ErrorHandler as well
2014-10-10 00:08:15,040:ERROR:BaseWorkerThread:Error in worker algorithm (1): Backtrace: <WMComponent.ErrorHandler.ErrorHandlerPoller.ErrorHandlerPoller instance at 0x19d9b00> ErrorHandlerException Message: Caught exception in ErrorHandler (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'DELETE FROM wmbs_sub_files_acquired\n WHERE subscription = %s AND fileid = %s' (5823L, 591602L)Traceback (most recent call last): File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 376, in algorithm self.handleErrors() File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 310, in handleErrors self.handleRetryDoneJobs(jobList) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 268, in handleRetryDoneJobs self.exhaustJobs(jobList) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMComponent/ErrorHandler/ErrorHandlerPoller.py", line 131, in exhaustJobs job.failInputFiles() File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/WMBS/Job.py", line 377, in failInputFiles transaction = self.existingTransaction) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/WMBS/MySQL/Jobs/FailInput.py", line 48, in execute transaction = transaction) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 179, in processData returnCursor=returnCursor) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/Database/MySQLCore.py", line 127, in executebinds return DBInterface.executebinds(self, s, b, connection, returnCursor) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/cms/wmagent/0.9.95b.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 65, in executebinds resultProxy = connection.execute(s, b) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute return Connection.executors[c](self, object, multiparams, params) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text return self.__execute_context(context) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 896, in __execute_context self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 950, in _cursor_execute self._handle_dbapi_exception(e, statement, parameters, cursor, context) File "/data/srv/wmagent/v0.9.95b.patch2/sw.pre/slc6_amd64_gcc481/external/py2-sqlalchemy/0.5.2-comp2/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect) OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'DELETE FROM wmbs_sub_files_acquired\n WHERE subscription = %s AND fileid = %s' (5823L, 591602L)
Yuyi, Julian just created an elog reporting another deadlock with the same table (but with another component - JobTracker). It happened in cmssrv98 (local mysql database, may be easier for you to debug issues in a local db).
There is a very big error message in the component, but part of it is (hope that helps a bit):
2014-10-20 07:14:52,772:INFO:JobTrackerPoller:18 jobs are complete in BossAir
2014-10-20 07:16:33,140:ERROR:JobTrackerPoller:Unknown exception in JobTracker!
(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE wmbs_job SET fwjr_path = %s WHERE id = %s' [('/storage/local/data1/cmsdataops/srv/wmagent/v0.9.95b/install/wmagent/JobCreator/JobCache/pdmvserv_t
ask_SMP-Summer12WMLHE-00009__v1_T_141019_144522_1077/SMP-Summer12WMLHE-00009_0/JobCollection_85220_0/job_805359/Report.0.pkl', 805359L),
....
....
ModuleName : WMComponent.JobTracker.JobTrackerPoller
MethodName : algorithm
ClassInstance : None
FileName : /data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py
ClassName : None
LineNumber : 110
ErrorNr : 0
Traceback:
File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 97, in algorithm
self.trackJobs()
File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 151, in trackJobs
self.passJobs(passedJobs)
...
...
HI, Alan:
Could you give me access to cmssrv98? Thanks, Yuyi
From: Alan Malta Rodrigues [mailto:notifications@github.com] Sent: Monday, October 20, 2014 8:31 AM To: dmwm/WMCore Cc: Yuyi Guo Subject: Re: [WMCore] Oracle deadlock in JobStatusLite (#5213)
Yuyi, Julian just created an elog reporting another deadlock with the same table (but with another component). It happened in cmssrv98 (local mysql database, may be easier for you to debug issues in a local db).
There is a very big error message in the component, but part of it is (hope that helps a bit):
2014-10-20 07:14:52,772:INFO:JobTrackerPoller:18 jobs are complete in BossAir
2014-10-20 07:16:33,140:ERROR:JobTrackerPoller:Unknown exception in JobTracker!
(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE wmbs_job SET fwjr_path = %s WHERE id = %s' [('/storage/local/data1/cmsdataops/srv/wmagent/v0.9.95b/install/wmagent/JobCreator/JobCache/pdmvserv_t
ask_SMP-Summer12WMLHE-00009__v1_T_141019_144522_1077/SMP-Summer12WMLHE-00009_0/JobCollection_85220_0/job_805359/Report.0.pkl', 805359L),
....
....
ModuleName : WMComponent.JobTracker.JobTrackerPoller
MethodName : algorithm
ClassInstance : None
FileName : /data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py
ClassName : None
LineNumber : 110
ErrorNr : 0
Traceback:
File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 97, in algorithm
self.trackJobs()
File "/data/srv/wmagent/v0.9.95b/sw/slc5_amd64_gcc461/cms/wmagent/0.9.95b/lib/python2.6/site-packages/WMComponent/JobTracker/JobTrackerPoller.py", line 151, in trackJobs
self.passJobs(passedJobs)
...
...
— Reply to this email directly or view it on GitHubhttps://github.com/dmwm/WMCore/issues/5213#issuecomment-59753617.
Sent you a private mail.
Thanks Alan.
I have contacted Kate for checking the trace file of Oracle DB. It is very hard for her to trace down the problem because we are using the load balancing. Can we change the DB connection to a fixed node?
Here is the tns name that we can connect to: CMSR_NOLB=( DESCRIPTION= (ADDRESS= (PROTOCOL=TCP) (HOST=cmsr2-s.cern.ch) (PORT=10121) ) (LOAD_BALANCE=on) (ENABLE=BROKEN) (CONNECT_DATA= (SERVER=DEDICATED) (SERVICE_NAME=cmsr_nolb.cern.ch) ) )
I guess it's the same problem hitting oracle, mysql and mariadb here, isn't it? If so, don't you prefer to "debug" it in a local DB (FNAL agents)?
If you really want to debug it on the Oracle side, then I would prefer to setup a testbed agent with a preprod oracle instance (though we basically have no resources at CERN at this moment).
Starting from version 5.6.2 mysql can be configured to log all the deadlocks (http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_print_all_deadlocks). We have mysql version 5.1.58. According to the mysql manuel, that SHOW ENGINE INNODB STATUS can show the last deadlock, but I did not see any deadlock records while running this command on cmssrv98. Maybe the reported deadlock was on a different mysql or our version was too old to record any deadlock?
Anyway, it does not look good to me for using mysql to debug the problem unless we move to 5.6.2 or higher version.
If we can change the connection to cms_wmbs_prod5 that is the oracle DB has most data to a fixed oracle connection, then we can ask the DBA to send us Oracle trace files to have a closed look at the deadlocks. She had said that with the current connection configuration, she could not do anything about the deadlock debugging. Changing the connection will not affect any DB operation and we will continue to write to the same DB, but we need to drop all the current connections and recreate the connections.
Create a test stand to write to the test db may not help us to get the deadlock unless there are enough sessions to update the WMBS_JOB table simultaneously. Any other ideas?
The MariaDB we have been testing is equivalent to mysql 5.5 as Seangchan pointed to me. And that is the current stable version.
Sorry Yuyi, I gave you wrong information, It seems there are stable version for 10.0.14. https://downloads.mariadb.org/ And our testing version is 10.0.12 https://github.com/cms-sw/cmsdist/blob/comp_gcc481/mariadb.spec
However, no production agent is using MariaDB yet.
The MariaDB 10 series is not entirely based on MySQL 5.5. It's a mix of 5.5 and 5.6 and not directly comparable anymore. The innodb_print_all_deadlocks parameter was introduced in MariaDB 5.5.30 btw, so it should definitely be in 10.
Thanks Dirk.
Can someone let me know on which node we are testing/using MariaDB? I will take a look what we have right now.
@amaltaro Alan: the commandline switch is --innodb-print-all-deadlocks=1. See https://mariadb.com/kb/en/mariadb/documentation/storage-engines/xtradb-and-innodb/xtradbinnodb-server-system-variables/#innodb_print_all_deadlocks for details.
Thanks Yuyi! I applied this change innodb_print_all_deadlocks=1 to cmsgwms-submit1 and cmssrv219 mysql conf. Components and services were restarted.
InnoDB error log show two deadlocks recorded after Alan set innodb_print_all_deadlocks=1 for the server. However, the log still not enough for debugging the problem. We have to turn on a several more switches in order to get more for the deadlock log. The current log just recorded part of the deadlock, we cannot build the whole picture. The wmagent log only logged one of the DMLs that caused the deadlock. So it did not give us the other half either.
Here is a list of things we need to turn on to get more info when deadlock happens:
After we are done the above steps, we may get all the necessary logs for the deadlock. However, I am not sure how long the certain info will stay in the information_schema tables or not overwrite by new coming.
Copied the log files here so we have all the reference in the issue.
2014-11-11 17:25:49 7f7e78681700InnoDB: transactions deadlock detected, dumping detailed information.
2014-11-11 17:25:49 7f7e78681700
* (1) TRANSACTION:
TRANSACTION 10821930, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1184, 9 row lock(s), undo log entries 8
MySQL thread id 290167, OS thread handle 0x7f7e781e9700, query id 291873984 localhost cmsdataops updating
UPDATE bl_runjob SET status = '0' WHERE id = 969501
* (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 51 page no 2699 n bits 352 index PRIMARY
of table wmagent
.bl_runjob
trx id 10821930 lock_mode X locks rec but not gap waiting
* (2) TRANSACTION:
TRANSACTION 10821928, ACTIVE 1 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1184, 99 row lock(s), undo log entries 98
MySQL thread id 290188, OS thread handle 0x7f7e78681700, query id 291874030 localhost cmsdataops updating
UPDATE bl_runjob SET status = '0' WHERE id = 969487
* (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 51 page no 2699 n bits 352 index PRIMARY
of table wmagent
.bl_runjob
trx id 10821928 lock_mode X locks rec but not gap
* (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 51 page no 2699 n bits 352 index PRIMARY
of table wmagent
.bl_runjob
trx id 10821928 lock_mode X locks rec but not gap waiting
* WE ROLL BACK TRANSACTION (1)
2014-11-11 17:25:50,771:INFO:Harness:>>>Terminating worker threads 2014-11-11 17:25:50,837:ERROR:BaseWorkerThread:Error in event loop (2): <WMCore.BossAir.StatusPoller.StatusPoller instance at 0x2f4b908> StatusPollerException Message: Unhandled error in statusPoller(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') "UPDATE bl_runjob SET status = '0' WHERE id = %s" [(969487L,), (969488L,), (969497L,), (969489L,), (969490L,), (969493L,), (969494L,), (969496L,) ... displaying 10 of 95 total bound parameter sets ... (969480L,), (969481L,)] ModuleName : WMCore.BossAir.StatusPoller MethodName : algorithm ClassInstance : None FileName : /data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py ClassName : None LineNumber : 78 ErrorNr : 0
Traceback: File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 65, in algorithm self.checkStatus()
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/StatusPoller.py", line 91, in checkStatus runningJobs = self.bossAir.track()
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 566, in track self._complete(jobs = jobsToComplete)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/BossAirAPI.py", line 618, in _complete transaction = self.existingTransaction())
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/BossAir/MySQL/CompleteJob.py", line 41, in execute transaction = transaction)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 167, in processData returnCursor=returnCursor))
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/MySQLCore.py", line 140, in executemanybinds returnCursor)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 114, in executemanybinds result = connection.execute(s, b) File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 719, in execute return self._execute_text(object, multiparams, params)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 868, in _execute_text statement, parameters
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context context)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception exc_info
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause reraise(type(exception), exception, tb=exc_tb)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 924, in _execute_context context)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-sqlalchemy/0.9.6/lib/python2.6/site-packages/sqlalchemy/connectors/mysqldb.py", line 67, in do_executemany rowcount = cursor.executemany(statement, parameters)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-mysqldb/1.2.4b4/lib/python2.6/site-packages/MySQLdb/cursors.py", line 234, in executemany r = r + self.execute(query, a)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-mysqldb/1.2.4b4/lib/python2.6/site-packages/MySQLdb/cursors.py", line 202, in execute self.errorhandler(self, exc, value)
File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/external/py2-mysqldb/1.2.4b4/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue
Backtrace: File "/data/srv/wmagent/v1.0.0.patch2/sw.pre.amaltaro/slc6_amd64_gcc481/cms/wmagent/1.0.0.patch2/lib/python2.6/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 198, in call raise ex
2014-11-11 17:25:50,838:INFO:BaseWorkerThread:Worker thread <WMCore.BossAir.StatusPoller.StatusPoller instance at 0x2f4b908> terminated
2014-10-28 23:13:17 7f7e783a2700InnoDB: transactions deadlock detected, dumping detailed information.
2014-10-28 23:13:17 7f7e783a2700
* (1) TRANSACTION:
TRANSACTION 2908822, ACTIVE 37 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 16 lock struct(s), heap size 2936, 23 row lock(s), undo log entries 15
MySQL thread id 215830, OS thread handle 0x7f7e781e9700, query id 71363435 localhost cmsdataops updating
UPDATE wmbs_fileset SET open = 1, last_update = 1414555997 WHERE name = '/dmason_HCA-Spring14dr-Backfill-00002_00187_v1__141022_060213_3297/StepOneProc/unmerged-logArchive'
* (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4 page no 5 n bits 168 index PRIMARY
of table wmagent
.wmbs_fileset
trx id 2908822 lock_mode X locks rec but not gap waiting
* (2) TRANSACTION:
TRANSACTION 2908821, ACTIVE 38 sec inserting
mysql tables in use 2, locked 1
17 lock struct(s), heap size 2936, 108 row lock(s), undo log entries 495
MySQL thread id 217627, OS thread handle 0x7f7e783a2700, query id 71363437 localhost cmsdataops Sending data
INSERT INTO wmbs_fileset_files (fileid, fileset, insert_time)
SELECT id, 13, 1414555997 FROM wmbs_file_details WHERE lfn = '/store/unmerged/Spring14drBackfill/QCD_Pt-5to3000_Tune4C_Flat_13TeV_pythia8/GEN-SIM-RECO/BACKFILL-v6/10076/7ADA4756-1D5F-E411-92E2-7845C4FC3B18.root'
* (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4 page no 5 n bits 168 index PRIMARY
of table wmagent
.wmbs_fileset
trx id 2908821 lock mode S locks rec but not gap
* (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4 page no 5 n bits 168 index PRIMARY
of table wmagent
.wmbs_fileset
trx id 2908821 lock mode S locks rec but not gap waiting
* WE ROLL BACK TRANSACTION (1)
@amaltaro
After carefully looking into the mysql on cmsgwms-submit1.fnal.gov, it seemed that the new mysql plugin was already in the current mysql server because the three tables that the plugin brings in is already in the DB, however, there is no data in the table. Reading online(http://www.chriscalender.com/debugging-innodb-locks-using-the-new-innodb-plugin%E2%80%99s-information-schema-tables/), we need to change the config file and restart mysql.
Alan, Could please do below?
2.) Enabling the InnoDB Lock Monitor: CREATE TABLE innodb_lock_monitor (a INT) ENGINE=INNODB;
3.)Run "mysqladmin debug" Tell the server to write debug in the error log.
Then we will wait for the deadlock. When we get the deadlock, please let me know and I will need to get the lock info from the tables before we can restart the agents.
If we get useful info out of this, I assume we can put this in the spec file for the future?
Yuyi, I do not know where you see 2 lines only :-) I added the lines you provided in 1) to this file: /data/srv/wmagent/current/config/mysql/my.cnf but then I'm not able to start mysql, it gets stuck with this output
Starting mysql...
starting mysqld_safe...
Checking MySQL Socket file exists...
BTW, don't we use dll libraries for windows only? Maybe there are different setups for Linux box? BTW 2, since cmsgwms-submit1 is now being used for real production, I'm trying to replicate these configurations in our testing box (cmssrv113).
Just in case, I also paste here a diff between the original file and the new one:
[cmsdataops@cmssrv113 current]$ diff -rup config/mysql/my.cnf{.ori,}
--- config/mysql/my.cnf.ori 2014-12-10 10:00:45.786283947 -0600
+++ config/mysql/my.cnf 2014-12-10 10:04:09.009188281 -0600
@@ -34,6 +34,18 @@ innodb_checksums=0
innodb_doublewrite=0
innodb_locks_unsafe_for_binlog = true
+# write deadlock information to the error log
+innodb_print_all_deadlocks=1
+ignore_builtin_innodb
+plugin-load=innodb=ha_innodb_plugin.dll
+innodb_trx=ha_innodb_plugin.dll
+innodb_locks=ha_innodb_plugin.dll
+innodb_lock_waits=ha_innodb_plugin.dll
+innodb_cmp=ha_innodb_plugin.dll
+innodb_cmp_reset=ha_innodb_plugin.dll
+innodb_cmpmem=ha_innodb_plugin.dll
+innodb_cmpmem_reset=ha_innodb_plugin.dll
Alan, sorry for the trouble. Anything is current running on cmssrv113? If not, could you give me access to the machine? I will figure out it using the testbed before I ask you try it on cmsgwms-submit1.
Julian is running a test there, if possible we should try not to kill that, but if we do... then it's not the end of the world :-) cmssrv113 is at FNAL, so it uses the usual cmsdataops user. If that does not work for you, talk to Krista plz.
NP. I can wait. Julian, please let me know when the test is done.
It seems the machine is free now. Go ahead.
I setup mariadb on cmssrv113 with the new configureation, disable build-in innodb and enabled plugin innodb. These changes include: removing some of the deprecated gloabl values and changing to new set of parameters and turning off autocommit. By default, mariadb is autocommit on. I don't know how the wmagent API control the transactions. If the autocommit is on, all the DMLs comitted as soon as thay are executed? This will increase the chance of deadlock.
I also ran mysqladmin by hand to send debug info into error log. Due to all the logs and monitoring, the db server will run slower than normal.
Julian, could you please run some tests on this node? In case you need to resart the db server, I have to run some commands manully before you run the tests.
Let me know if you see any differece during the test regarding the speed of db access or any other issues? I will examin the logs during the tests.
@yuyiguo thanks for setting up cmssrv113 to deadlock debug mode. Just a reminder:
I've just managed to replicate the/a deadlock in vocms0304, JobStatusLite component is dead at the moment and I'll wait for you to look at it. I'm sending you more details in a private mail.
Oracle deadlock was reproduced and trace file showed that the problem was caused by missing indexes on foreign keys. So I have added below indexes.
CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_WMBS ON CMS_WMBS_PROD9.BL_RUNJOB(WMBS_ID);
CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_LOCATION ON CMS_WMBS_PROD9.BL_RUNJOB(LOCATION);
CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_STATUS ON CMS_WMBS_PROD9.BL_RUNJOB(SCHED_STATUS);
CREATE INDEX CMS_WMBS_PROD9.IDX_BL_RUNJOB_USERS ON CMS_WMBS_PROD9.BL_RUNJOB(USER_ID);
Alan will test the change by reproduce the deadlock.
I haven't seen this happening again in Oracle. Since we have mostly used CERN agents for the last half year, I haven't seen this in MySQL as well. I actually don't remember if we managed to fix anything on the mysql backend.
Creating this issue so we do not lose track of it.
Here is part the traceback got back from the JobStatusLite logs