dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

JobAccountant - component failure due to a thrown SQL exception from a single workflow #11819

Open todor-ivanov opened 10 months ago

todor-ivanov commented 10 months ago

Impact of the bug WMagent - JobAccountant

Describe the bug

During the work of migration to storage.json for stage in and stage out, we have stumbled on a failure [1] of the JobAccountant component which was caused due to a broken SQL query in one of the steps of the component's cycle. This was firstly reported here: https://github.com/dmwm/WMCore/pull/11790#issuecomment-1855343294

Even though the origin of the error is in a change that has left a WMBS file record without location, the effect is definitely undesired - the component completely failed. Even more, once the workflow itself has been aborted, the records in WMBS were still remaining and the component continued crashing. The place in JobAccountant, where this exception is thrown is here:

https://github.com/dmwm/WMCore/blob/87ea437f508308b5e1d544a234a80e2a9911d5c1/src/python/WMComponent/JobAccountant/AccountantWorker.py#L836-L867

How to reproduce it By breaking a file record in WMBS

Expected behavior The component should not break due to a single broken record in the database. Instead:

Additional context and error message [1]

2023-12-13 18:46:19,912:140004852008704:WARNING:AccountantWorker:The following file does not have any location: 

{'acquisitionEra': 'DMWM_Test',
 'branch_hash': 'eea4c0314f37f02ed8c1189b31abed42',
 'branches': [],
 'catalog': '',
 'checksums': {'adler32': '95f78c91', 'cksum': '519409673'},
 'configURL': 'https://cmsweb.cern.ch/couchdb;;reqmgr_config_cache;;a8fc2cf8e27cd8dd50cfde908ca8115f',
 'dataset': {'applicationName': 'cmsRun',
             'applicationVersion': 'CMSSW_12_4_14_patch1',
             'dataTier': 'MINIAOD',
             'primaryDataset': 'MuonEG',
             'processedDataset': 'DMWM_Test-ReReco_Run2022C_LumiMask_StageOutTest_Todor_v4-v11'},
 'events': 191,
 'fileRef': '<WMCore.Configuration.ConfigSection object at 0x7f556af7d610>',
 'first_event': 0,
 'globalTag': '124X_dataRun3_v15',
 'guid': '439838d7-40d1-480e-bc5c-d2d1f404d250',
 'input': ['/store/data/Run2022C/MuonEG/RAW/v1/000/355/865/00000/7c5c1c3c-178a-41c5-92ad-73e3b6dd09c8.root'],
 'inputPath': '/MuonEG/Run2022C-v1/RAW',
 'inputpfns': ['root://cmsxrootd-site.fnal.gov//store/data/Run2022C/MuonEG/RAW/v1/000/355/865/00000/7c5c1c3c-178a-41c5-92ad-73e3b6dd09c8.root'],
 'last_event': 0,
 'lfn': '/store/unmerged/DMWM_Test/MuonEG/MINIAOD/ReReco_Run2022C_LumiMask_StageOutTest_Todor_v4-v11/00000/439838d7-40d1-480e-bc5c-d2d1f404d250.root',
 'locations': set(),
 'merged': False,
 'module_label': 'MINIAODoutput',
 'outputModule': 'MINIAODoutput',
 'parents': set(),
 'pfn': '/srv/job/WMTaskSpace/cmsRun1/MINIAODoutput.root',
 'prep_id': 'TEST-ReReco-Run2022C-MuonEG-testlumi-00001',
 'processingVer': 11,
 'runs': '{<WMCore.DataStructs.Run.Run object at 0x7f556a7287c0>}',
 'size': 13836561,
 'validStatus': 'PRODUCTION'}

2023-12-13 18:46:19,912:140004852008704:WARNING:AccountantWorker:Job 78 , bad jobReport, failing job
2023-12-13 18:46:19,958:140004852008704:ERROR:AccountantWorker:Error while adding files to WMBS!
(MySQLdb.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ':location\n                 AND wfd.lfn = :lfn' at line 4")
[SQL: INSERT INTO wmbs_file_location (fileid, pnn)
                 SELECT wfd.id, wpnn.id
                 FROM wmbs_pnns wpnn, wmbs_file_details wfd
                 WHERE wpnn.pnn = :location
                 AND wfd.lfn = :lfn]
(Background on this error at: https://sqlalche.me/e/14/f405)
2023-12-13 18:46:20,057:140004852008704:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller object at 0x7f556b7f2b80> <@========== WMException Start ==========@>
Exception Class: AccountantWorkerException
Message: Error while adding files to WMBS!
(MySQLdb.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ':location\n                 AND wfd.lfn = :lfn' at line 4")
[SQL: INSERT INTO wmbs_file_location (fileid, pnn)
                 SELECT wfd.id, wpnn.id
                 FROM wmbs_pnns wpnn, wmbs_file_details wfd
                 WHERE wpnn.pnn = :location
                 AND wfd.lfn = :lfn]
(Background on this error at: https://sqlalche.me/e/14/f405)
        ClassName : None
        ModuleName : WMComponent.JobAccountant.AccountantWorker
        MethodName : handleWMBSFiles
        ClassInstance : None
        FileName : /data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py
        LineNumber : 867
        ErrorNr : 0

Traceback: 
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 851, in handleWMBSFiles
    self.setFileLocation.execute(lfn=fileLocations,

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/WMBS/MySQL/Files/SetLocationByLFN.py", line 35, in execute
    self.dbi.processData(self.sql, binds, conn=conn, transaction=transaction)

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/DBCore.py", line 148, in processData
    r = self.executebinds(i, connection=connection,

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/MySQLCore.py", line 119, in executebinds
    return DBInterface.executebinds(self, s, b, connection, returnCursor)

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/DBCore.py", line 62, in executebinds
    resultProxy = connection.execute(s)

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1365, in execute
    return self._exec_driver_sql(

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1669, in _exec_driver_sql
    ret = self._execute_context(

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
    util.raise_(

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-mysqlclient/2.1.1/lib/python3.8/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-mysqlclient/2.1.1/lib/python3.8/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)

  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-mysqlclient/2.1.1/lib/python3.8/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)

<@---------- WMException End ----------@>  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py", line 70, in algorithm
    self.accountantWorker(jobsSlice)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 247, in __call__
    self.handleWMBSFiles(self.wmbsFilesToBuild, self.parentageBinds)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 867, in handleWMBSFiles
    raise AccountantWorkerException(msg)

2023-12-13 18:46:20,057:140004852008704:INFO:Harness:>>>Terminating worker threads
amaltaro commented 10 months ago

@todor-ivanov Todor, I suspect this is actually a buggy job report with:

 'locations': set(),

instead of an incomplete SQL statement. If there is something to fix, I would say we need to ensure that job reports report a non-empty location for the output files.

todor-ivanov commented 10 months ago

hi @amaltaro

Yes, I completely agree with you - The current reason is the a missing WMBS file location, which I already mentioned. And we even know the reason why this location for the current workflow was left empty. And this reason is not related to the current bug, but was rather a consequence of a separate development and a test in the context of the migration to the storage.json. The current issue I created, because this whole set of DAO calls cited in the GH description, would inevitably lead to a complete crash of the component triggered by a single workflow (a file set with missing location this time or any other error of the sort in the future). To me, we should protect the component from interrupting under such scenario.

amaltaro commented 10 months ago

Yes, the approach we have been taking with the WMAgent components is that some errors are supposed to be soft while others are hard errors and a crash of the component is necessary to catch developers attention to properly investigate the issue. I agree that not having components crashing is the ultimate goal, but I fear that we won't pay enough attention to problems if we just treat everything as a soft error and leave developers with a sole notification of the error.

For this specific "no locations" issue, we used to have the same problem over the past many years and given the failure to properly identify and fix it at the root, we made this workaround: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobAccountant/AccountantWorker.py#L504-L508

which I expected to catch this case as well.

In addition to the solution you provided, it's not clear to me what you suggest to be done with the potentially ill job? Would retry it until someone take care of that?