CMSCompOps / WmAgentScripts

CMS Workflow Team Scripts

showError gets stuck due to a locking mechanism #1114

Open haozturk opened 1 year ago

haozturk commented 1 year ago

Impact of the bug: Unified/showError.py

Describe the bug: The job gets stuck because it waits for a lock to be released, but the lock is never released. I don't know the details of this locking mechanism; we should analyze it.

Waiting for other createLogDB_pdmvserv_task_HIG-RunIISummer15wmLHEGS-05496__v1_T_220914_152834_6090 components to stop running
[{'_id': ObjectId('63c5096b94ebe3250cce84f7'), 'component': 'createLogDB_pdmvserv_task_HIG-RunIISummer15wmLHEGS-05496__v1_T_220914_152834_6090', 'host': 'vocms0277.cern.ch', 'pid': 1541838, 'time': 1673853787.0, 'date': 'Mon Jan 16 08:23:07 2023'}]
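One way to tell whether a record like the one above is a leftover from a dead process is to check its `host`, `pid`, and `time` fields. The following is only a rough sketch under assumptions, not the actual Unified locking code: the function names, the `SAMPLE_RECORD` constant, and the 6-hour threshold are hypothetical, and the `_id` field is left out so the example has no MongoDB dependency.

```python
import os
import socket
import time

# Hypothetical threshold: treat a lock older than this as suspicious.
MAX_LOCK_AGE_SECONDS = 6 * 3600

# Field values copied from the record printed in the log above (without '_id').
SAMPLE_RECORD = {
    "component": "createLogDB_pdmvserv_task_HIG-RunIISummer15wmLHEGS-05496__v1_T_220914_152834_6090",
    "host": "vocms0277.cern.ch",
    "pid": 1541838,
    "time": 1673853787.0,
}


def pid_is_alive(pid):
    """Return True if a process with this PID exists on the local machine (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_looks_stale(record, now=None, local_host=None,
                     pid_check=pid_is_alive, max_age=MAX_LOCK_AGE_SECONDS):
    """Heuristic: the owning PID is gone, or the lock is older than max_age."""
    now = time.time() if now is None else now
    local_host = socket.gethostname() if local_host is None else local_host
    # The owner process can only be checked on the machine that created the lock.
    if record["host"] == local_host and not pid_check(record["pid"]):
        return True
    return now - record["time"] > max_age


if __name__ == "__main__":
    print("stale:", lock_looks_stale(SAMPLE_RECORD))
```

Run on the host named in the record, this would report the lock as stale when the PID no longer exists. On any other host it can only fall back to the age check, which is why the threshold matters.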

How to reproduce it: Check the log 2023-01-13_12:35:07.log

Expected behavior: We need to understand this locking mechanism and why the lock takes so long to be released.

Additional context and error message: None. @z4027163 FYI

haozturk commented 1 year ago

I disabled the logbuster functionality (https://github.com/CMSCompOps/WmAgentScripts/commit/ac957243da4109f55604fd2b7bc8e3297173fb25) to observe how the issue evolves.