dmwm / WMCore

Core workflow management components for CMS.

Workflow ACDC Data endpoint timeout in GlobalWorkQueue #12003

Open hassan11196 opened 4 months ago

hassan11196 commented 4 months ago

Impact of the bug: GlobalWorkQueue, WMCore Central Services

Describe the bug: The GlobalWorkQueue fails to fetch data from the ACDC server and times out in the method _getFilesetInfo of the DataCollectionService.

Error Log from Agent:

ERROR:reqmgrInteraction:Exception splitting wqe cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136 for cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136: (28, 'Operation timed out after 270000 milliseconds with 786160985 bytes received')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1138, in processInboundWork
    work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1078, in _splitWork
    units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 167, in __call__
    self.split()
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 57, in split
    for block in self.validBlocks(self.initialTask):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 122, in validBlocks
    validBlocks = splittingFunc(acdc, acdcInfo, task)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 154, in singleChunk
    acdcBlock = acdc.singleChunkFileset(acdcInfo['collection'],
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/CouchUtils.py", line 63, in wrapper
    return funcRef(x, *args, **opts)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ACDC/DataCollectionService.py", line 299, in singleChunkFileset
    files = self._getFilesetInfo(collectionName, filesetName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/CouchUtils.py", line 63, in wrapper
    return funcRef(x, *args, **opts)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ACDC/DataCollectionService.py", line 225, in _getFilesetInfo
    results = self.couchdb.loadView("ACDC", "coll_fileset_docs", option, keys)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 576, in loadView
    retval = self.post('/%s/_design/%s/_view/%s?%s' % \
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 154, in post
    return self.makeRequest(uri, data, 'POST', incoming_headers,
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 133, in makeRequest
    result, status, reason, cached = JSONRequests.makeRequest(
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/usr/local/lib/python3.8/site-packages/Utils/PortForward.py", line 66, in portMangle
    return callFunc(callObj, newUrl, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 338, in request
    curl.perform()
pycurl.error: (28, 'Operation timed out after 270000 milliseconds with 786160985 bytes received')
ERROR:reqmgrInteraction:Unknown error processing cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136


How to reproduce it: I can't access the GlobalWorkQueue logs at the moment, but from my discussion with Alan, the GlobalWorkQueue will keep trying to fetch this info at every run, so the logs should have a record of it being tried continuously for cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136.

Expected behavior: The endpoint should not time out.

Additional context and error message

Log File: workqueue-20240529-workqueue-69546dff49-2lj7r.log

khurtado commented 4 months ago

@amaltaro Shouldn't we have a dynamic (or larger) value for the transfer timeout (as opposed to the connection timeout)?

Looking at the pycurl timeout settings:

https://github.com/dmwm/WMCore/blob/86bc3ff3cd9f5ea02df010dd44c4cd3228e0b2cd/src/python/WMCore/Services/pycurl_manager.py#L446-L449

I see we set both CONNECTTIMEOUT and TIMEOUT to 270. While CONNECTTIMEOUT makes sense, TIMEOUT does not make sense to me. According to the pycurl documentation, this is the total transfer time we are allowed, including the time spent connecting.

https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html

That means that if we take 200 seconds to connect for any reason, we will only have 70 seconds left to finish the transfer. The pycurl documentation recommends being careful with the TIMEOUT option for dynamic transfers.

curl has a default connection timeout of 300 seconds and a default TIMEOUT of 0, meaning no limit once the transfer starts. Even if we don't set TIMEOUT to 0, I think this value should be at least twice CONNECTTIMEOUT; otherwise it makes no sense to manage to connect after, e.g., 269 seconds only to time out one second later due to the TIMEOUT setting.

For this particular issue, it looks like pycurl downloaded about 780 MB in those 270 seconds (roughly a 2.8 MB/s average transfer rate if we neglect the connection time) and then timed out due to this setting. I tested setting TIMEOUT on an isolated case and indeed pycurl will simply time out once we get past this walltime limit, even if the transfer is progressing normally. The message was consistent with the WQ log files, meaning this timeout is caused by TIMEOUT and not by CONNECTTIMEOUT.
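
For reference, here is a minimal standalone sketch of that behaviour (the URL is a placeholder; error 28 is CURLE_OPERATION_TIMEDOUT):

```python
import pycurl
from io import BytesIO

buf = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "https://example.org/some/large/view")  # placeholder URL
curl.setopt(pycurl.WRITEDATA, buf)
curl.setopt(pycurl.CONNECTTIMEOUT, 270)  # time allowed to establish the connection
curl.setopt(pycurl.TIMEOUT, 270)         # total wall-clock budget for the whole transfer
try:
    curl.perform()
except pycurl.error as exc:
    # raises (28, 'Operation timed out ...') even while data is still being received
    print(exc)
finally:
    curl.close()
```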

amaltaro commented 4 months ago

@khurtado before we start looking into the source code, I think it would be better to understand the size of the workload spec and why it is so large. A spec file of hundreds of MBs makes no sense at all to me.

amaltaro commented 3 months ago

I somehow misread this issue and thought that the problem was on the agent(s) acquiring such large workflows. It turns out it is all in the global workqueue, where workflows are failing to be acquired (transition from staged to acquired status).

Looking into the reqmgrInteractionTask-workqueue-69546dff49-2lj7r-20240610.log file, I see the following very large workflows:

pycurl.error: (28, 'Operation timed out after 270000 milliseconds with 751166508 bytes received')
2024-06-10 23:37:36,877:ERROR:WorkQueueReqMgrInterface:Unknown error processing cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00265__v1_T_240410_201352_2044

and

pycurl.error: (28, 'Operation timed out after 269999 milliseconds with 787209247 bytes received')
2024-06-10 23:33:06,474:ERROR:WorkQueueReqMgrInterface:Unknown error processing cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136

and it could be that they have even more data to be looked up from the database, but we know that there is at least >700 MB of data coming from the ACDCServer.

Given that these have been failing for weeks now - and that I see some degradation with CouchDB replication to the agents - I wonder if we can reject these ACDC workflows for now and investigate them in a more controlled manner?

khurtado commented 3 months ago

@amaltaro I see both workflows were aborted already. GWQ is not showing timeout messages anymore, so it seems it was only these 2.

khurtado commented 3 months ago

I reproduced the issue interactively:

WQ times out at this point when loading Couch ACDC views:

  File "/usr/local/lib/python3.8/site-packages/WMCore/ACDC/DataCollectionService.py", line 299, in singleChunkFileset
    files = self._getFilesetInfo(collectionName, filesetName)

I looked at the following request: cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136

The ACDC collection pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765 and fileset pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765/BPH-RunIISummer20UL18GEN-00268_0 are what fail to load and time out:

>>> from WMCore.WMSpec.WMWorkload import WMWorkloadHelper
>>> from WMCore.ACDC.DataCollectionService import DataCollectionService
>>> HELPER = WMWorkloadHelper()
>>> reqUrl = "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache"
>>> s = HELPER.loadSpecFromCouch(reqUrl, requestName="cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136")
>>> HELPER.getRequestType()
'TaskChain'
>>> dcs = DataCollectionService(database="acdcserver", url="https://cmsweb.cern.ch/couchdb")
>>> for task in HELPER.getAllTasks():
...     print(task.name())
...     print(task.getInputACDC())
...
BPH-RunIISummer20UL18GEN-00268_0
{'server': 'https://cmsweb.cern.ch/couchdb', 'collection': 'pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765', 'fileset': '/pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765/BPH-RunIISummer20UL18GEN-00268_0', 'database': 'acdcserver'}
>>> dcs._getFilesetInfo("pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765", "/pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765/BPH-RunIISummer20UL18GEN-00268_0")

If the pycurl timeout is increased to 600s, I am able to fetch the object, which is HUGE and looks like this:

 'MCFakeFile-pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765-BPH-RunIISummer20UL18GEN-00268_0-0d99bf297947b020bf1838f42f2e5639', 'locations': ['T1_DE_KIT_Disk', 'T1_ES_PIC_Disk', 'T1_FR_CCIN2P3_Disk', 'T1_IT_CNAF_Disk', 'T1_RU_JINR_Disk', 'T1_UK_RAL_Disk', 'T1_US_FNAL_Disk', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_CH_CERN', 'T2_CH_CSCS', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 'T2_FR_GRIF_LLR', 'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PL_Swierk', 'T2_RU_JINR', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 'T2_US_Caltech_Ceph', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T3_CH_CERNBOX', 'T3_US_Baylor', 'T3_US_Colorado'], 'id': 952928, 'checksums': {}, 'events': 70281, 'merged': '0', 'size': 0, 'runs': [{'run_number': 1, 'lumis': [540872]}], 'parents': []}, {'last_event': 0, 'first_event': 3653487505, 'lfn': 'MCFakeFile-pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765-BPH-RunIISummer20UL18GEN-00268_0-0d99bf297947b020bf1838f42f2e5639', 'locations': ['T1_DE_KIT_Disk', 'T1_ES_PIC_Disk', 'T1_FR_CCIN2P3_Disk', 'T1_IT_CNAF_Disk', 'T1_RU_JINR_Disk', 'T1_UK_RAL_Disk', 'T1_US_FNAL_Disk', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_CH_CERN', 'T2_CH_CSCS', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 'T2_FR_GRIF_LLR', 'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PL_Swierk', 'T2_RU_JINR', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 'T2_US_Caltech_Ceph', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T3_CH_CERNBOX', 'T3_US_Baylor', 'T3_US_Colorado'], 'id': 952928, 'checksums': {}, 'events': 70281, 'merged': '0', 'size': 0, 'runs': [{'run_number': 1, 'lumis': [540873]}], 'parents': []}, {'last_event': 0, 'first_event': 3653557786, 'lfn': 'MCFakeFile-pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765-BPH-RunIISummer20UL18GEN-00268_0-0d99bf297947b020bf1838f42f2e5639', 'locations': ['T1_DE_KIT_Disk', 'T1_ES_PIC_Disk', 'T1_FR_CCIN2P3_Disk', 'T1_IT_CNAF_Disk', 'T1_RU_JINR_Disk', 'T1_UK_RAL_Disk', 'T1_US_FNAL_Disk', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_CH_CERN', 'T2_CH_CSCS', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 'T2_FR_GRIF_LLR', 'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PL_Swierk', 'T2_RU_JINR', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 'T2_US_Caltech_Ceph', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T3_CH_CERNBOX', 'T3_US_Baylor', 'T3_US_Colorado'], 'id': 952928, 'checksums': {}, 'events': 70281, 'merged': '0', 'size': 0, 'runs': [{'run_number': 1, 'lumis': [540874]}], 'parents': []}, {'last_event': 0, 'first_event': 3653628067, 'lfn': 'MCFakeFile-pdmvserv_task_BPH-RunIISummer20UL18GEN-00268__v1_T_231220_233317_8765-BPH-RunIISummer20UL18GEN-00268_0-0d99bf297947b020bf1838f42f2e5639', 'locations': ['T1_DE_KIT_Disk', 'T1_ES_PIC_Disk', 'T1_FR_CCIN2P3_Disk', 'T1_IT_CNAF_Disk', 'T1_RU_JINR_Disk', 'T1_UK_RAL_Disk', 'T1_US_FNAL_Disk', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_CH_CERN', 'T2_CH_CSCS', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 'T2_FR_GRIF_LLR', 
'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PL_Swierk', 'T2_RU_JINR', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 'T2_US_Caltech_Ceph', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T3_CH_CERNBOX', 'T3_US_Baylor', 'T3_US_Colorado'], 'id': 952928, 'checksums': {}, 'events': 70281, 'merged': '0', 'size': 0, 'runs': [{'run_number': 1, 'lumis': [540875]}], 'parents': []}, {'last_event': 0, 'first_event': 3653698348, 'lfn

@hassan11196 Is the fileset information from this workflow expected to be this huge (several hundreds of MBs)?

amaltaro commented 3 months ago

@khurtado given that these requests were failing for many weeks already and I started seeing random CouchDB issues, I thought it would be better to abort them for now, until we can better understand this problem and how to recover those workflows.

If pycurl timeout is increased to 600s, I get to fetch the object, which is HUGE and looks like this:

That is somewhat surprising to me. We have always had a hard timeout of 5 minutes in the CMSWEB frontends. Given that you kept the connection open for 10 minutes, it makes me believe that those frontend timeouts have been lifted.

Given that you managed to fetch this data, can you please dump it into the filesystem such that we can start analysing it?

The few questions I have so far are:

Do you see anything else that might be relevant for this investigation?

khurtado commented 3 months ago

@amaltaro

Now cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00268__v1_T_240429_202726_4136 gives me an empty object, but cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00265__v1_T_240410_201352_2044 still works.

Here are my answers based on that workflow ( cmsunified_ACDC0_task_BPH-RunIISummer20UL18GEN-00265__v1_T_240410_201352_2044):

The original request is pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834. This workflow had a 63.5% failure rate, which translates into 594 failed jobs.

>>> type(o)
<class 'list'>
>>> len(o)
626666
>>> o[0]
{'last_event': 0, 'first_event': 2209931464, 'lfn': 'MCFakeFile-pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834-BPH-RunIISummer20UL18GEN-00265_0-0037bc6d7147a9a141fa59c99035b93c', 
'locations': ['T1_DE_KIT_Disk', 'T1_ES_PIC_Disk', 'T1_FR_CCIN2P3_Disk', 'T1_IT_CNAF_Disk', 'T1_RU_JINR_Disk', 
'T1_UK_RAL_Disk', 'T1_US_FNAL_Disk', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_CH_CERN', 
'T2_CH_CSCS', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 
'T2_FR_GRIF_LLR', 'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PL_Swierk', 
'T2_RU_JINR', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 
'T2_US_Caltech_Ceph', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T3_CH_CERNBOX', 
'T3_US_Baylor', 'T3_US_Colorado'], 
'id': 1175137, 'checksums': {}, 'events': 103263, 'merged': '0', 'size': 0,
 'runs': [{'run_number': 1, 'lumis': [437322]}], 'parents': []}
>>> o[1]
{'last_event': 0, 'first_event': 2252166031, 'lfn': 'MCFakeFile-pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834-BPH-RunIISummer20UL18GEN-00265_0-0037bc6d7147a9a141fa59c99035b93c',
 'locations': ['T1_DE_KIT_Disk', 'T1_ES_PIC_Disk', 'T1_FR_CCIN2P3_Disk', 'T1_IT_CNAF_Disk', 'T1_RU_JINR_Disk', 
'T1_UK_RAL_Disk', 'T1_US_FNAL_Disk', 'T2_AT_Vienna', 'T2_BE_IIHE', 'T2_BE_UCL', 'T2_BR_SPRACE', 'T2_CH_CERN', 
'T2_CH_CSCS', 'T2_DE_DESY', 'T2_DE_RWTH', 'T2_EE_Estonia', 'T2_ES_CIEMAT', 'T2_FR_GRIF', 'T2_FR_GRIF_IRFU', 
'T2_FR_GRIF_LLR', 'T2_IT_Bari', 'T2_IT_Legnaro', 'T2_IT_Pisa', 'T2_IT_Rome', 'T2_KR_KISTI', 'T2_PL_Swierk', 
'T2_RU_JINR', 'T2_TW_NCHC', 'T2_UA_KIPT', 'T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_US_Caltech', 
'T2_US_Caltech_Ceph', 'T2_US_Nebraska', 'T2_US_Purdue', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T3_CH_CERNBOX', 
'T3_US_Baylor', 'T3_US_Colorado'], 
'id': 1175137, 'checksums': {}, 'events': 103263, 'merged': '0', 'size': 0,
 'runs': [{'run_number': 1, 'lumis': [437731]}], 'parents': []}

khurtado commented 3 months ago

@amaltaro Considering the data above, what should be done? The structure is as it is supposed to be; there are just way too many entries.

amaltaro commented 3 months ago

@khurtado the following 2 comments had me lost.

This workflow had a 63.5% failure rate, which translates into 594 failed jobs

This is a list with +626K elements

How many jobs have failed for this workflow? Is it 594 or 626k jobs?

Note that the workflow is not meant to recover every single document uploaded to the ACDC server, but only those that match this collection name (extracted from the ReqMgr2 workflow description):

  "InitialTaskPath": "/pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834/BPH-RunIISummer20UL18GEN-00265_0",

Translating this to the ACDC collection name, we need to query (don't do this in your browser!): https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834%22&include_docs=true&reduce=false

and then match against rows/docs that belong to the fileset_name (I could be mistaken on the property name) "/pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834/BPH-RunIISummer20UL18GEN-00265_0"

Can you please pull this data from CouchDB - with curl or something similar - save it in a JSON file, and then use that JSON to answer:

1. how many rows/docs in total we have in this collection (and what is the total size of the JSON file)?
2. how many rows/docs we have matching the InitialTaskPath above (and what would be the total size of a JSON file only with those in)?
3. IF we were to change the locations value to a comma separated string, what would be the results for 2) above?
4. IF we were to change the locations value to an empty list, what would be the results for 2) above?
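
For illustration, here is a minimal sketch of how that data could be pulled and filtered with plain Python (the requests package, certificate paths and output file name are assumptions, not the actual procedure used):

```python
import json
import requests

viewUrl = ("https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName"
           "?key=%22pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834%22"
           "&include_docs=true&reduce=false")
cert = ("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholder grid certificate/key

# pull the whole view (this is the >1 GB response discussed in this issue)
resp = requests.get(viewUrl, cert=cert, timeout=1800)
collectionJson = resp.json()

# save the raw dump for offline analysis
with open("collection_dump.json", "w") as fObj:
    json.dump(collectionJson, fObj)

# count docs matching the InitialTaskPath (the fileset_name property)
initialTaskPath = ("/pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834"
                   "/BPH-RunIISummer20UL18GEN-00265_0")
matching = [row for row in collectionJson["rows"]
            if row["doc"].get("fileset_name") == initialTaskPath]
print(len(collectionJson["rows"]), "rows returned;", len(matching), "match the fileset")
```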

I guess some of these you might have already answered, Kenyi, but in any case I think it will be helpful to compile all of this together.

If you have any other suggestions of how we could handle such huge number of documents as well, please do share them with us.

khurtado commented 3 months ago

@amaltaro Looking at the condor history for the original request pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834, I see 360K jobs ran successfully and 640K failed. I will look into the curl queries later today.

(screenshot: condor job history for the original request)

khurtado commented 3 months ago

@amaltaro

  1. how many rows/docs in total we have in this collection (and what is the total size of the JSON file)?

The total number of rows from CouchDB in the view itself is:

>>> collectionJson['total_rows']
3363688

The number of rows that passed the query parameters (key=workflowname&include_docs=true&reduce=false) is:

>>> len(collectionJson['rows'])
627173
  2. how many rows/docs we have matching the InitialTaskPath above (and what would be the total size of a JSON file only with those in)?

About 99.9% of the rows. The total size of the JSON file is 1.1 GB.

>>> n=0
>>> for i in collectionJson['rows']:
...     fname = i['doc']['fileset_name']
...     initialTaskPath = '/pdmvserv_task_BPH-RunIISummer20UL18GEN-00265__v1_T_231220_233314_8834/BPH-RunIISummer20UL18GEN-00265_0'
...     if fname == initialTaskPath:
...         n+=1
...
>>> n
626666
  3. IF we were to change the locations value to a comma separated string, what would be the results for 2) above?

Do you mean from this:

['T2_BE_IIHE', 'T2_CH_CERN',...]

to this: 'T2_BE_IIHE,T2_CH_CERN,...'

It made almost no difference, only 19 MB less:

$ du -ms *
1031 concatenatedLocations.json
1050 timeout_query_WQ.json

  4. IF we were to change the locations value to an empty list, what would be the results for 2) above?

From 1.1 GB to 706 MB

[cmst1@vocms0283 /tmp/cmst1]$ ls -lsh *
706M -rw-r--r--. 1 cmst1 zh 706M Jul  6 02:05 emptyLocations.json
1.1G -rw-r--r--. 1 cmst1 zh 1.1G Jul  6 01:22 timeout_query_WQ.json

So, with option 3, we get no relevant difference. With empty locations, we go from an average of 1.68 KB per row to 1.15 KB per row. I would say going below ~1 KB per row would be very challenging.

>>> 1055308 / 627173
1.6826425882491753
>>> 722116 / 627173
1.1513824734164257

We may be able to decrease the size by using a mapping, e.g.: T1_US_FNAL = 1, T2_CH_CERN = 2, etc. We know ~35% would be the decrease limit (when using empty locations), so this approach may reduce it by ~20% maybe; I can test. I'm not sure how we would keep the mapping up to date when a new site enters production, though.
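
As a minimal sketch of that mapping idea (the dictionary below is a deliberately incomplete, hypothetical example; keeping it in sync with new sites is the open question mentioned above):

```python
# hypothetical site-name mapping to shrink the locations payload
SITE_TO_ID = {"T1_US_FNAL_Disk": 1, "T2_CH_CERN": 2, "T2_DE_DESY": 3}
ID_TO_SITE = {v: k for k, v in SITE_TO_ID.items()}

def encodeLocations(locations):
    """Replace site names with short integer ids before uploading to ACDC."""
    return [SITE_TO_ID[site] for site in locations]

def decodeLocations(ids):
    """Restore site names when the document is read back."""
    return [ID_TO_SITE[i] for i in ids]

print(encodeLocations(["T2_CH_CERN", "T2_DE_DESY"]))  # -> [2, 3]
```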

Besides that, we can still increase the timeout limit and perhaps add a warning for collection names with over 500K rows, just to have a sense of how many of those we get and see if other actions are needed in the future. My guess is that if we haven't faced this issue before, it should not happen often.

amaltaro commented 3 months ago

Thank you for checking those out, Kenyi.

I wish changing locations from a list to a string were more impactful, but your tests show that this option has to be dropped.

With empty locations, we go from an average of 1.68 bytes per row, to 1.15 bytes per row. I would say going below ~1 byte per row would be very challenging.

Changing locations from a list of locations to an empty list yields better results; maybe it is the least intrusive change that we could make right now. Still, very large ACDC collections will remain large, as the storage decrease is only about 35% (in terms of locations).

We may be able to decrease the size by using a mapping. E.g.: T1_US_FNAL = 1, T2_CH_CERN =2, etc. We know ~35% would be the decrease limit (when using empty locations), so this approach may reduce it in ~20% maybe, I can test. I'm not sure how we can keep the mapping up to date when a new site enters production though.

About the location mapping, I think that would be hard to implement as we do not have a unique and global resource control, in addition to having different resources in different agents. I fear it would be too confusing as well.

Besides that, we can still increase the timeout limit and perhaps add a warning for collectionNames with over 500K rows, just to have a sense of how many of those we get and see if other actions are needed in the future. My guess is that if we haven't faced this issue before, it should not be often.

Adding a hard limit to ACDC collections is likely something that we will have to implement indeed; otherwise we cannot protect the system from hitting such problems at a much later stage. The challenge is that 500k records can vary a lot, depending on the job construction and what we want to recover (e.g., large vs small locations list, large vs small lumi list).

Please keep making suggestions here. I am still undecided on how to proceed with this issue, but the two options above (or one of them) are likely going to be our path forward.

khurtado commented 1 month ago

@amaltaro Coming back to this, do you know if we use file compression in CouchDB?

https://couch-doc-test.readthedocs.io/en/latest/config/couchdb.html#couchdb/file_compression It looks like Google Snappy compression should be the default, but I'm not quite sure that's our case.

Also, it looks like in _getFilesetInfo:

https://github.com/dmwm/WMCore/blob/39c50491199ff505886caed4ab1d3661844b2329/src/python/WMCore/ACDC/DataCollectionService.py#L223

We should be able to add an extra option parameter with a hard limit on the number of rows.

https://github.com/dmwm/WMCore/blob/39c50491199ff505886caed4ab1d3661844b2329/src/python/WMCore/Database/CMSCouch.py#L525

We could add a warning in the debug messages whenever we reach this limit.
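
A rough sketch of what that could look like around _getFilesetInfo (the keys layout and the assumption that loadView forwards view options such as limit and returns the decoded CouchDB response are mine, for illustration only):

```python
import logging

MAX_ACDC_ROWS = 500000  # hypothetical hard limit on rows returned by the ACDC view

def getFilesetInfoWithLimit(couchdb, collectionName, filesetName):
    """Sketch of a row-limited fileset lookup against the ACDC couch view."""
    option = {"include_docs": True, "reduce": False, "limit": MAX_ACDC_ROWS}
    keys = [[collectionName, filesetName]]  # assumed view key layout
    results = couchdb.loadView("ACDC", "coll_fileset_docs", option, keys)
    rows = results.get("rows", [])
    if len(rows) >= MAX_ACDC_ROWS:
        logging.warning("Fileset %s hit the %d-row limit; ACDC data may be truncated",
                        filesetName, MAX_ACDC_ROWS)
    return rows
```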

Another alternative: for this particular view for MC, most fields seem to repeat, so we could use the group feature in the CouchDB view. Of course, then we would need to rewrite the logic of all methods using [_getFilesetInfo](https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ACDC/DataCollectionService.py#L218).

amaltaro commented 1 month ago

@khurtado we do not have anything specific in the configuration. The documentation isn't clear about the default configuration either: https://docs.couchdb.org/en/3.2.3-docs/config/couchdb.html#couchdb/file_compression

There is a REST API through which we can fetch the CouchDB instance configuration though; we can check that out in testbed (production will be using the same configuration). Nonetheless, I think this compression only applies to writing data to the filesystem, meaning that the user will receive uncompressed data (and will send uncompressed data as well). If that is correct, then I am afraid it does not help with this issue.

About the limit option, I think this would be an obscure setting that would make it hard to understand whether an ACDC workflow is recovering all the documents or not.

Finally, the group option could be a good idea, as data would then be served to the client already grouped. However, if it only works for MC and not for real data, we would have to add further logic based on the workflow type.

However, looking into an MC-only solution, we might consider some refactoring of the MC job failure information, given that none of it is that useful in a recovery process. If we could ditch most of that information (and make lumis a range instead of a list), that could provide a substantial reduction.
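
As a minimal sketch of the lumi-range idea (the function name and the resulting document layout are hypothetical):

```python
def lumiListToRanges(lumis):
    """Collapse a sorted list of lumi numbers into [first, last] ranges."""
    ranges = []
    for lumi in sorted(lumis):
        if ranges and lumi == ranges[-1][1] + 1:
            ranges[-1][1] = lumi          # extend the current contiguous range
        else:
            ranges.append([lumi, lumi])   # start a new range
    return ranges

# e.g. [540872, 540873, 540874, 540875] -> [[540872, 540875]]
print(lumiListToRanges([540872, 540873, 540874, 540875]))
```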

khurtado commented 1 month ago

@amaltaro The group feature would require a map reduce function that I have not been able to figure out so far.

The Snappy compression is indeed for files on disk, not for communication to the client, so this is not an option.

But here is something I found: Looking into the views information: https://docs.couchdb.org/en/stable/api/ddoc/views.html

I see the encoding does not support Gzip/Deflate, only text/plain in a chunked format. So it seems we won't find a way to compress HTTP responses with current versions of CouchDB.

To work around this problem, I see many people use a reverse proxy to enable gzip. Nginx is an option, but since CouchDB uses ETags there is a small issue with it, described below:

https://broken-by.me/tag/accept-encoding-gzip/

The link above works around this by starting an HTTP proxy using Node.js packages.

This enables gzip compression of HTTP responses, which for the response in this issue would reduce the size sent to the client from 1.1 GB to 40 MB! curl accepts gzip encoding and decompresses it transparently, which I think we already do with MariaDB queries, so we should be okay on the client side.
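
On the client side, pycurl can already request and transparently decompress gzip; a minimal sketch (authentication and query parameters omitted, and note that older pycurl builds expose this option as pycurl.ENCODING instead of pycurl.ACCEPT_ENCODING):

```python
import pycurl
from io import BytesIO

buf = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName")
curl.setopt(pycurl.ACCEPT_ENCODING, "gzip")  # send Accept-Encoding: gzip and let libcurl decompress
curl.setopt(pycurl.WRITEDATA, buf)
curl.perform()
curl.close()
print(len(buf.getvalue()), "bytes after transparent decompression")
```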

While we do not have to deploy the proxy using the same method, I think this may be our best bet. It is an option that would require some deployment effort (CMSKubernetes) but no WMCore code development effort, and the final result would work for MC and DATA with 30+ times less network bandwidth for this sort of request. We could reduce the size of the HTTP responses with some code development effort, but I don't think to the scale that gzip compression offers.

CouchDB includes documentation on the usage of reverse proxies in their "Best practices" section:

https://docs.couchdb.org/en/stable/best-practices/reverse-proxies.html

The recommended proxy is HAProxy, which is available on Docker Hub: https://hub.docker.com/_/haproxy

So we could potentially deploy a new pod for the reverse proxy in CMSKubernetes. We would point CouchDB connections through the proxy and be done.

khurtado commented 1 month ago

As discussed in today's meeting, since this issue doesn't happen often, we are moving it to Q4 while we decide what the final solution should be.