dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

JobAccountant terminates with PNN doesn't exist in wmbs_pnns table #12016

Closed: d-ylee closed this issue 2 months ago

d-ylee commented 3 months ago

Impact of the bug
Agents: cmsgwms-submit8, cmsgwms-submit7, vocms0252, vocms0253

Describe the bug
While processing jobs, the JobAccountantPoller terminates with an AccountantWorkerException stating the PNN doesn't exist in the wmbs_pnns table: T3_KR_KISTI (investigate). All of the agents listed above fail with the same error for the same PNN (T3_KR_KISTI). However, checking MariaDB on the agents shows that T3_KR_KISTI is present in the wmbs_pnns table.

How to reproduce it
Steps to reproduce the behavior:

  1. Restart the agent in cmsgwms-submit8

Expected behavior
JobAccountant should not terminate.

Additional context and error message

Traceback from submit8:

2024-06-07 15:47:51,705:140137858733824:ERROR:AccountantWorker:PNN doesn't exist in wmbs_pnns table: T3_KR_KISTI (investigate)
2024-06-07 15:47:51,854:140137858733824:INFO:Timers:### Handling WMBS unmerged files took 0.15 seconds to complete
2024-06-07 15:47:51,854:140137858733824:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller object at 0x7f74634a50a0> <@========== WMException Start ==========@>
Exception Class: AccountantWorkerException
Message: PNN doesn't exist in wmbs_pnns table: T3_KR_KISTI (investigate)
        ClassName : None
        ModuleName : WMComponent.JobAccountant.AccountantWorker
        MethodName : handleWMBSFiles
        ClassInstance : None
        FileName : /data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py
        LineNumber : 829
        ErrorNr : 0

Traceback:

<@---------- WMException End ----------@>  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py", line 73, in algorithm
    self.accountantWorker(jobsSlice)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 250, in __call__
    self.handleWMBSFiles(self.wmbsFilesToBuild, self.parentageBinds)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 829, in handleWMBSFiles
    raise AccountantWorkerException(msg)

2024-06-07 15:47:51,854:140137858733824:INFO:Harness:>>>Terminating worker threads
2024-06-07 15:47:51,855:140137858733824:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller object at 0x7f74634a50a0> <@========== WMException Start ==========@>
Exception Class: AccountantWorkerException
Message: PNN doesn't exist in wmbs_pnns table: T3_KR_KISTI (investigate)
        ClassName : None
        ModuleName : WMComponent.JobAccountant.AccountantWorker
        MethodName : handleWMBSFiles
        ClassInstance : None
        FileName : /data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py
        LineNumber : 829
        ErrorNr : 0

Traceback:

<@---------- WMException End ----------@>
Backtrace:
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 209, in __call__
    raise ex
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py", line 73, in algorithm
    self.accountantWorker(jobsSlice)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 250, in __call__
    self.handleWMBSFiles(self.wmbsFilesToBuild, self.parentageBinds)
  File "/data/srv/wmagent/v2.3.3/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.3/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 829, in handleWMBSFiles
    raise AccountantWorkerException(msg)

2024-06-07 15:47:52,529:140137858733824:INFO:BaseWorkerThread:Worker thread <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller object at 0x7f74634a50a0> terminated
[cmsdataops@cmsgwms-submit8 current]$ $manage db-prompt wmagent
...
MariaDB [wmagent]> select * from wmbs_pnns where pnn='T3_KR_KISTI';
+-----+-------------+
| id  | pnn         |
+-----+-------------+
| 129 | T3_KR_KISTI |
+-----+-------------+
1 row in set (0.000 sec)

Related code:

https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobAccountant/AccountantWorker.py#L810-L815
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobAccountant/AccountantWorker.py#L76
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Locations/GetPNNtoPSNMapping.py
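
From those lines, the failure appears to be roughly as sketched below (a minimal Python sketch; the class, DAO, and attribute names are hypothetical, not the actual WMCore code): the worker caches a PNN-to-PSN map once at construction, so a PNN that has a row in wmbs_pnns but no associated location in wmbs_location_pnns never enters the cached map, and the lookup then fails with this misleading message.

class AccountantWorkerSketch:
    def __init__(self, dao):
        # The PNN-to-PSN map is loaded once, at construction time, via a
        # join over wmbs_location_pnns (cf. GetPNNtoPSNMapping.py). A PNN
        # that has a row in wmbs_pnns but is linked to no location (PSN)
        # never shows up in this cached map.
        self.pnnToPsnMap = dao.getPNNtoPSNMapping()

    def handleWMBSFiles(self, wmbsFiles):
        for wmbsFile in wmbsFiles:
            for pnn in wmbsFile["locations"]:
                if pnn not in self.pnnToPsnMap:
                    # Misleading message: the PNN can exist in wmbs_pnns
                    # while still being absent from the cached map.
                    msg = ("PNN doesn't exist in wmbs_pnns table: "
                           "%s (investigate)" % pnn)
                    raise RuntimeError(msg)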

amaltaro commented 3 months ago

@d-ylee as we are discussing over Mattermost, based on this:

$ grep -rI T3_KR_KISTI /cvmfs/cms.cern.ch/SITECONF/*
/cvmfs/cms.cern.ch/SITECONF/T2_KR_KISTI/JobConfig/site-local-config.xml:        <phedex-node value="T3_KR_KISTI"/>  

I'd suggest adding this PSN-to-PNN map to all the agents. The SQL statement is (tested in MariaDB):

INSERT INTO wmbs_location_pnns (location, pnn) 
  VALUES(
    (SELECT id from wmbs_location where site_name='T2_KR_KISTI'),
    (SELECT id from wmbs_pnns where pnn='T3_KR_KISTI')
  );

For Oracle (CERN agents), we also need to commit those changes for them to take effect (commit;).
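
As a sanity check, one could verify that the association landed with something like the following sketch (the pymysql dependency and the connection parameters are assumptions; adapt them to the agent's actual MariaDB setup):

import pymysql

conn = pymysql.connect(host="localhost", user="wmagent",
                       password="CHANGE_ME", database="wmagent")
try:
    with conn.cursor() as cur:
        # Join the association table back to the human-readable names.
        cur.execute("""
            SELECT wl.site_name, wp.pnn
              FROM wmbs_location_pnns wlp
              JOIN wmbs_location wl ON wl.id = wlp.location
              JOIN wmbs_pnns wp ON wp.id = wlp.pnn
             WHERE wl.site_name = %s AND wp.pnn = %s
        """, ("T2_KR_KISTI", "T3_KR_KISTI"))
        print(cur.fetchall())  # expect exactly one row once the map is in place
finally:
    conn.close()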

d-ylee commented 3 months ago

@amaltaro I have added the PSN-to-PNN map to both the FNAL and the CERN agents. I have also restarted the JobAccountant. So far so good.

amaltaro commented 2 months ago

Given that we have updated this map in all the relevant agents 2 weeks ago, should we now close this out? @d-ylee

Ideally we should have a mechanism to automatically cover such cases, but I don't have any straightforward idea at the moment other than scanning the SITECONF files and building a map out of those (in addition to CRIC), which isn't something I'm very comfortable with... A rough sketch of that idea is below.
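
For reference, a minimal sketch of what such a SITECONF scan could look like (the file layout matches the grep output in this thread; the function itself is hypothetical):

import glob
import xml.etree.ElementTree as ET

def siteconfPnnMap(base="/cvmfs/cms.cern.ch/SITECONF"):
    """Collect the phedex-node values each site declares in SITECONF."""
    mapping = {}
    for slc in glob.glob(f"{base}/*/JobConfig/site-local-config.xml"):
        site = slc.split("/")[-3]  # e.g. T2_KR_KISTI
        try:
            root = ET.parse(slc).getroot()
        except ET.ParseError:
            continue  # skip malformed site configs
        pnns = [node.get("value") for node in root.iter("phedex-node")]
        if pnns:
            mapping[site] = sorted(set(pnns))
    return mapping

# e.g. siteconfPnnMap().get("T2_KR_KISTI") should include "T3_KR_KISTI"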

d-ylee commented 2 months ago

Sounds good. Which one is more authoritative (SITECONF or CRIC)?

amaltaro commented 2 months ago

CRIC is the resource catalog, while SITECONF gives site admins the freedom to implement whatever storage-related configuration they want. Another difference is that CRIC provides RESTful APIs, while SITECONF is only a GitLab(?) repository.

I am closing this out then, but feel free to post any further questions/comments that you might have. Thanks for taking care of this, btw.

amaltaro commented 2 months ago

The same problem happened on the new agents, vocms0282 and vocms0283:

Message: PNN doesn't exist in wmbs_pnns table: T3_KR_KISTI (investigate)

I have just added that mapping with the following command and restarted the JobAccountant:

INSERT INTO wmbs_location_pnns (location, pnn) 
  VALUES(
    (SELECT id from wmbs_location where site_name='T2_KR_KISTI'),
    (SELECT id from wmbs_pnns where pnn='T3_KR_KISTI')
  );

amaltaro commented 2 months ago

Likewise for submit6, map added and component restarted.

amaltaro commented 2 months ago

@stlammel Stephan, can you please clarify what the requirements are - if any - for registering storage elements (PNNs, or RSEs) in SITECONF and CRIC? Shouldn't those two be coupled together?

As you can see in this issue, we happened to stage data out to T3_KR_KISTI, which has no registration in CRIC, nor a map of which CPU resources it is supposed to be associated with (through CRIC). Because of that, we/WMAgent have no way to map that storage element to CPU resources for the upcoming tasks in a workflow.

At this moment, the authoritative data source for CPU and Storage resources - and their map - is CRIC, through this REST API: http://cms-cric.cern.ch/api/cms/site/query/?json&preset=data-processing

Is there any way we can ensure that the phedex-node names used in SITECONF are properly reflected in CRIC? That way we would be able to automatically track this in the agent and avoid such component crashes. A hedged sketch of such a check is below.
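
For illustration, a sketch of checking that mapping via the REST API above (the JSON layout assumed here - PSN keys with an "rses" list - is a guess and would need to be adjusted to the real response):

import requests

CRIC_URL = ("http://cms-cric.cern.ch/api/cms/site/query/"
            "?json&preset=data-processing")

def psnsForPnn(pnn):
    """List the PSNs that claim the given PNN in the CRIC preset."""
    data = requests.get(CRIC_URL, timeout=30).json()
    return sorted(psn for psn, rec in data.items()
                  if pnn in rec.get("rses", []))

# Before a CRIC fix this would be empty for "T3_KR_KISTI"; once the
# storage is mapped it should include "T2_KR_KISTI".
print(psnsForPnn("T3_KR_KISTI"))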

stlammel commented 2 months ago

Hallo Alan, T3_KR_KISTI is a storage-only site. There is nothing wrong with this; we have many such Tier-3 sites. Tier-3 storage is usually not used for production activities, and T3_KR_KISTI dedicates no storage space to central CMS operations. Why are you trying to use T3_KR_KISTI storage? (T3_KR_KISTI is properly defined in CRIC. We don't use CRIC for storage/RSE definitions but SITECONF.) Thanks, cheers, Stephan

amaltaro commented 2 months ago

@stlammel I forgot to point out the following SITECONF configuration:

$ grep -r T3_KR_KISTI /cvmfs/cms.cern.ch/SITECONF/* | grep 'phedex-node'
/cvmfs/cms.cern.ch/SITECONF/T2_KR_KISTI/JobConfig/site-local-config.xml:        <phedex-node value="T3_KR_KISTI"/>  
/cvmfs/cms.cern.ch/SITECONF/T3_KR_KISTI/JobConfig/site-local-config.xml:        <phedex-node value="T3_KR_KISTI"/>  

so we are writing to that storage because T2_KR_KISTI defines T3_KR_KISTI as a fallback stage-out.

Based on that, I assume that T2_KR_KISTI CPU resources are associated with T3_KR_KISTI.

(T3_KR_KISTI is properly defined in CRIC. We don't use CRIC for storage definition/RSE but SITECONF.)

If we want to use it in production, it is not. As mentioned above, it lacks a CPU-to-storage map, and WMAgent can't do magic without that.

stlammel commented 2 months ago

Hallo Alan, ok, so there was a stage-out failure and the fallback stage-out to the T3_KR_KISTI RSE was triggered. You are not guaranteed to have computing resources at that RSE (or any other RSE), right? An RSE means you can transfer data in and out, but not necessarily read data from a worker node. We can ask the site admins if they want the Tier-3 storage linked to the Tier-2, but that only patches over the more fundamental issue that I think we have hit/uncovered here. Thanks, cheers, Stephan

amaltaro commented 2 months ago

Stephan, I might be missing subtle details, but what is the point of having storage defined for central production if you cannot read data from the worker nodes? I would say it is then better not to define it at all.

If we did not have this protection in WMAgent, the files being written to that storage would remain there forever (along with the workflows waiting on them), as we have NO way to tell from which CPU resources we might be able to read that data.

Yes, this seems to be a new scenario. For many years we have had disk-less "sites" in CMS, but CPU-less storage is a new case.

stlammel commented 2 months ago

Hallo Alan, ahhh, having an RSE is a way for Tier-3s to subscribe/transfer data to the site and then use it on the local analysis cluster/machines, and to receive CRAB output. So this is a common Tier-3 usage; we have had them forever. WM may simply not have had any stage-out failover to them before. The right approach, from my point of view, would be to submit a transfer request moving files from an RSE without attached computing to another site, maybe the original site that was supposed to get the data? Thanks, cheers, Stephan

amaltaro commented 2 months ago

That would require design changes in the agent.

Let me suggest an alternative: is there any reason not to declare this T3_KR_KISTI RSE under the T2_KR_KISTI site (CPU resources), such that the change would be reflected in this REST API mapping CPU to RSEs: http://cms-cric.cern.ch/api/cms/site/query/?json&preset=data-processing

Does that have any implications for the site?

stlammel commented 2 months ago

Hallo Alan, as a patch for this occasion we can ask the site admins if they are OK with this. I don't know how many files triggered the failover; it might be easier/faster to just gfal-copy them to T3_KR_KISTI. But you will likely run into this again as soon as there is a stage-out failover to another SE without a CE, which is a perfectly legal configuration. Thanks, cheers, Stephan

amaltaro commented 2 months ago

Yes, if this site-to-RSE map can be added to CRIC, it will save us from problems in central production whenever fallback stage-out kicks in at that T2 site.

I still think we have to find a sustainable solution; hopefully parsing SITECONF or re-transferring data to a different location is not the best we can do.

stlammel commented 2 months ago

Ok, I'll contact/ask the site. - Stephan

stlammel commented 2 months ago

The site replied positively and I have added the PSN-to-PNN mapping. Please double-check and let me know in case anything else is needed. Thanks,

amaltaro commented 2 months ago

Perfect, thanks Stephan! Now, whenever we deploy a new agent, it will come up with this new map, as provided by CRIC: http://cms-cric.cern.ch/api/cms/site/query/?json&preset=data-processing