dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

RucioInjector component can become extremely slow #11356

Open amaltaro opened 1 year ago

amaltaro commented 1 year ago

Impact of the bug: WMAgent

Describe the bug: Every now and then we see performance issues with the RucioInjector components, where each interaction with the Rucio server - via HTTP REST APIs - can take a few seconds (sometimes a few tens of seconds). This causes production workflows to sit in the "file mismatch" Unified category until the agents can successfully work through that backlog and insert data into Rucio.

This morning we spotted an elapsed time of 24 seconds to insert 1 file into a block, basically one iteration of this block of code: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/RucioInjector/RucioInjectorPoller.py#L291-L306

So we need to rule out any possible delays coming from self._updateLFNState(), which is the agent Oracle row update.
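For illustration, a minimal timing sketch (not the actual component code) that would let us separate the Rucio call from the Oracle update inside that loop:

import logging
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Log how long the wrapped block took; illustration only."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logging.info("%s took %.3f secs", label, time.perf_counter() - start)

# Inside the injection loop, the two suspects could then be timed separately, e.g.:
#     with timed("rucio.createReplicas for block %s" % block):
#         inserted = self.rucio.createReplicas(rse=rseName, files=injectData, block=block)
#     if inserted:
#         with timed("oracle _updateLFNState"):
#             self._updateLFNState(listLfns)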

How to reproduce it: Not clear (and right now it only affects the WMAgents at CERN, not at FNAL).

Expected behavior: The first objective of this issue is to identify the root cause of this slowness; then, based on that, there might be many different options for how to tackle it, for instance:

Additional context and error message: Right now, vocms0252 and vocms0253 are having a hard time completing a cycle. For the record, the last completed cycle in vocms0252 was at (and this is the total time it took to complete the cycle):

2022-10-25 02:48:55,043:140231764350720:INFO:BaseWorkerThread:RucioInjectorPoller took 334480.844 secs to execute

From this morning, when we were watching the RucioInjector log (note the constant ~24 secs per file injection):

2022-10-31 14:54:34,673:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-800_MY-300_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL18RECO-106X_upgrade2018_realistic_v11_L1v1-v2/AODSIM#a6e93581-f87a-45d4-b17a-63c972328fb4
2022-10-31 14:54:58,714:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2Tau2B_MX-2400_MY-70_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM#9670c832-93d9-409c-8b11-b5aad6972a6d
2022-10-31 14:55:22,774:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2B2Tau_MX-2600_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM#c32fe6f8-b4e9-4c57-8415-201e3c41a559
2022-10-31 14:55:49,811:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /ZH_HToZZTo2Nu2X_2LFilter_4LVeto_M600_TuneCP5_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM#2f1e0493-fb69-4270-9956-909f17fc7bb0
2022-10-31 14:56:13,935:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2B2Tau_MX-700_MY-500_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16RECOAPV-106X_mcRun2_asymptotic_preVFP_v8-v2/AODSIM#fca3e101-2d70-46d9-939f-64c68b239f94
2022-10-31 14:56:37,885:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-2800_MY-500_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM#3969eba0-206e-4015-b6ce-581a220d1218
2022-10-31 14:57:01,949:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-2400_MY-1600_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v2/AODSIM#f05b78f6-929f-46e5-af61-f247c961a93d
2022-10-31 14:57:26,014:140231764350720:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /NMSSM_XToYHTo2B2Tau_MX-4000_MY-125_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM#7242ea99-7e95-429c-8ef2-aae30d041888

amaltaro commented 1 year ago

I am debugging the file insertion step in vocms0252 and I should have some more information on this in the coming hour or so.

vkuznet commented 1 year ago

@ericvaandering, it would be useful to have your input on this issue. In particular, I'm interested in the following information:

amaltaro commented 1 year ago

PS.: I updated the initial description with a log snippet from this morning (timestamp is CERN time).

ericvaandering commented 1 year ago

@ericvaandering, it would be useful to have your input on this issue. In particular, I'm interested in the following information:

  • the throughput stats from the Rucio REST APIs, e.g. how many requests per second it can sustain

We've never tried to measure this and of course it very much depends on what the request is.

  • any average stats, e.g. the mean/median time Rucio APIs take to process any given block/file

I thought we had metrics which were on the level of the URI and the GET/PUT/DELETE method. But I'm not finding them at the moment. Maybe they were lost in the transition to flask. I've asked.

  • a Timber-like dashboard, e.g. see the CMSWEB Timber one, which can provide the current state of the Rucio server load and a breakdown of the current clients by API, IPs, DNs, etc., such that we can correlate the activity of WMAgents and the Rucio instances.

We run fluentd on the Rucio system, so I presume the logs are somewhere and someone from the monitoring group could easily build the parsers to make such a dashboard.

To answer a question in your mail, we also run kube-eagle, so you should be able to see if there is a correlation between what you see and the pod or node loading. The pods in question are server-rucio-server-[hash]-[hash] and the nodes begin with cmsrucioprod1.

vkuznet commented 1 year ago

Eric, the kube-eagle plots do not give any useful information about individual APIs or requests. They only show the state of the pod/node, that's it. The pod may look fine or not, while API time/CPU/RAM usage can deviate depending on the load. The fluentd logs, again, are something you control, and anyone outside Rucio has no idea about their format. Therefore, it should be you/the Rucio team who processes them and pushes the appropriate metrics to MONIT/ES. For instance, that is what we do with CMSWEB, i.e. we process the cmsweb logs and push these metrics to MONIT. Then the Monitoring team can build an appropriate Timber dashboard out of them. I suggest that you take the CMSWEB Timber dashboard as a reference and try to extract similar information from your Rucio (fluentd) logs and push it to MONIT. @mrceyhun can provide the current schema of metrics we extract from CMSWEB. Since Rucio is not part of CMSWEB, we do not have these metrics by default.

It would be desirable if you could open such a request and provide the necessary metrics per API. For instance, since your service processes the requests, you can easily dump each request time to a log or to Prometheus, along with CPU/RAM snapshots. This can be done with the psutil python module. In particular, I provided ProcessStats for WMCore services, which can capture useful stats at the thread level. I hope you can investigate and request such monitoring metrics from the Rucio team. It would tremendously help to understand the current state of the Rucio servers and would be especially useful during debugging. I understand that it may take time to implement, but we need a clear idea of what each service is doing at any given point in time to pinpoint its issues.
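For illustration only, a minimal sketch of such a per-request snapshot based on psutil (the decorator and its log format are hypothetical, not an existing Rucio or WMCore API):

import functools
import logging
import time

import psutil

def logRequestStats(func):
    """Hypothetical decorator: log wall time, CPU time and memory around a request handler."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        proc = psutil.Process()
        cpuBefore = proc.cpu_times()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wall = time.perf_counter() - start
            cpuAfter = proc.cpu_times()
            cpuUsed = (cpuAfter.user - cpuBefore.user) + (cpuAfter.system - cpuBefore.system)
            logging.info("%s: wall=%.3fs cpu=%.3fs rss=%.1fMB", func.__name__, wall,
                         cpuUsed, proc.memory_info().rss / 1024 ** 2)
    return wrapper

Such a decorator, applied to each REST handler, would already provide the raw material for per-API latency breakdowns in a log file or in Prometheus.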

ericvaandering commented 1 year ago

The rucio server logs are standard Apache logs or something very similar so it’s not efficient for me to restart this from the ground up when the parser already exists.

And if this problem happens with CERN agents and not FNAL, the problem is likely in the agent or oracle, not the common element Rucio.

So I can put in a request for better monitoring from the rucio team but nothing is likely to happen until we supply the resources to do it.

amaltaro commented 1 year ago

And if this problem happens with CERN agents and not FNAL, the problem is likely in the agent or oracle, not the common element Rucio.

I initially thought it was expected, given that CERN agents acquire and run more work than the FNAL ones. However, looking at submit6, the time difference is too large. The RucioInjector cycle takes around 200 secs in the FNAL agent, while at CERN it hasn't yet finished since my component restart yesterday(!), and the previous cycle took days to finish.

We might have to reach out to Kate to understand why it's behaving like this. @vkuznet are you able to follow this up with her (cc me please)?

vkuznet commented 1 year ago

@amaltaro, I doubt I have enough information to reach out to Kate. I'm not sure I clearly understand the problem. If you think it is Oracle, we must provide Kate with specific pointers, which should include:

mrceyhun commented 1 year ago

Hi all, as a reply to Valentin's comment above, you can see the schema of the cmsweb frontend services [here](https://monit-timber-cmsweb.cern.ch/kibana/app/kibana#/doc/9022b990-f44f-11ea-8dc5-2bfc79fa851b/monit_private_cmswebk8s_logs_frontend-2022-11-01/_doc?id=c86cc52b-b6bb-b465-b673-729fae8fa87d&_g=()). As you know, the cmsweb frontend services write their logs to files; Filebeat fetches the logs from these files and sends them to Logstash. Logstash parses the logs, applies the logic, fits them to the defined schema and then sends them to MONIT Timber. In short, the pipeline is: cmsweb services -> /.../logs/frontend/access_log_* -> Filebeat -> Logstash -> MONIT Timber.

As Valentin said, we don't have to follow this pipeline if there are better ways to provide the logs. There are many solutions for propagating data to MONIT, with or without Logstash. If any help is needed on the monitoring side, feel free to reach out to us.

amaltaro commented 1 year ago

Just another update on this.

I made an in-place update to the RucioInjector code with a couple of debugging lines, and after another restart of the component in vocms0252, I see that the whole cycle completed in a much shorter time (timestamp local to CERN):

2022-11-01 22:30:50,073:140359888467712:INFO:BaseWorkerThread:RucioInjectorPoller took 11237.486 secs to execute

and since then, cycles have been pretty quick (magically):

2022-11-01 22:37:45,828:140359888467712:INFO:BaseWorkerThread:RucioInjectorPoller took 115.395 secs to execute
2022-11-01 22:42:46,836:140359888467712:INFO:BaseWorkerThread:RucioInjectorPoller took 0.656 secs to execute
2022-11-01 22:47:47,824:140359888467712:INFO:BaseWorkerThread:RucioInjectorPoller took 0.634 secs to execute

About vocms0253, it's yet to complete the cycle started yesterday.

todor-ivanov commented 1 year ago

@amaltaro can you please paste a snippet or a patch with the lines you've changed? Thanks.

amaltaro commented 1 year ago

Sure, here it is:

cmst1@vocms0252:/data/srv/wmagent/current $ diff -rup apps/wmagentpy3/lib/python3.8/site-packages/WMComponent/RucioInjector/RucioInjectorPoller.py{.bkp,}
--- apps/wmagentpy3/lib/python3.8/site-packages/WMComponent/RucioInjector/RucioInjectorPoller.py.bkp    2022-10-31 16:14:47.450548013 +0100
+++ apps/wmagentpy3/lib/python3.8/site-packages/WMComponent/RucioInjector/RucioInjectorPoller.py    2022-11-02 02:07:09.237512294 +0100
@@ -301,6 +301,7 @@ class RucioInjectorPoller(BaseWorkerThre
                                                bytes=fileInfo['size'], state="A",
                                                adler32=fileInfo['checksum']['adler32']))

+                    logging.info("AMR inserting %d files on block %s", len(listLfns), block)
                     if self.rucio.createReplicas(rse=rseName, files=injectData, block=block):
                         logging.info("Successfully inserted %d files on block %s", len(listLfns), block)
                         self._updateLFNState(listLfns)
@@ -346,6 +347,7 @@ class RucioInjectorPoller(BaseWorkerThre
                 for block in migratedBlocks[location][container]:
                     logging.info("Closing block: %s", block)
                     if self.rucio.closeBlockContainer(block):
+                        logging.info("AMR setting block status as 'Closed' in dbsbuffer for %s", block)
                         self.setBlockClosed.execute(block)
                     else:
                         logging.error("Failed to close block: %s. Will retry again later.", block)

The component is behaving well though, so I still cannot isolate potential problems.

todor-ivanov commented 1 year ago

Thanks @amaltaro! Oh, it was just the addition of two log messages. I was under the impression you had found the magic spell that could actually solve the problem. Sorry for my misunderstanding. It is indeed strange that the lag is gone after such a minor intervention; it shouldn't be related.

amaltaro commented 1 year ago

And now with a better log in place (vocms0252), we can clearly see that the Oracle operation takes only a few milliseconds, while the Rucio python client close API takes almost 10 seconds to complete:

2022-11-07 19:27:30,950:140198058891008:INFO:RucioInjectorPoller:Closing block: /WplusH_HTo2LongLivedTo4G_MH-125_MFF-50_ctau-100cm_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#381532ad-6af7-4bcd-964c-04f4162c621c
2022-11-07 19:27:39,963:140198058891008:INFO:RucioInjectorPoller:AMR setting block status as 'Closed' in dbsbuffer for /WplusH_HTo2LongLivedTo4G_MH-125_MFF-50_ctau-100cm_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#381532ad-6af7-4bcd-964c-04f4162c621c
2022-11-07 19:27:39,966:140198058891008:INFO:RucioInjectorPoller:Closing block: /WplusH_HTo2LongLivedTo4G_MH-125_MFF-50_ctau-10cm_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#104a187a-fa8b-44ef-bfc4-d1322e3faec7
2022-11-07 19:27:48,967:140198058891008:INFO:RucioInjectorPoller:AMR setting block status as 'Closed' in dbsbuffer for /WplusH_HTo2LongLivedTo4G_MH-125_MFF-50_ctau-10cm_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#104a187a-fa8b-44ef-bfc4-d1322e3faec7
2022-11-07 19:27:48,970:140198058891008:INFO:RucioInjectorPoller:Closing block: /WminusH_HTo2LongLivedTo4G_MH-125_MFF-7_ctau-20cm_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#6d1e0b8a-f904-470c-95d0-cd026b227754

This is the actual call: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L430

@ericvaandering would you be able to pull any useful information from the server logs for these queries? From the rucio-clients perspective, these are a few examples:

client.close("cms", "/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#104a187a-fa8b-44ef-bfc4-d1322e3faec7")
or
client.close("cms", "/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#381532ad-6af7-4bcd-964c-04f4162c621c")

Timestamps above are CERN local time.

If you prefer to discuss this over slack, just ping me over there.

haozturk commented 1 year ago

Hi @ericvaandering @amaltaro, do you have any updates on this issue? To provide some context from the ops perspective, I'm attaching a screenshot which shows the number of workflows whose announcements have been delayed due to this slowness over the last 6 months. As you can see, it got considerably worse in the last 2 months.

[screenshot: number of workflows with delayed announcements over the last 6 months]

amaltaro commented 1 year ago

@ericvaandering we ran a few tests in the past week and AFAIR you managed to spot some of those slow queries in the log, right? Would you have any other suggestions on how we can investigate it?

We upgraded WMAgents at the beginning of this week and I have already spotted a few RucioInjector heartbeat timeouts in the new agents. None in the FNAL agents though, so I am still puzzled!

ericvaandering commented 1 year ago

Sorry, I don't have any new insights on this. You are using the Rucio client for this, right? All I can think of is that there is something going on with the certificates, but even that makes limited sense, because the request to the Rucio server is not authenticated by certificate, just by the Rucio token.

amaltaro commented 1 year ago

After tweaking the nproc kernel limit (max user processes) on both vocms0281 and vocms0282, RucioInjector seems to be behaving properly. This would also explain why we only see it at the CERN agents, currently set to a 4096 soft nproc limit, while the FNAL agents have 16384.

JIRA ticket to puppetize these changes has been created: https://its.cern.ch/jira/browse/CMSVOC-413
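For reference, the limit actually seen by a running component process can be checked with a quick sketch like this (standard Python resource module; equivalent to ulimit -u / ulimit -Hu in a shell):

import resource

# Soft and hard limits on the number of processes/threads for this user,
# as seen by the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("nproc soft limit:", soft)
print("nproc hard limit:", hard)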

todor-ivanov commented 1 year ago

Hi @amaltaro, I do not want to kill people's enthusiasm here, but I think the RucioInjector at vocms0281 is again performing badly:

2022-12-13 21:32:20,460:140047150073600:INFO:RucioInjectorPoller:Block rule created for block: /ZHToTauTauUncorrelatedDecay_Filtered_M125_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#1dea5000-3e0c-410b-bc0e-9a94c5e662ce, at: T1_US_FNAL_Disk, with rule id: b9da2db20c9a450db63e90bb2ed94063
2022-12-13 21:32:38,460:140047150073600:INFO:RucioInjectorPoller:Block rule created for block: /ZHToTauTauUncorrelatedDecay_Filtered_M125_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#e436741b-74f8-44e6-9d1e-019ed1c782a9, at: T2_US_Purdue, with rule id: 23b2fbd815cf473eb19b4088e43eb19a
2022-12-13 21:32:38,505:140047150073600:INFO:RucioInjectorPoller:Checking if there are block rules to be deleted...
2022-12-13 21:32:38,563:140047150073600:INFO:RucioInjectorPoller:No candidate blocks found for rule deletion.
2022-12-13 21:32:38,568:140047150073600:INFO:BaseWorkerThread:RucioInjectorPoller took 57540.163 secs to execute

Again, the component cycle is huge: 57540.163 secs, which is about 16 hours. And a single block rule creation can take around 20 seconds, sometimes even 50 or more.

amaltaro commented 1 year ago

That is not good! Thanks for the heads-up, Todor. Let me be optimistic here and assume that this happened while the VoC was puppetizing these changes (which apparently ran into issues during that process). I just restarted all the components to ensure that the expected limits are in place.

Please let me know if you spot further issues with this node in the coming days.

amaltaro commented 1 year ago

Changes to the nproc limit have been applied both at FNAL nodes (set to 65k, soft) and at CERN (set to 52k, soft). RucioInjector has been behaving fine since then.

I am closing this issue out, but if it pops up again in the coming days/weeks, we should reopen it.

todor-ivanov commented 1 year ago

I am sorry to say it, but I need to reopen this issue yet again:

vocms0281

2023-01-20 05:30:17,329:140193328658176:INFO:BaseWorkerThread:RucioInjectorPoller took 30949.271 secs to execute

While at the same time, the polling cycle latency for the same component at a FNAL agent is:

2023-01-20 08:54:29,179:140090104010496:INFO:BaseWorkerThread:RucioInjectorPoller took 237.586 secs to execute

amaltaro commented 1 year ago

Thanks Todor. I don't plan to work on it in the coming days, so I just removed myself as assignee, such that it's free for whoever has a free slot.

todor-ivanov commented 1 year ago

Today I was looking at the system, debugging some different issues, and I had to look at a fairly new agent: vocms0256. Since this was a fresh one, I noticed in the RucioInjector logs a gradual decline in the component's performance, strongly dependent on the number of write operations the component performs against Rucio. A single container- or block-level rule creation takes between 3 and 6 seconds. Here are a few examples:

2023-01-24 00:39:26,740:140216297350912:INFO:RucioInjectorPoller:Creating container rule for /NMSSM_XToYHTo6B_MX-700_MY-400_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM against RSE (tier=2|tier=1)&cms_type=real&rse_type=DISK
2023-01-24 00:39:30,117:140216297350912:INFO:RucioInjectorPoller:Container rule created for /NMSSM_XToYHTo6B_MX-700_MY-400_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM under rule id: ['88e642e6e77641eca0d4f0aab6c1c302']

In contrast to the FNAL agents, where those operations take much less than a second:

2022-11-16 01:56:14,189:139649909565184:INFO:RucioInjectorPoller:Creating container rule for /NMSSM_XToYHTo2B2WTo2B4Q_MX-2800_MY-450_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v1/NANOAODSIM against RSE (tier=2|tier=1)&cms_type=real&rse_type=DISK
2022-11-16 01:56:14,570:139649909565184:INFO:RucioInjectorPoller:Container rule created for /NMSSM_XToYHTo2B2WTo2B4Q_MX-2800_MY-450_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v1/NANOAODSIM under rule id: ['638401dc2f6e46a89ac9f2e81981b5cd']

I have not tried to plot the dependency, but at first glance it seems to scale linearly with the number of write operations we have piled up in the component. And the more the component lags, the more of those pile up... we are in a kind of downward spiral here.

FYI @amaltaro @vkuznet @khurtado

vkuznet commented 1 year ago

Todor, it may be a good observation, but it certainly does not provide further insight into the issue. I would rather see a manual call to Rucio to inject some info and measure how long it takes. In other words, is it possible to write a stand-alone script or a small piece of python code which makes a call to Rucio in the same way RucioInjectorPoller does? This way we can measure the time spent on the Rucio call and run it on different nodes, e.g. on a CERN VM and on a FNAL one.
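Something along these lines, for instance (a hypothetical sketch; the block DID below is a placeholder, and the client picks up its connection settings from the local rucio.cfg):

import time

from rucio.client import Client

# Time a single close() call, issued the same way RucioInjectorPoller does it.
# Replace blockName with a real open block before running.
blockName = "/SomePrimaryDataset/SomeProcessedDataset/TIER#some-block-uuid"

client = Client()
start = time.perf_counter()
client.close("cms", blockName)
print("client.close() took %.3f secs" % (time.perf_counter() - start))

Running the same script from a CERN VM and from a FNAL node, against the same Rucio instance, should quickly show whether the latency is node-dependent.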

todor-ivanov commented 1 year ago

Hi @vkuznet, I'll get back to it soon. We are currently debugging a separate problem related to the lack of pressure in the system. Indeed, a manual call to Rucio was planned as my next step as well.

amaltaro commented 1 year ago

The previous in-place change that I had made in vocms0256 got reverted by Puppet. So here is a request to temporarily disable it in the CERN agents, such that we can finally confirm whether the limits.d settings have any effect on this or not: https://its.cern.ch/jira/browse/CMSVOC-430

amaltaro commented 1 year ago

Now that we have vocms0255 back in a functional state (after rebuilding the machine from scratch), and with Puppet disabled on that node, here is my first change to the /etc/security/limits.d/20-nproc.conf file:

From

cmst1      soft    nproc     52000

to:

cmst1      soft    nproc     unlimited

amaltaro commented 1 year ago

Due to the inability to reproduce the RucioInjector heartbeat timeouts over the last 2 or 3 weeks (all the agents are working just fine!), I am going to close this issue once again. I do not have any explanation as to why the heartbeat timeouts stopped, though. If it reoccurs, we should reopen this ticket.

amaltaro commented 1 year ago

After seeing the CERN-agent slowness once again starting last week (June 19), we see the duty cycle of RucioInjector increasing again; it takes a couple of days before it starts hitting the heartbeat timeout. Example from vocms0281:

2023-06-25 14:44:14,198:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 900.566 secs to execute
2023-06-25 15:00:02,999:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 648.447 secs to execute
2023-06-25 15:13:39,720:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 516.358 secs to execute
2023-06-25 15:25:18,454:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 398.350 secs to execute
2023-06-25 15:42:13,180:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 714.368 secs to execute
2023-06-25 16:00:42,023:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 808.490 secs to execute
2023-06-25 16:21:16,035:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 933.625 secs to execute
2023-06-25 17:01:54,546:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 2138.160 secs to execute
2023-06-25 17:24:22,508:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 1047.598 secs to execute
2023-06-25 18:07:57,840:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 2314.973 secs to execute
2023-06-25 18:55:00,909:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 2522.699 secs to execute
2023-06-25 19:38:25,979:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 2304.673 secs to execute
2023-06-25 20:41:58,750:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 3512.416 secs to execute
2023-06-25 22:16:22,430:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 5363.300 secs to execute
2023-06-26 01:30:49,941:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 11366.883 secs to execute
2023-06-26 07:50:12,184:139813330302720:INFO:BaseWorkerThread:RucioInjectorPoller took 22461.849 secs to execute

Note that the FNAL agents are again unaffected and their duty cycle is normal and within the heartbeat timeout.

I am reopening this issue such that we can reconsider it for Q3 and hopefully identify and resolve it for good.