cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Failing analyses report CPU efficiency higher than 100% #35531

Open mmylly opened 2 years ago

mmylly commented 2 years ago

At the T2_FI_HIP site we see an average CPU efficiency higher than 100% when jobs fail with exit code "8021 - FileReadError". We have observed this using the monit-grafana tool.

cmsbuild commented 2 years ago

A new Issue was created by @mmylly Mikael Myllymäki.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

Could you elaborate a little bit? E.g. are these jobs short or long? Does this observation have any further implications (that may be specific to T2_FI_HIP, or not)? Would you have links to job logs or reports in Grafana?

mmylly commented 2 years ago

We examined this behaviour over the last few months and noticed a clear correlation. We expect these to cover a wide range of different analysis jobs. There have been multiple occurrences even during the past week, as can be seen in Grafana [1]. When zooming into periods of unphysical CPU efficiency, the failure rate is very high and the main exit code is 8021. We are not sure if this is the right place to report this, but we wanted to hear whether this is something that has also been reported earlier.

[1] https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring-es-agg-data-official?from=now-7d&orgId=11&to=now-1h&var-CMSPrimaryDataTier=All&var-CMS_CampaignType=All&var-CMS_JobType=Analysis&var-CMS_SubmissionTool=All&var-CMS_WMTool=All&var-Site=T2_FI_HIP&var-Tier=T2&var-Type=All&var-binning=1h&var-group_by=CMS_CampaignType&refresh=15m

belforte commented 2 years ago

There have been instances in the past where the scripts feeding information to Grafana did not properly account for multi-core usage, leading to >100% CPU efficiency numbers. I thought that was fixed, though, and indeed a correlation with a particular CMSSW exit code is hard to rationalize. I suggest handing the problem over to the CMS Computing Monitoring people and having them dig out a couple of specific jobs which exhibit this behaviour, rather than an aggregated metric, so that the details can be figured out. @leggerf @mrceyhun

leggerf commented 2 years ago

Indeed. Unfortunately Ceyhun is away on sick leave for a few more weeks, and we are seriously understaffed at the moment.

We do have a dashboard that can be used to monitor outliers:

https://cmsdatapop.web.cern.ch/cmsdatapop/cpu_eff/analysis/CPU_Efficiency_Table.html

You're welcome to use it to isolate problematic jobs (go to "Outliers"). If you find issues with the monitoring, please open a ticket in our JIRA:

https://its.cern.ch/jira/projects/CMSMONIT/

with a clear description of the issue, and we can address it.

cheers Federica

mmascher commented 2 years ago

Where is the efficiency in this graph taken from: https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring-es-agg-data-official?from=now-7d&orgId=11&to=now-1h&var-CMSPrimaryDataTier=All&var-CMS_CampaignType=All&var-CMS_JobType=Analysis&var-CMS_SubmissionTool=All&var-CMS_WMTool=All&var-Site=T2_FI_HIP&var-Tier=T2&var-Type=All&var-binning=1h&var-group_by=CMS_CampaignType&refresh=15m&viewPanel=53

Is it data.CpuEff of monit_prod_condor_raw_metric in ES?

If so, it is taken from RemoteSysCpu plus RemoteUserCpu (https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L776-L778), which in turn are HTCondor job ClassAd attributes set by condor: https://htcondor.readthedocs.io/en/v9_0/classad-attributes/job-classad-attributes.html
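For reference, a minimal sketch (not the actual cms-htcondor-es code; the exact formula there may differ) of how a CpuEff-style number can be derived from those ClassAds, and how a wall time reported as zero or too small pushes it past 100%:

```python
# Hedged sketch, not the code from convert_to_json.py. RemoteUserCpu,
# RemoteSysCpu, RemoteWallClockTime and RequestCpus are real HTCondor job
# ClassAd attributes; the formula below is only an illustration.

def cpu_eff(ad: dict) -> float:
    """CPU efficiency in percent for one job ClassAd dictionary."""
    cpu_time = ad.get("RemoteUserCpu", 0.0) + ad.get("RemoteSysCpu", 0.0)
    wall_time = ad.get("RemoteWallClockTime", 0.0)
    cores = ad.get("RequestCpus", 1)
    if wall_time <= 0:
        # A zero or missing wall time is exactly the kind of input that
        # produces unphysical numbers if it is not guarded against.
        return float("nan")
    return 100.0 * cpu_time / (wall_time * cores)

# Example: a single-core job whose wall time is under-reported relative to
# its accumulated CPU time comes out well above 100%.
print(cpu_eff({"RemoteUserCpu": 3000.0, "RemoteSysCpu": 200.0,
               "RemoteWallClockTime": 1600.0, "RequestCpus": 1}))  # 200.0
```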

leggerf commented 2 years ago

the definition is

(data.sum_cpuTimeHr/data.sum_coreHr)*100.

from:

https://monit-kibana-acc.cern.ch/kibana/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-15h,to:now))&_a=(columns:!(_source),index:'3650ccf0-82e8-11ea-88fc-cfaa9841e350',interval:auto,query:(language:kuery,query:''),sort:!(metadata.timestamp,desc))

(It is basically the same as what you reported, but it comes from the aggregated ES data source, where we save a limited number of fields.)
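To illustrate the aggregated definition with made-up numbers: if some failed jobs contribute CPU time to sum_cpuTimeHr but, because of an accounting glitch such as a zero wall time, little or no core hours to sum_coreHr, the bucket-level efficiency can exceed 100% even though no individual job is really that efficient.

```python
# Made-up numbers illustrating the (sum_cpuTimeHr / sum_coreHr) * 100
# definition quoted above.
jobs = [
    {"cpuTimeHr": 2.0, "coreHr": 2.5},  # normal job, ~80% efficient
    {"cpuTimeHr": 1.5, "coreHr": 0.0},  # failed job whose wall time was reported as 0
]

sum_cpuTimeHr = sum(j["cpuTimeHr"] for j in jobs)
sum_coreHr = sum(j["coreHr"] for j in jobs)
print(sum_cpuTimeHr / sum_coreHr * 100)  # 140.0 -> aggregate "efficiency" above 100%
```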

mmylly commented 2 years ago

Thank you for the comments. We managed to track down some individual jobs where this was experienced and will now investigate further. We will also try the dashboard you recommended and report back if we find the cause.

mmylly commented 2 years ago

We performed a deeper investigation using HammerCloud test jobs. We compared the cpuEff values reported in Grafana with the ones obtained from Slurm ("sacct -j $slurmid") and ARC6 ("arcctl accounting job info $arcid"), and found that there is no clear correspondence.
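For the Slurm side of that comparison, a minimal sketch of how the efficiency can be recomputed locally from sacct output (assuming the standard TotalCPU, ElapsedRaw and AllocCPUS fields; the job id is a placeholder), so it can be put next to the value shown in Grafana:

```python
import subprocess

def slurm_cpu_eff(slurmid: str) -> float:
    """Recompute CPU efficiency (in percent) for one Slurm job from sacct."""
    out = subprocess.check_output(
        ["sacct", "-j", slurmid, "-n", "-P",
         "--format=TotalCPU,ElapsedRaw,AllocCPUS"],
        text=True,
    )
    # Take the first (job-level) record; .batch/.extern steps follow it.
    total_cpu, elapsed_raw, alloc_cpus = out.splitlines()[0].split("|")
    cpu_s = _to_seconds(total_cpu)
    wall_core_s = int(elapsed_raw) * int(alloc_cpus)
    return 100.0 * cpu_s / wall_core_s if wall_core_s else float("nan")

def _to_seconds(value: str) -> float:
    # sacct prints TotalCPU as [DD-][HH:]MM:SS(.mmm)
    days, _, rest = value.partition("-") if "-" in value else ("0", "", value)
    parts = [float(p) for p in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds
```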

However, we found that on multiple occasions the WallTime value reported by ARC was zero even when Slurm reported a sensible value. This would make it possible to obtain cpuEff values larger than 100%.

We are going to report this to the right people.