Review and improve CMSSW I/O metric reporting

makortel commented 1 year ago

Impact of the bug

CMSSW I/O metric reporting is inaccurate.

Describe the bug

(followup from https://github.com/dmwm/WMCore/pull/11533#issuecomment-1500437122)

The I/O metrics extract from the framework job report in https://github.com/dmwm/WMCore/blob/431f08f70f14733830e546d0ac4b1adc04fc912c/src/python/WMCore/FwkJobReport/XMLParser.py#L446-L462 need to be reviewed, taking into account e.g.

exact meaning of each Timing-* number
what other numbers (that are reported further) are derived from those
why some quantities are read for some protocol (readMethod), and some from tstoragefile

In addition, the readMethod used in some of the quantities seems to not be well defined as the loop https://github.com/dmwm/WMCore/blob/431f08f70f14733830e546d0ac4b1adc04fc912c/src/python/WMCore/FwkJobReport/XMLParser.py#L427-L432 picks the first protocol it sees that had non-zero number of operations. For a job reading a local file and a remote file, like pileup mixing in step chain the local file protocol comes before the remote root protocol in the XML (and more generally these numbers report statistics from any files opened by cmsRun, not just the event data).

How to reproduce it

N/A

Expected behavior

The quantities reported onwards by WMCore have a well-defined and documented meaning.

Additional context and error message

How much do these performance metrics matter for WMCore itself, or does it just treat them as "black box" defined by someone else?

amaltaro commented 1 year ago

@makortel thank you for creating this issue, Matti. I think this code - or part of it - is likely more than 10 years old.

I would say that WMAgent is mostly propagating these FJR metrics upstream, so we would need to hear back from you/Core CMSSW team the following:

what are the relevant I/O metrics that we need to collect from the FJR (these will get published to at least WMArchive)
we might want to review the CPU performance metrics as well (we might want to decide to have a specific issue for that)
are those documented anywhere?
can we keep renaming some of those (like converting bytes to MB, and so on)

this way, we can agree and ensure that the relevant FJR metrics are consumed and published upstream, including a documentation of Core Framework + WMAgent metrics.

vkuznet commented 1 year ago

@makortel , I'll start working on this issue and to start I need an example of FWJR coming from CMSSW along with questions Alan posted in his comment. Please provide this info at your convenience.

makortel commented 1 year ago

@vkuznet @amaltaro

At this time I'm not able to give full answer to the questions of Alan in https://github.com/dmwm/WMCore/issues/11538#issuecomment-1502569727

I attached an example job report (from CMSSW_13_1_X_2023-04-24-1100 IB, workflow 11834.0 step 3) here JobReport3.xml.txt

what are the relevant I/O metrics that we need to collect from the FJR (these will get published to at least WMArchive)

At this time I don't have a good answer. We'd need to think about these out in core software as well (may well happen only after CHEP).

we might want to review the CPU performance metrics as well (we might want to decide to have a specific issue for that)

are those documented anywhere?

With "those" do you refer to the I/O metrics or CPU performance metrics? The I/O metrics are, in principle, documented in https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideStorageStats, but given its age I think we (core) should refresh our memory (even if the storage system has not evolved much since that time)

can we keep renaming some of those (like converting bytes to MB, and so on)

I don't object as long as the unit is clear.

vkuznet commented 1 year ago

@amaltaro , @makortel , I used Matti's xml file and it provides the following Timing- metrics:

Timing-file metrics related to IO on the input LFN
Timing-root metrics related to IO of root operations over LFN
Timing-tstoragefile metrics related to IO storage file operations

The WMCore XMLParser will take first non-zero metric and use it in construction of FWJR report JSON. In this specific case it picked Timing-file-read-numOperations metric. In my view all of these metrics should be reported to upstream, e.g. WMArchive, since none of us, and neither Core software or WMCore groups, have a-prior knowledge how these metrics should be used. We may guess and pick one or few combinations but it will simply cut off entire picture.

Therefore, my suggestion would be to keep current code intact for backward compatibility, but add new sections to FWJR like:

{"file-io": {...}, "root-io": {...}, "storage-io": {....}, ...}

Doing this way we will

preserve existing upstream clients/dashboards which relies on current FWJR structure, and
we'll enhance FWJR with specific IO metrics which we may used in a future studies, e.g. dashboard. The new categories will clearly define IO operations and remove ambiguity of their interpretation.

If you agree with such proposal I can make necessary changes to the code to introduce these groups. We'll also need to adjust WMArchive schema (part of WMCore) and add them on a MONIT side to preserve their appearance in ES/HDFS.

vkuznet commented 1 year ago

With this PR https://github.com/dmwm/WMCore/pull/11575 I provide additional IO and Site metrics to FWJR. Please check full report here.

      "performance": {
        "storage": {...},
        "memory": {...},
        "cpu": {...},
        "multicore": {},
        "io": {...},
        "site": {...},
      }

So far, I put in flat structure the IO and site since usually in MONIT we prefer flat structure. The IO part contains all Timing-XXX metrics, while site part contains XrdSiteStatistics metrics.

Let me know your feedback about it.

makortel commented 1 year ago

Do I understand correctly that https://github.com/dmwm/WMCore/pull/11575 just forwards all the performance numbers from the job report XML file to the JSON file?

I suppose this approach would mean that any derived quantities (that are currently calculated in https://github.com/dmwm/WMCore/blob/431f08f70f14733830e546d0ac4b1adc04fc912c/src/python/WMCore/FwkJobReport/XMLParser.py#L446-L462) would have to be defined in the monitoring layer, did I get that right?

I mean, in general we could provide such "summary metrics"

by the framework
by the WM (as done currently)
by the monitoring

The question of exact meaning of each I/O metric comes back to the Core software, as, at least currently, does also the question of what derived quantities are useful to identify and investigate CPU efficiency problems. That would favor the putting the logic to the framework level, so everyone else would see well-defined relatively easy-to-understand numbers. On the other hand, this would also be the least flexible option towards changes, as we can't change existing releases.

Calculating the derived quantities in the monitoring layer would probably be the most flexible. On the other hand, it is also farthest away from where the "raw quantities" are defined, and whenever we'd change certain parts of the framework, we'd also need to update the monitoring code accordingly, and in a way that the monitoring code can digest information from different CMSSW versions.

The current WM option is then somewhere in between.

We'd still need to do the "review" part of this issue, and I'm a bit hesitant for making changes before we have gained better understanding of the big picture of the I/O metrics.

vkuznet commented 1 year ago

Matti, your understanding is correct, but as I wrote to preserve backward compatibility I think we should preserve current WM logic. Said that, in my view, WM is a wrong layer to do the aggregation for many reason:

it consumes metrics from CMSSW
it puts its own logic to the aggregation which may or may not be desired at monitoring layer (dashboards)
it adds another layer of complexity to metrics which are supplied from upstream code, e.g. CMSSW

When I was in charge of CMS Monitoring team I always advocate for transparent storage of metrics since it is only end-user can tell what they want. Therefore, in my view you should clearly define logic/metrics at CMSSW layer, the WM should serve as a proxy, or if it does perform some other actions within WM should add its own metrics to report, and metrics should be shipped to MONIT "as is". Moreover, it is desired to keep metrics with specific prefix to have clear definition of them, e.g. if specific metrics are defined within CMSSW, let it be CMSSW_XXX, while if metric is calculated in WM let it be WM_XXX. Doing this way, we will have clear definition of metrics and let CMS Monitoring team provide appropriate dashboard for end-users based on their specific use-case(s). Said that, if CMSSW wants to change some metrics it should also follow backward compatibility path, i.e. and instead of changing existing metrics it should add whatever derivatives we may have. The metrics storage is cheap and we can easily extend existing schema.

The above approach is very successful in other layers like various exporters within Prometheus where we may have different providers of similar metrics, e.g. system metrics can be provided by different exporters based on /proc file system or based on application based ones. Having both provide much more reach representation of the monitoring layer, and aggregation can be possible across multiple dimension without affecting underlying code-base.

amaltaro commented 1 year ago

@makortel Hi Matti, apologies for not following this ticket very closely, but did we discuss what is the preferred way to deal with these CMSSW statistic metrics? Is it: a) collect everything that is provided by CMSSW FJR and push it upstream to CMS Monitoring; or b) do we want to identify what are the metrics that we consider really relevant and only collect and propagate those ?

With those IO and Site new metrics implemented in the PR linked by Valentin, should we make similar conversions as it is already done for other metrics (e.g., converting from bytes to mega-bytes, and so on)?

vkuznet commented 1 year ago

@amaltaro , I would strongly oppose to convert bytes to M/G/T bytes as it only confuses things at different level. The metrics should be in bare units without any conversions. The MONIT dashboards provides extensive conversion capabilities to make appropriate units on dashboards and if we'll make such conversion it will be difficult to maintain dashboards in a long run (when units may change and jump or downscale by order of magnitude).

vkuznet commented 1 year ago

@makortel Matti, could you please provide us update on this ticket as we need to decide how to move forward. If for whatever reason you do not have time we'll be happy to move this ticket into Q3 quarter.

makortel commented 1 year ago

Does the MONIT infrastructure allow us to define (and maintain) the higher-level metrics such that they would be easy to use in the various dashboards without everyone having to derive the high-level metrics themselves? (and e.g. if we identify a bug in some of the high-level metric definition, we can fix it only once and the fix gets automatically applied for everyone who is using it?)

vkuznet commented 1 year ago

@mrceyhun , @nikodemas , @leggerf , @brij01 could you please answer Matti's question about MONIT. I do know that Prometheus setup allows custom aggregated metrics, and we have few of those for DBS, but this is a new use case which we discuss. In my view we should stored unaggregated metrics to MONIT(OpenSearch/ES) while later build dashboard for those we need. Matti questions is slightly different, does OpenSearch/ES allows to define aggregated metrics from specific index and if so can we do that on our own (CMS side). If it does I'll assume the following data flow:

WMAgent -> CMSSW (job) -> WMAgent -> WMarchive -> MONIT -> ES/HDFS -> aggregation -> ES/HDFS

mrceyhun commented 1 year ago

Hi @vkuznet, by default, OpenSearch does not do aggregation of an index and store the results in different index. We need another service to do it.

The aggregation and then storing aggregated data in different index(agg) in MONIT -> ES/HDFS -> aggregation -> ES/HDFS is possible in 2 ways:

by consuming Messaging team's topic that we use while sending data to MONIT: consume the raw data from this topic, aggregate and send results to MONIT aggregated index.
by consuming MONIT Kafka queue, which has 3 days of retention policy, with Flume service, like we do incCondor Job Monitoring: raw data is aggregated hourly on MONIT side and stored in long-term OpenSearch index. MONIT recently informed us that they can provide access to their Kafka, however, they have lack of resources to deploy aggregation service(Flume) in their infrastructure, so it should be deployed in CMS infra. In our experience, if schema does not change, it is a very stable service.

In any case, we need an additional service to aggregate data.

My general comment for WMArchive data is that we should get rid of nested JSON schema. It makes difficult to create queries in Grafana and OpenSearch. Just because of this difficulty in WMArchive documents, we developed a new service https://cms-monitoring.cern.ch/web/cpueff which does aggregations in Spark. IMHO, if we have a chance to improve the schema, it will solve many of the problems and will limit the need of additional services to provide required monitoring.

Secondly, in monitoring point of view, it is hard to understand the data. A single WMArchive document in MONIT includes all the details of a job; including very long error log as a text, LFNArrays/LFNArrayRef as string lists, and also many numeric values in nested step arrays. If you can separate data to two index such that first one contains only flat numeric details of a job used in monitoring and second one has general job information such as error logs, LFNArrays, etc. used in debugging, it would be a great improvement for monitoring and for future aggregations. In this regard, backward-compatibility will be an issue but it will pay off in near future. In the mean time, we can continue to use and support current monitoring. In short, new flat schema will increase the data volume, so we should get rid of the parts that are not needed for monitoring.

About Prometheus usage, I think it is not a good fit. Because you need to report all job results and none of them should be missed. Therefore, you have to expose all job metrics to Prometheus at most in each 5 minutes; which I think not possible. Also Prometheus cannot store values other than numeric type like logs, string arrays, etc.. If you can expose limited set of high-level metrics, then it can be a good use-case IMO.

vkuznet commented 1 year ago

@mrceyhun , thanks for your feedback, now it is clear to me how to move forward. For the record, originally WMArchive was designed to store FWJR docs and it was not indented to be used in monitoring. Later we realized that this data is useful and now data-ops relies on it for Monitoring their activities.

Said that, here is my proposal how to move forward:

flat out WMArchive data for monitoring purposes and adjust the service to have three different streams to MONIT 3 distinct MONIT topics:
- original WMArchive data should be intact and go to MONIT and ES/HDFS
- we create flat out WMArchive data for monitoring purposes and send to to separate topic
- we will add new configuration file to WMArchive to declare aggregation, e.g. take metrics A and B and create another one via specific math operation/function. This aggregated data will to to another MONIT stream.

If we'll add such configuration in place then we can solve the issue with code modification on CMSSW or WM sides and WMArchive will become a proxy for data injection into MONIT. It will continue to supply original data for backward compatibility, it will create new stream for data monitoring needs, and it will aggregate data based on external configuration file which we can adjust as we go. The benefit of this proposal is clear, it will allow to produce as many metrics from CMSSW/WM sides as we desire and provide dynamic metrics aggregation in WMArchive which will be considered as proxy server where WM will send the data. On a MONIT side we will not deal with aggregation either (as it is quite complex based on Ceyhun information) but we'll get the aggregated data and it can be configured.

@makortel , @amaltaro, @mrceyhun please provide your feedback on this proposal, and if we agree I can transform it into set of actions.

mrceyhun commented 1 year ago

@vkuznet I agree with 3 streams. I believe all 3 will include common fields to join them like fwjr id, job name, campaign etc.. You can use same WMArchive topic and set different "metadata.type" for new 2 streams, which creates different index templates automatically without asking to MONIT.

amaltaro commented 1 year ago

Just to make sure I am following this discussion. The 3 streams of WMArchive monitoring would be:

original WMArchive data should be intact and go to MONIT and ES/HDFS

nothing changes. I understand this will provide the backwards compatibility.

we create flat out WMArchive data for monitoring purposes and send to to separate topic

who will create this new flatted out document? Is it WMAgent, MonIT, or who will consume data from one queue and put it in another? Do I understand it right that the input would be a CMSSW FJR? Or would it be the full and nested WMArchive document?

we will add new configuration file to WMArchive to declare aggregation, e.g. take metrics A and B and create another one via specific math operation/function. This aggregated data will to to another MONIT stream.

same question as above. Who will be responsible for aggregating data? Is it WMAgent, MonIT, or who will consume data from one queue and put it in another?

In addition to these question, I think we still have to agree on which metrics we want to track and which ones we want to aggregate.

So, the original question still remains. Should we just collect EVERYTHING that CMSSW can provide us and push it upstream? Or should we collect only information that we deem necessary to push upstream and use in monitoring?

vkuznet commented 1 year ago

The WMArchive will become proxy between WM and Monit. As such, this service will consume unfiltered WM data and perform flattering and aggregation. Therefore, WM will collect CMSSW, and send all possible metrics. It can be part of WMArchive schema or not. Then WMArchive will perform filtering of data, and split it into 3 streams. The second item in my list is to convert existing WMArchive FWJR record into flatten form. This will be done by WMArchive. And, we only need to agree where to include CMSSW metrics, I suggest to add them to FWJR record we send to WMArchive as independent section. Doing this way we'll have everything presented in a record, and then WMArchive will perform the job of splitting, flatten, aggregation and send it over to MONIT.

makortel commented 1 year ago

we will add new configuration file to WMArchive to declare aggregation, e.g. take metrics A and B and create another one via specific math operation/function. This aggregated data will to to another MONIT stream.

Where is the code or logic for the aggregation hosted? How is it deployed? How can it be tested?

vkuznet commented 1 year ago

Matti, there is no yet code for aggregation and needs to be added to WMArchive. Said that, I would like to propose simple JSON syntax to specify common (or custom aggregation) function along with list of metrics to consume, e.g.

[
   {"aggregated_metric": "its_name", "metircs": [list of metrics to use], "function": "sum"},
...
]

Doing this way, the most common math function can be easily added, while if we need a custom function we'll add it manually to aggregated module. Since we host WMArchive on k8s, the deployment will only require new configuration for aggregated metrics (if we use common math functions) and new aggregated module if we'll add new custom function. And, we'll redeploy WMArchive service and neither touch WM/CMSSW or MONIT side. Then the data flow will contain new aggregated records which we may consume in different form.

makortel commented 1 year ago

I understand the code for the aggregation does not exist yet. I was rather trying to understand, if we were to follow this route, what the development and maintenance of the said logic would entail. (note that I don't know anything about WMArchive)

Do I understand correctly that if we need to apply a fix, or otherwise update the aggregation logic, the changes will apply only to new monitoring data, i.e. whatever data had been collected before the update, will stay as it was?

vkuznet commented 1 year ago

The WMArchive codebase is here: https://github.com/dmwm/WMArchive, the original implementation was python, the new one is Go based which only couple of pages of code for entire service, see https://github.com/dmwm/WMArchive/blob/master/src/go/wmarchive.go

The addition will require to produce new aggregate module to parse JSON structure and manipulate the data, the rest is already in WMArchive server code, e.g. how to handle incoming data streams and pass it to MONIT via Stomp AMQ.

Your understanding of fix logic is correct, if we need to change something it will be applied to new data, the previous data which reside in MONIT will stay intact. Said that, if we need to back-port the logic we'll need to write additional code which will get data from MONIT (HDFS) and inject it back to WMArchive or OpenSearch/ES. This is the way as do other aggregation (like site aggregation via Spark job on HDFS which run periodically on CMS Monitoring cluster). Therefore, the process will not be new and Monitoring team can extend it.

The main advantage of storing raw metrics is that it can be reprocessed again and again, while if we'll add aggregation to CMSSW or WM, then we can't recover anything since original information will be lost forever.

makortel commented 1 year ago

Thanks @vkuznet. From the CMSSW framework side I think the overall plan sounds good. We'll certainly need some help when getting started with the definitions of the high-level aggregations. And for that we still need to go through the code to understand/remind the meaning of the "raw metric numbers", and which high-level aggregations would make sense. Until then I'd ask the raw numbers to not to be propagated forward from WMArchive (except maybe something for testing purposes to make sure the whole chain works).

In addition of the I/O metrics (let me know if these would warrant their own issue), in order to better understand the various CPU efficiency numbers, it would be useful to record and show in the monitoring

begin and end timestamps (or begin timestamp and duration) of the WM program (sorry for the wrong term) that is run in the HTCondor job, and that calls cmsRun
begin and end timestamps (or begin timestamp and duration) of each cmsRun invocation
(bonus) total CPU time spent by the WM program (again sorry for the wrong term)

I'm not sure if any of these quantities are being recorded already, but if not, would it be feasible to add? (we'd also be interested of similar timekeeping by the pilot that runs the HTCondor jobs, but that's likely for other folks to talk to?)

amaltaro commented 1 year ago

I think we have identified the short and longer term developments, and the relevant follow up tickets have also been created. With that, we can now move this issue back to "In Progress" (from Waiting) and resume working on the already opened PR, such that it comes in in the next WMAgent stable release (beginning/mid of July). @vkuznet

klannon commented 1 year ago

This issue has been in waiting since May 22. @vkuznet, can you please summarize what this is waiting on?

vkuznet commented 1 year ago

@klannon , I had 2 slots of vacations, and Matti was on vacation too, then we had several iterations on implementation. This was discussed during today's WM meeting and action item on me was to prepare summary doc about these and other metrics. You can find its first draft over here. Right now, I'm awaiting final review of my PRs (from Alan) and approval from CMS Monitoring to propagate these metrics to MONIT.

amaltaro commented 1 year ago

I moved its status in the project board to "In Progress". If we get to a point that we cannot merge and/or deploy it because it has external dependencies (e.g. psutils in the worker nodes), we should move it back to Waiting.

vkuznet commented 1 year ago

Alan, I think closing of this issue was premature and should be done only when https://github.com/dmwm/WMCore/pull/11663 is merged, otherwise we can't get CMSSW metrics.

vkuznet commented 1 year ago

For the record, here is new issue https://github.com/dmwm/WMCore/issues/11758 to refer to required clean-up of performance metrics.

amaltaro commented 1 year ago

Yes, there were multiple PRs closing this issue, so once one of them got merged, it automatically closed this ticket. Now we have both PRs merged and this issue is correctly closed.

For clarity, this issue has been closed with https://github.com/dmwm/WMCore/pull/11696 and https://github.com/dmwm/WMCore/pull/11663

dmwm / WMCore

Review and improve CMSSW I/O metric reporting #11538