CMSCompOps / wtc-console

MIT License
1 stars 1 forks source link

Uploading Errors to CMSMONIT #3

Closed dabercro closed 5 years ago

dabercro commented 5 years ago

We can look into uploading information for each workflow to CMSMONIT. This will allow us to use Grafana to:

Ticket with Valentin to get the information and table dashboard is here: https://its.cern.ch/jira/browse/CMSMONIT-45

dabercro commented 5 years ago

@phylsix Let me know if you have questions about the information in the ticket. I can make a first script that puts the JSON file together. We can track progress here.

phylsix commented 5 years ago

Hey, as far as I can understand from the link above, grafana is time-series oriented, also the value of the cells in the table is restricted to numbers. A lot of times we want to save or display strings, is it possible?

dabercro commented 5 years ago

Yeah, that is one thing that Valentin pointed out as different from our general use case. I think the time series can still be useful to see how errors are reported (especially for ACDCs), and we can still make tables with just the latest timestamped information.

I don't know if there's an easy way to display strings with Grafana, but Valentin said we can also run python scripts directly on the Spark cluster to do whatever analysis we want later. I think it's worth saving some of the relevant string data until then. In particular, Sharad uses the log snippets to make decisions.

phylsix commented 5 years ago

Ok, so there is no problems in saving strings. That would be good for now.

dabercro commented 5 years ago

I'm going to spend some time trying to learn Grafana through the tutorial here: http://docs.grafana.org/guides/getting_started/

For a first pass, I will just try to push the information here: https://github.com/dabercro/WorkflowWebTools/commit/d0a81f92c853127ba0e378e20909dd3270b4de5a

phylsix commented 5 years ago

During the query to get consistent failure rate of a workflow with the number showed up in wmstats page, I noticed two differences on how to treat error counts with respect to workflowwebtools:

  1. 'NotReported' are not treated as errors (interpreted as -1 in workflowwebtools)
  2. Errors happened in 'LogCollect' or 'Cleanup' type step are also included into rate calculation. (not in workflowwebtools)

I followed the same route as wmstats on these two points, and extracted relevant information of a workflow that could potentially be useful for OSDroid to a JSON, which includes failure rate, failure keywords, status, tasks and error codes, counts, site etc. An example can be found on vocms0116:/home/wsi/WorkflowWebTools/bin/vomit.json.

Do you think this would be a good information to be stored/monitored on periodical/hourly basis?

dabercro commented 5 years ago

Yeah, I think this all would be interesting to keep.

Good catches. The LogCollect and Cleanup were asked to be filtered out from the webpage, but it probably doesn't hurt to keep the data around. Also, the NotReported errors are retrieved from CouchDB: https://github.com/CMSCompOps/WorkflowWebTools/blob/master/workflowwebtools/workflowinfo.py#L278 We should probably include those in the information we save too.

phylsix commented 5 years ago

Ok, I saved NotReported sites list as siteNotReported for each task of a workflow now.

dabercro commented 5 years ago

@phylsix To get the list of workflows to cycle through, you can connect to the Unified database like this: https://github.com/CMSCompOps/WorkflowWebTools/blob/b9eb16f687ce6dcd892b37f0fdb034f890d72436/workflowwebtools/errorutils.py#L56-L61

You won't need the STATUS constraint, I think.

I will send you over Skype where the connection parameters can be found.

dabercro commented 5 years ago

@vlimant If we want to use this for OS Driod also, what statuses should we be looking at? I'm guessing it's more than just "manual", but we can't loop over all 100k workflows in the database either. Is 'STATUS NOT LIKE "%unlock%"' a good subset?

dabercro commented 5 years ago

@phylsix Can you link to a workflow JSON at this ticket? https://its.cern.ch/jira/browse/CMSMONIT-45

phylsix commented 5 years ago

@dabercro OK, I just added a comment with an link.

One question: I tried to access the oracle db with the parameters you provided after pulling the latest master branch on vocms0116, but failed. It shows the following message:

oracle_db_conn = cx_Oracle.connect(*config_dict['oracle']) # pylint:disable=c-extension-no-member cx_Oracle.DatabaseError: DPI-1047: 64-bit Oracle Client library cannot be loaded: "libclntsh.so: cannot open shared object file: No such file or directory". See https://oracle.github.io/odpi/doc/installation.html#linux for help

I wonder if this is related with environment issue or something else? Do you happen to know?

dabercro commented 5 years ago

Ah, there's another package I had to install to get it to work (oracle-instantclient-basic). It's on vocms0116 now. Let me know if it still doesn't work for you.

phylsix commented 5 years ago

Valentin's last comment at https://its.cern.ch/jira/browse/CMSMONIT-45 was about how we would like to aggregate/display error data of interesting workflows (An example can be seen at http://wsi.web.cern.ch/wsi/public/toSaveExample3.json). As it can affect storage strategy and document structure, I guess.

Below it's my thought:


Data collection, what to save:

Periodically launch (0.5h),

Data display, how to aggregate:

@dabercro Do you have any suggestions/ideas?

dabercro commented 5 years ago

This looks like a pretty good list of monitoring plots to have!

One more thing which the monitoring team already knows about is the table (example table https://monit-grafana.cern.ch/d/XcvyO_-mk/_demo-table?orgId=11) which would have error codes on one axis, site names on the other, and number of errors in the cell. From your example JSON, we should be able to make that.

Do you know how long it takes to put together a JSON file? Running over everything every half hour sounds ambitious to me, but I could be wrong.

One more thing that might be worth saving is the job parameters (https://dabercro.web.cern.ch/dabercro/data/jsondumps/params.json), but these won't change over time, so they should probably be in a separate collection anyway. Related to this, I have a quick question: Do you calculate the failure rate or is that given directly somewhere in WMStats?

phylsix commented 5 years ago

I did a test, by connecting the oracle db you provided, it costs about 8min15s to parse 245 workflows, while only handing the ones whose failure rate > 20%.

Yeah, I think job parameters can be saved in a separate collection, which can be queried in some way. The failure rate is a number one can see in WMStats page, but I cannot find an api for it. So I work out a way to calculate it from number of success and number or errors and be able to be consistent. If you know a way to get the number directly, let me know and I can use that for sure.

dabercro commented 5 years ago

Okay, so if I do the following:

select count(*) from CMS_UNIFIED_ADMIN.workflow WHERE lower(STATUS) NOT LIKE '%unlock%';

I get 3323 workflows right now. I think we will probably want to run this every two or three hours then.

Also, does CMSMONIT include the timestamp for you when you upload info, or should we add a timestamp member (just something like int(time.time()))?

phylsix commented 5 years ago

Ok, that sounds good. Does lower(STATUS) NOT LIKE '%unlock%' only select those that are on-going? Will workflows that are completed also go into that selection as well?

I am not sure if they would add timestamp automatically for us, if not, yeah, we should add a timestamp member.

dabercro commented 5 years ago

I pinged P&R and it turns out that the best query would actually be:

SELECT count(*) FROM CMS_UNIFIED_ADMIN.WORKFLOW WHERE WM_STATUS LIKE 'running%';

(Replace count(*) with NAME or whatever you need)

dabercro commented 5 years ago

That's coming out to 2760 workflows right now.

phylsix commented 5 years ago

I uploaded the script here: https://github.com/phylsix/WorkflowWebTools/tree/master/workflowmonit

dabercro commented 5 years ago

This is working now: https://its.cern.ch/jira/browse/CMSMONIT-45