Testing Tool and Adjusting Parameters

dabercro commented 7 years ago

@vlimant @prozober @areinsvo @mcremone

Since I don't really know everything the workflow team or Unified needs, feel free to make any comments or pull requests. We can also track the testing progress here.

From Jean-Roch:

although we have a unified wired to testbed

https://cmst2.web.cern.ch/cmst2/unified-testbed/

it might be simpler to have it wired to production and run this in "commissioning mode".

I have the feeling, looking at the example actions, that there are too many parameters passed down. Many of them should not be needed for recovery & clone (proc version, sites, lfn, ...) since all of these can be taken from the original workflow. Let's tune this to what is actually needed.

We should modify the recoveror module to read the action json, and be able to operate it by hand. The way I see it for a fast integration is:

1) one inputs a couple of actual actions to be taken.
2) one runs the recoveror command line on this action in test mode, convinces oneself that it's fine, and then applies it.

Viewing Actions

In order to view these actions from inside the CERN network, one can look at https://vocms0113.cern.ch:80/getaction. This shows actions submitted today. You can also pass a parameter "days" to look farther back. For example looking at https://vocms0113.cern.ch:80/getaction?days=20 will show some old testing actions.
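For anyone scripting against this endpoint, here is a minimal sketch of building the query URL. It assumes only the "days" parameter described above; the helper name getaction_url is hypothetical and not part of WorkflowWebTools.

```python
from urllib.parse import urlencode

# Endpoint quoted above; only reachable from inside the CERN network.
BASE_URL = 'https://vocms0113.cern.ch:80/getaction'


def getaction_url(days=None):
    """Return the getaction URL, optionally looking `days` days back.

    `getaction_url` is a hypothetical helper used for illustration only.
    """
    if days is None:
        # No parameter: the server shows actions submitted today.
        return BASE_URL
    return BASE_URL + '?' + urlencode({'days': int(days)})


print(getaction_url())
print(getaction_url(20))
```

The resulting string can be handed to whatever HTTP client is already in use; authentication and certificate handling are left out of this sketch.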

Changing Parameters

To make it easier for everyone to track and comment on parameters, they are generated with these variables here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/static/js/addreason.js#L145-L184 The variable params results in a "decrease" "same" "increase" table, texts and bools are just text and "true/false" fields, and the opts variable results in more general radio buttons.

The site list is generated here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/templates/workflowtables.html#L19-L27 the form field is then made here https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/static/js/addreason.js#L231-L236

paorozo commented 7 years ago

Hi Daniel, for the parameters:

Please @vlimant, correct me if I am wrong.

Clone:

params
- ‘splitting’ : numerical value, must be read from original request, the WTC can change it
- ‘memory’: numerical value, must be read from original request, the WTC can change it
- ‘timeout’: integer value, must be read from original request, the WTC can change it
- ‘invalidate’: please, remove this parameter
- ‘group’: usually DATAOPS, but can be changed if needed
- ‘max_memory’: numerical value, must be read from original request, the WTC can change it

ACDC

params
- ‘memory’: numerical value, must be read from original request, the WTC can change it
- ‘timeouts’: numerical value, must be read from original request, the WTC can change it
- the next ones are better left alone; they must not be changed
- ‘replica’: must be read from original request, we do not want to change it
- ‘trustsite’: must be read from original request, we must not change it
- ‘trustPUlists’ (NEW): must be read from original request, we must not change it
- ‘LFN’: must be read from original request, we must not change it
- ‘ERA’: must be read from original request, we must not change it
- ‘procstring’: must be read from original request, we must not change it
- ‘procversion’: must be read from original request, we must not change it

opts
- ‘Activity’: 'reprocessing', 'production', ‘test’: I do not understand this option

dabercro commented 7 years ago

Sorry for the delay. I've been traveling, but I should be able to keep up again.

I've reduced the number of options in PR #8, which you guys can check. I went ahead and updated and restarted the website.

areinsvo commented 7 years ago

I've been thinking about the unwired recoveror and what we need to give it so it can implement the actions.

How do we want to figure out which task to ACDC? Or more specifically, when do we want to make that decision? In the normal Unified recoveror, all that is needed is the workflow name, and it accesses the error codes and decides what actions to take. But the idea is to move all of this upstream, correct? Or is it still up to recoveror to decide which task to ACDC?

dabercro commented 7 years ago

Yep, sorry for missing this.

If you want to add an option with either a text field or radio buttons, you can add to these lists or dictionaries for it to show up on the submission page: https://github.com/CMSCompOps/WorkflowWebTools/blob/master/runserver/static/js/addreason.js#L218-L224 (or just tell me what you want, and I can add it.)

paorozo commented 7 years ago

I also missed this...

Decisions need to be taken by task. The task's name should be included in the jsons we are sending.
Yes Dan, please add the option in the form. It needs to be a drop-down list, with the set of fixed tasks, to avoid stupid mistakes.
Also, we need to ensure we take decisions for every single task, otherwise there will be missing lumis.

Thanks!

dabercro commented 7 years ago

This might be a stupid question, but tasks are the same thing as steps, right? For example, when I look at this: https://vocms0113.cern.ch:80/seeworkflow/?workflow=fabozzi_Run2016H-v1-ZeroBiasBunchTrains1-09Nov2016_8023_161109_181412_2642 should I put another Actions option under each different link under "Steps with errors"?

paorozo commented 7 years ago

It is not a stupid question! In this case steps and tasks are the same thing. In WMCore steps are a completely different thing, so, to avoid any kind of misunderstanding, let's replace the name "steps" with "tasks" in our interface.

areinsvo commented 7 years ago

Hi Paola! When you say "we need to ensure we take decisions for every single task, otherwise there will be missing lumis," what are the possible actions? ACDC, recovery, and anything else?

Where is this going to be enforced? Dan, can you build that into your interface? ie force the operator to choose an action for every task before submitting the request? Or should I double check in the unwired recoveror module that we are doing this?

Paola, what do you think is the best way to do it?

paorozo commented 7 years ago

I think the best way to do it is checking the request in the unwired recoveror. All the logic and rules will be there, Dan's interface will be the view of our decision making.

The possible actions would be:

1 - ACDC per task. Then we have:

- Plain ACDC.
- ACDC with bigger memory.
- ACDC with finer splitting.
- ACDC with xrootd enabled (Dan, this is new).
- And of course, all possible combinations (e.g. ACDC with bigger memory, finer splitting and xrootd enabled at the same time).

2 - If we decide to kill and clone, this action will cover all the tasks.

We have another action only for rereco, it creates a recovery workflow that will process all missing lumis for a request given a datatier. I think we'd better not include it for now. We need the ACDCs and clones to work first.

dabercro commented 7 years ago

I can certainly force it, but it might be a good idea to check also, if that's easy to do.

Thanks for the list, Paola.

dabercro commented 7 years ago

So the parameters are now set for each step for ACDC. The instance on vocms0113 is currently running the setup.

@areinsvo The actions look like this now: https://vocms0113.cern.ch:80/getaction?days=6 The parameters are split with the task name as the key. Let me know how that looks.
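To illustrate reading parameters keyed by task name, here is a rough sketch. The exact JSON layout returned by getaction is not reproduced in this thread, so the 'Action' and 'Parameters' keys and the example values below are assumptions, and params_by_task is a hypothetical helper.

```python
def params_by_task(action):
    """Yield (task name, parameters) pairs from one action entry.

    Assumes the per-task parameters live under a 'Parameters' key;
    that layout is an assumption, not a documented format.
    """
    for task, params in sorted(action.get('Parameters', {}).items()):
        yield task, params


# Hypothetical action entry; the memory value is made up for illustration.
example = {
    'Action': 'acdc',
    'Parameters': {
        'DataProcessing': {'memory': '3000'},
        'DataProcessing/DataProcessingMergeMINIAODoutput': {},
    },
}

for task, params in params_by_task(example):
    print(task, params)
```

An empty dictionary for a task would then stand for a plain ACDC with nothing changed, if that is how the form ends up encoding it.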

@prozober Unfortunately (?) there are no multi-step errors listed at the moment, but feel free to check out any of the workflow views to compare the user interface for Kill & Clone and ACDC.

paorozo commented 7 years ago

Hi Dan, we have plenty of requests that need ACDCs in multiple tasks. I will send all actions I can from our interface.

dabercro commented 7 years ago

Great! Let me know how you feel about the current interface. If everything looks okay, I will focus on adding more information for decision making.

I guess we should also rethink the "Apply to multiple" list for ACDCs (which right now doesn't allow changing parameters per step).

paorozo commented 7 years ago

Resuming the task. I am going to try an action, just to explain myself better, it is not a real action to take. https://vocms0113.cern.ch:80/globalerror

The wf fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch2-09Nov2016_8023_161109_182009_9262 has three tasks with failures:

1) DataProcessing
2) DataProcessing/DataProcessingMergeALCARECOStreamLumiPixelsMinBias
3) DataProcessing/DataProcessingMergeMINIAODoutput

https://vocms0113.cern.ch:80/seeworkflow/?workflow=fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch2-09Nov2016_8023_161109_182009_9262

For example, I would say for every task:

1) ACDC with bigger memory and finer splitting.
2) Plain ACDC
3) Plain ACDC

So far so good. Now, several wfs are having the same problem for the task DataProcessing. I would like to go for the same action for those wfs, but only for that task. So, I think we need to include that "Apply to multiple" option for every task we are dealing with.

And here @areinsvo, applying to multiple would leave orphan tasks without recovery. Hmmm, then I guess we had better list pending tasks to recover here, not workflows: https://vocms0113.cern.ch:80/globalerror

I don't know if it is possible. If one workflow has an active ACDC, it is put out of the manual-assistance list, @areinsvo you think it is crazy if we handle the state transition to manual-assistance only if all tasks (having failures) have active ACDCs?

dabercro commented 7 years ago

@prozober I think I added a lot of the information that you would want to make decisions with the last PR #16

For example (for now), check out https://vocms0113.cern.ch:80/seeworkflow/?workflow=prozober_Run2016B-v2-ZeroBias0-17Jan2017_8020_170201_092625_2960 and let me know how that looks. Not all of the error codes had procedures listed in the WTC twiki, but in this particular example workflow, the error parsing is working well.

dabercro commented 7 years ago

That seems to happen when the workflow has been handled already. Thanks for pointing it out. I'll try to fix that error. Were you able to view other workflows on the page? If I click around from the globalerrors page ( https://vocms0113.cern.ch:80/globalerror) everything seems to work fine.

On Mon, Feb 6, 2017 at 2:59 AM, Paola Katherine Rozo < notifications@github.com> wrote:

Thanks, Daniel. I am getting this when I try to access the link you pointed out:

500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/local/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "./workflowtools.py", line 206, in seeworkflow
    get_clustered_group(workflow, self.clusterer, cherrypy.session)
  File "/home/dabercro/OpsSpace/WorkflowWebTools/clusterworkflows.py", line 237, in get_clustered_group
    (group[0][0], group[0][1]))
IndexError: list index out of range


dabercro commented 7 years ago

@prozober #17 fixes the 500 Error you pointed out. The global errors have consistent colors for error codes between different pie charts too now. Both fixes are already running.

vlimant commented 7 years ago

got curious and clicked https://vocms0113.cern.ch:80/seeworkflow/?workflow=fabozzi_Run2016D-23Sep2016-v1-MET-03Feb2017_8026p1_170203_162315_612

got

500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/local/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "./workflowtools.py", line 223, in seeworkflow
    classification=main_error_class
  File "/usr/local/lib/python2.7/site-packages/mako/template.py", line 462, in render
    return runtime._render(self, self.callable_, args, data)
  File "/usr/local/lib/python2.7/site-packages/mako/runtime.py", line 838, in _render
    **_kwargs_for_callable(callable_, data))
  File "/usr/local/lib/python2.7/site-packages/mako/runtime.py", line 873, in _render_context
    _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
  File "/usr/local/lib/python2.7/site-packages/mako/runtime.py", line 899, in _exec_template
    callable_(context, *args, **kwargs)
  File "/home/dabercro/OpsSpace/WorkflowWebTools/runserver/templates/mako_modules/workflowtables.html.py", line 75, in render_body
    __M_writer(unicode(classification[3]))
IndexError: tuple index out of range

dabercro commented 7 years ago

Hmm, I'm getting a redirect back to the global error page... I'll check the logs and see if I can fix it though.

dabercro commented 7 years ago

Sorry, it was a very stupid mistake. Thanks for catching it. Fixed here: #18

areinsvo commented 7 years ago

A couple questions to make sure my understanding is correct. Other than these points, I think the script is (finally) ready to be used to actually submit ACDCs!

1) When a value is input for memory, that's the new memory value I should use for the ACDC, right? The other option is to put in a factor or amount to increase the memory by.
2) @vlimant, I'm going to use getWmErr() to double check that we are creating an ACDC for every step. WmErr includes LogCollect tasks though. Can these be safely ignored?
3) Do I need to check for unreported errors in this script? If there are unreported errors, what action needs to be taken?

dabercro commented 7 years ago

I think the memory is the new memory value, yes. At least, that's what I would have thought. @prozober should answer for sure though.

I am actually reporting 'NotReported' at the moment (as error code -1 with 1 error). So those workflows are showing up with this tool. Again, Paola can answer whether or not that's how she's intending to use it.

paorozo commented 7 years ago

Allie, Dan:

The provided input will be the new memory value. We need to provide the exact value we want to try.
LogCollect must be ignored in every case.
We need to check for unreported errors. We can take two actions: Plain ACDC, and ACDC over xrootd assigned to a different site from the sites included in the ACDC document. Notice that if no values are provided, we need to get that info from the OriginalRequest.
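A minimal sketch of the "every failing task needs an action" check, with LogCollect ignored as stated above. The real check would get the failing tasks from getWmErr(); here they are passed in as a plain list. Matching LogCollect tasks by the substring 'LogCollect' is an assumption about how those task names look, and tasks_missing_actions is a hypothetical name.

```python
def tasks_missing_actions(failed_tasks, task_params):
    """Return failing tasks that have no entry in the per-task parameters.

    failed_tasks: iterable of task names that reported errors.
    task_params:  dict of submitted parameters keyed by task name.
    LogCollect tasks are skipped; identifying them by substring is an
    assumption made for this sketch.
    """
    return sorted(
        task for task in failed_tasks
        if 'LogCollect' not in task and task not in task_params
    )


failed = [
    'DataProcessing',
    'DataProcessing/LogCollectForDataProcessing',  # hypothetical task name
    'DataProcessing/DataProcessingMergeMINIAODoutput',
]
submitted = {'DataProcessing': {'memory': '3000'}}  # made-up value

print(tasks_missing_actions(failed, submitted))
```

If the returned list is not empty, the script could refuse to submit anything, so that no task is left without recovery.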

paorozo commented 7 years ago

As you may have noticed, the action to be taken includes the assignment of the ACDC, if needed.

When we kill and clone a workflow, we ignore the assignment.
If we make a plain ACDC, or a modified ACDC, we need to assign it.

When we clone a workflow or create an ACDC, the required parameters are:
- Memory
- Splitting

For the assignment the parameters would be:

xrootd
sites

We haven't discussed the sites where we are going to assign the ACDCs. I have only used two options:

- The option -s ACDC (https://github.com/CMSCompOps/WmAgentScripts/blob/master/assign.py#L307) picks up the sites from the ACDC documents.
- The option above does not work if we have a merge task failing at a site that went to drain. In that case, we assign the ACDC to a set of sites, enabling the xrootd option.
We can use a drop-down list for selecting the sites we want to use. @dabercro do you have another idea?

dabercro commented 7 years ago

I had a checklist for sites in the past, but I think Jean-Roch said it was a little much. I would definitely agree if we had separate lists for each step...

It'll probably be less cluttered if I do a drop down list though and allow adding, like for the reasons. Does that sound reasonable? About how many sites at a time do you think you'd use? (That will help me guess what the best way to display it would be.)

paorozo commented 7 years ago

That's the way I am doing it today. I select the sites for every task, look at this mess: https://cmst2.web.cern.ch/cmst2/unified/report/pdmvserv_task_JME-PhaseIFall16GS-00003__v1_T_170127_092523_4740 My actions would be

Task JME-PhaseIFall16GS-00003_0MergeRAWSIMoutput: plain ACDC, and assign to sites from the ACDC documents.
Task JME-PhaseIFall16DR-00002_0: ACDC with finer splitting, and assign to sites from the ACDC documents over xrootd.
Task JME-PhaseIFall16DR-00002_1: ACDC with bigger memory, and assign to the sites from the ACDC documents except T1_IT_CNAF, over xrootd.
Task JME-PhaseIFall16MiniAOD-00002_0MergeMINIAODSIMoutput: plain ACDC, I do not trust Brunel, so I will assign to FNAL and CERN over xrootd.

I can select as many sites as I want. But, it would be useful if by default the sites from the ACDC documents would be selected. What do you think? Too messy?

dabercro commented 7 years ago

Okay, I added the site list back in for each individual step. To make it easier to read, I also made the sites listed in the recovery docs bold faced. (In order for it to look nice, I needed to reload the page to get the new style sheet. Depends on your browser...) #21

@areinsvo Watch out for the site list. If only one site is checked, the parameter returned is a string. If more than one is checked, it's a list. At least, that's my impression after a quick test. Check the variable type first.
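One way to guard against the string-or-list behaviour is to normalize the value before using it. This is only a sketch: as_site_list is a hypothetical helper, the site names other than T1_IT_CNAF are illustrative, and under Python 2 the check would be against basestring rather than str.

```python
def as_site_list(sites):
    """Return the submitted site parameter as a list, whatever its type."""
    if sites is None:
        # Nothing was checked in the form.
        return []
    if isinstance(sites, str):
        # A single checked site arrives as a bare string.
        return [sites]
    # Several checked sites arrive as a list already.
    return list(sites)


print(as_site_list('T1_IT_CNAF'))
print(as_site_list(['T1_US_FNAL', 'T2_CH_CERN']))
```

Normalizing once at the point where the action is read means the rest of the script can always loop over a list.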

paorozo commented 7 years ago

Dan, look at this https://vocms0113.cern.ch:80/seeworkflow/?workflow=pdmvserv_task_SMP-RunIISummer15wmLHEGS-00086__v1_T_170204_181432_946 AttributeError: 'WorkflowInfo' object has no attribute 'get_workflow_params'

dabercro commented 7 years ago

Fixed. I'm sorry about that. Thanks for catching it. I need to sit down and write unit tests for these pages.

paorozo commented 7 years ago

Dan, when I try to open https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_EXO-RunIISummer15wmLHEGS-04239__v1_T_170114_004935_4431 I get: SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:590)

dabercro commented 7 years ago

Okay, I fixed that, sorry. I'll need to come up with a way to automatically regenerate proxies.

paorozo commented 7 years ago

Dan, all Unified monitoring pages/json were moved, now we need to point at https://vocms049.cern.ch/unified/

dabercro commented 7 years ago

@vlimant When I try to access the .json files on vocms049, I get a 403 (forbidden) error. I'm guessing your directories /var/www/html and /var/www/html/unified are not +x for the apache user. They are owned by root:zh and vlimant:zh, respectively, both with permissions flag 774.
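To spell out the guess about 774: a directory can only be traversed by a user who has the execute bit, and with mode 774 users outside the owner and group only get read. The sketch below checks that bit with the standard stat module. It assumes the apache user is neither the owner nor a member of the zh group, which is not confirmed here; other_can_traverse is a hypothetical helper.

```python
import stat


def other_can_traverse(mode):
    """True if users outside the owner and group may enter a directory with this mode."""
    return bool(mode & stat.S_IXOTH)


print(other_can_traverse(0o774))  # False: "other" has read only
print(other_can_traverse(0o775))  # True: "other" has read and execute
```

On the server itself the mode could be read with os.stat(path).st_mode and passed to the same function.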

dabercro commented 7 years ago

Okay, I have the SSO cookies working with #25. Sorry that took so long. all_errors.json is empty at the moment, but the site is back up.

paorozo commented 7 years ago

Hi @areinsvo, I think we have the perfect workflow to take some actions with our tool. https://vocms0113.cern.ch:80/seeworkflow/?workflow=pdmvserv_task_EXO-RunIISummer15GS-03460__v1_T_170223_220648_3124 It contains reading problems, submit failures and unreported issues at T0, Florida and RWTH. The only choice we have is to try a plain ACDC round. I already sent the action https://vocms0113.cern.ch:80/getaction.

areinsvo commented 7 years ago

Thanks Paola. I think I just submitted the ACDCs on the two tasks in EXO-RunIISummer15GS-03460: areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170228_185246_6339 areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170228_185238_4716

Please let me know if something isn't right here.

Still on my to-do list:

1) Implement the xrootd setting (@prozober or @vlimant, what setting do I have to change in the schema for this? I can find it if I look harder, but if you know it that's probably easier).
2) Check for unreported errors on my end. If there are unreported errors for a task, do not change the splitting or memory setting (i.e. "plain ACDC").

vlimant commented 7 years ago

are we going down the road of not having the unreported error displayed in the globalerror page and done under-the-hood by recoveror ? if yes, you see me frowning ...

dabercro commented 7 years ago

It was missing last week because I changed and broke something. At the moment, 'NotReported' is back to error code -1 as I commented two weeks ago. Still not ideal, I think...

areinsvo commented 7 years ago

I asked @prozober if I also need to check for unreported errors in recoveror, and I took her response to mean yes, so that recoveror doesn't try to do anything illegal/dumb if there are unreported errors present. She can correct me if I misunderstood.

The other option would be to leave it up to the WTC to submit the right parameters in the case of unreported errors, and have recoveror not worry about it.

paorozo commented 7 years ago

I am sorry I did not make myself clear. The WTC must decide what action to take for every unreported error. But of course, Unified must be aware of the unreported errors for every task.

paorozo commented 7 years ago

The ACDCs look OK. Now, for the assignment:

In this case, I did not provide the xrootd value. We need to take it from the original request. @dabercro, does the action we send assign something like xrootd=NULL if I don't fill in that field?
The parameter we need to change in the schema is "TrustSitelists".

Daniel, we need to include another parameter besides xrootd, secondary_xrootd. I usually do not modify that value, but it is better to take it into account. secondary_xrootd will behave as xrootd, but @areinsvo the parameter we need to change in the schema would be "TrustPUSitelists".
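A rough sketch of how the two options could be mapped onto the schema fields named above. It assumes the submitted values are the strings 'enabled' and 'disabled' and that the schema fields hold booleans; both are assumptions, and apply_xrootd is a hypothetical helper rather than existing recoveror code.

```python
# Form option -> schema field, as described in this comment.
XROOTD_TO_SCHEMA = {
    'xrootd': 'TrustSitelists',
    'secondary_xrootd': 'TrustPUSitelists',
}


def apply_xrootd(schema, params):
    """Return a copy of the schema with the trust flags set from the action parameters.

    Options that were not submitted are left untouched, so the value
    from the original request is kept.
    """
    updated = dict(schema)
    for option, schema_key in XROOTD_TO_SCHEMA.items():
        value = params.get(option)
        if value is None:
            continue  # not submitted: keep the original value
        if value not in ('enabled', 'disabled'):
            raise ValueError('unexpected %s value: %r' % (option, value))
        # Assumed encoding: 'enabled' -> True, 'disabled' -> False.
        updated[schema_key] = (value == 'enabled')
    return updated


original = {'TrustSitelists': False, 'TrustPUSitelists': False}
print(apply_xrootd(original, {'xrootd': 'enabled'}))
```

Returning a copy keeps the original request schema intact, which makes it easy to compare the two in test mode before anything is submitted.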

areinsvo commented 7 years ago

Thanks Paola! For this particular workflow, you didn't put anything in xrootd, so xrootd just wasn't present in the action JSON (so I don't change anything). I think that works fine. I'll edit the code to properly change TrustSitelists if the xrootd parameter is set.

areinsvo commented 7 years ago

I guess this shows my lack of knowledge about the details - how are ACDCs normally assigned? Do we want the same modified recoveror script to automatically create and assign the ACDCs? (I thought the script was already doing this, but it sets the status to assignment-approved, not assigned)

vlimant commented 7 years ago

the current recoveror does both create and assign indeed, and I think we need to have it this way

areinsvo commented 7 years ago

Ah, I see that now. I didn't copy that part of the code over correctly for some reason. I'll fix the code ASAP. In the meantime, should we force the assignment of the two ACDCs I created earlier but didn't assign? Or delete those ACDCs and create them from scratch and assign them later today when my script is fixed?

vlimant commented 7 years ago

you can reject the two acdc and remake-assign on that workflow

areinsvo commented 7 years ago

I rejected the two ACDCs and the script is ready to be tested, except for one thing: once the new ACDCs are created and assigned, how do I change the status of the workflow from assistance-manual to assistance-recovering? The normal recoveror uses assignSession, but that's what I'm trying to decouple from in this script (since this will be run manually, not as part of the normal Unified cycle). @vlimant?

vlimant commented 7 years ago

If AC/DC are found, checkor will change the status accordingly. The synchronization of action setting and running it needs to be thought through. If everything is run under 049, then it can stay with changing the status in the db directly; I don't see an issue with that.

areinsvo commented 7 years ago

What do you mean by "it can stay with changing the status in the db directly"? You mean the new recoveror (it really needs a new name...) can/should change the status directly?

For the test today, is it okay to just let checkor handle changing the status after I create and assign the ACDCs?

dabercro commented 7 years ago

@prozober I'm catching up now. To answer your earlier question about unassigned xrootd, if you don't assign a value, it won't show up on the parameters at all. The best thing for @areinsvo to do in that case is something like:

use_xrootd = response[prepID]['Parameters'].get('xrootd')
if use_xrootd is None:
    # Get the default value of xrootd. some_function_call is a placeholder
    # that takes whatever parameters are needed to get the recovery docs.
    use_xrootd = some_function_call(params)

# More compact, but less clear (and the fallback call is always evaluated):
# use_xrootd = response[prepID]['Parameters'].get('xrootd', some_function_call(params))

if use_xrootd == 'enabled':
    pass  # Using enabled option
elif use_xrootd == 'disabled':
    pass  # Using disabled option
else:
    pass  # Error handling

I also added the secondary option. I'll push at least that to the server if I don't get anything else working by the end of today.

CMSCompOps / WorkflowWebTools

Testing Tool and Adjusting Parameters #6
