Closed by dabercro 6 years ago
Hi Daniel, for the parameters:
Please, @vlimant, correct me if I am wrong.
Clone:
ACDC
Sorry for the delay. I've been traveling, but I should be able to keep up again.
I've reduced the number of options in PR #8, which you guys can check. I went ahead and updated and restarted the website.
I've been thinking about the unwired recoveror and what we need to give it so it can implement the actions.
How do we want to figure out which task to ACDC? Or more specifically, when do we want to make that decision? In the normal Unified recoveror, all that is needed is the workflow name, and it accesses the error codes and decides what actions to take. But the idea is to move all of this upstream, correct? Or is it still up to recoveror to decide which task to ACDC?
Yep, sorry for missing this.
If you want to add an option with either a text field or radio buttons, you can add to these lists or dictionaries for it to show up on the submission page: https://github.com/CMSCompOps/WorkflowWebTools/blob/master/runserver/static/js/addreason.js#L218-L224 (or just tell me what you want, and I can add it.)
I also missed this...
Thanks!
This might be a stupid question, but by tasks that is the same thing as steps, right? For example, when I look at this: https://vocms0113.cern.ch:80/seeworkflow/?workflow=fabozzi_Run2016H-v1-ZeroBiasBunchTrains1-09Nov2016_8023_161109_181412_2642 should I put another Actions option under each different link under Steps with errors?
It is not a stupid question! In this case steps and tasks are the same thing. In WMCore, steps are a completely different thing, so, to avoid any kind of misunderstanding, let's replace the name "steps" with "tasks" in our interface.
Hi Paola! When you say "we need to ensure we take decisions for every single task, otherwise there will be missing lumis," what are the possible actions? ACDC, recovery, and anything else?
Where is this going to be enforced? Dan, can you build that into your interface, i.e., force the operator to choose an action for every task before submitting the request? Or should I double check in the unwired recoveror module that we are doing this?
Paola, what do you think is the best way to do it?
I think the best way to do it is checking the request in the unwired recoveror. All the logic and rules will be there, Dan's interface will be the view of our decision making.
The possible actions would be: 1 - ACDC per task. Then we have:
2 - If we decide to kill and clone, this action will cover all the tasks.
We have another action only for rereco, it creates a recovery workflow that will process all missing lumis for a request given a datatier. I think we'd better not include it for now. We need the ACDCs and clones to work first.
I can certainly force it, but it might be a good idea to check also, if that's easy to do.
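A minimal sketch of the double check discussed above, as it might look on the unwired recoveror side. The shape of the action dictionary (an `"Action"` field plus `"Parameters"` keyed by task name) and the action labels `"acdc"` and `"clone"` are assumptions for illustration, not the actual format served by the web tool.

```python
def missing_task_actions(tasks_with_errors, action):
    """Return the tasks that have errors but no action assigned.

    `action` is a hypothetical dict such as
    {"Action": "acdc", "Parameters": {task_name: {...}, ...}}.
    """
    if action.get("Action") == "clone":
        # Kill and clone covers all the tasks at once.
        return []
    covered = set(action.get("Parameters", {}))
    return sorted(set(tasks_with_errors) - covered)


# Example: the action only covers one of two failing tasks.
example = {"Action": "acdc", "Parameters": {"DataProcessing": {}}}
print(missing_task_actions(["DataProcessing", "Merge"], example))
```

A request would only be executed when this returns an empty list, so no task is left without a decision.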
Thanks for the list, Paola.
So the parameters are now set for each step for ACDC. The instance on vocms0113 is currently running the setup.
@areinsvo The actions look like this now: https://vocms0113.cern.ch:80/getaction?days=6 The parameters are split with the task name as the key. Let me know how that looks.
@prozober Unfortunately (?) there are no multi-step errors listed at the moment, but feel free to check out any of the workflow views to compare the user interface for Kill & Clone and ACDC.
Hi Dan, we have plenty of requests that need ACDCs in multiple tasks. I will send all actions I can from our interface.
Great! Let me know how you feel about the current interface. If everything looks okay, I will focus on adding more information for decision making.
I guess we should also rethink the "Apply to multiple" list for ACDCs (which right now doesn't allow changing parameters per step).
Resuming the task. I am going to try an action, just to explain myself better, it is not a real action to take. https://vocms0113.cern.ch:80/globalerror
The wf fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch2-09Nov2016_8023_161109_182009_9262 has three tasks with failures:
For example, I would say for every task:
So far so good. Now, several wfs are having the same problem for the task DataProcessing. I would like to go for the same action for those wfs, but only for that task. So, I think we need to include that "Apply to multiple" option for every task we are dealing with.
And here, @areinsvo, applying to multiple would leave orphan tasks without recovery. Hmm, then I guess we'd better list pending tasks to recover, not workflows, here: https://vocms0113.cern.ch:80/globalerror
I don't know if it is possible. If one workflow has an active ACDC, it is put out of the manual-assistance list, @areinsvo you think it is crazy if we handle the state transition to manual-assistance only if all tasks (having failures) have active ACDCs?
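The rule proposed here (only move a workflow out of manual assistance once every task with failures has an active ACDC) can be sketched as below. The function names and the plain lists of task names are hypothetical; how Unified actually tracks active ACDCs is not shown in this thread.

```python
def pending_tasks(failed_tasks, tasks_with_active_acdc):
    """Tasks that have failures but no active ACDC yet, in input order."""
    active = set(tasks_with_active_acdc)
    return [task for task in failed_tasks if task not in active]


def can_leave_manual_assistance(failed_tasks, tasks_with_active_acdc):
    """True only when no failing task is left without an active ACDC."""
    return not pending_tasks(failed_tasks, tasks_with_active_acdc)


failed = ["DataProcessing", "Merge"]           # illustrative task names
print(pending_tasks(failed, ["DataProcessing"]))
print(can_leave_manual_assistance(failed, ["DataProcessing"]))
```

The same `pending_tasks` list is what a per-task view of the global error page could display instead of whole workflows.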
@prozober I think I added a lot of the information that you would want to make decisions with the last PR #16
For example (for now), check out https://vocms0113.cern.ch:80/seeworkflow/?workflow=prozober_Run2016B-v2-ZeroBias0-17Jan2017_8020_170201_092625_2960 and let me know how that looks. Not all of the error codes had procedures listed in the WTC twiki, but in this particular example workflow, the error parsing is working well.
That seems to happen when the workflow has been handled already. Thanks for pointing it out. I'll try to fix that error. Were you able to view other workflows on the page? If I click around from the globalerrors page ( https://vocms0113.cern.ch:80/globalerror) everything seems to work fine.
On Mon, Feb 6, 2017 at 2:59 AM, Paola Katherine Rozo < notifications@github.com> wrote:
Thanks, Daniel. I am getting this when I try to access the link you pointed out:
500 Internal Server Error The server encountered an unexpected condition which prevented it from fulfilling the request.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/local/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "./workflowtools.py", line 206, in seeworkflow
    get_clustered_group(workflow, self.clusterer, cherrypy.session)
  File "/home/dabercro/OpsSpace/WorkflowWebTools/clusterworkflows.py", line 237, in get_clustered_group
    (group[0][0], group[0][1]))
IndexError: list index out of range
@prozober #17 fixes the 500 Error you pointed out. The global errors have consistent colors for error codes between different pie charts too now. Both fixes are already running.
Got curious and clicked https://vocms0113.cern.ch:80/seeworkflow/?workflow=fabozzi_Run2016D-23Sep2016-v1-MET-03Feb2017_8026p1_170203_162315_612
and got:
500 Internal Server Error
The server encountered an unexpected condition which prevented it from fulfilling the request.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/local/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "./workflowtools.py", line 223, in seeworkflow
    classification=main_error_class
  File "/usr/local/lib/python2.7/site-packages/mako/template.py", line 462, in render
    return runtime._render(self, self.callable_, args, data)
  File "/usr/local/lib/python2.7/site-packages/mako/runtime.py", line 838, in _render
    **_kwargs_for_callable(callable_, data))
  File "/usr/local/lib/python2.7/site-packages/mako/runtime.py", line 873, in _render_context
    _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
  File "/usr/local/lib/python2.7/site-packages/mako/runtime.py", line 899, in _exec_template
    callable_(context, *args, **kwargs)
  File "/home/dabercro/OpsSpace/WorkflowWebTools/runserver/templates/mako_modules/workflowtables.html.py", line 75, in render_body
    __M_writer(unicode(classification[3]))
IndexError: tuple index out of range
Hmm, I'm getting a redirect back to the global error page... I'll check the logs and see if I can fix it though.
Sorry, it was a very stupid mistake. Thanks for catching it. Fixed here: #18
A couple questions to make sure my understanding is correct. Other than these points, I think the script is (finally) ready to be used to actually submit ACDCs!
I think the memory is the new memory value, yes. At least, that's what I would have thought. @prozober should answer for sure though.
I am actually reporting 'NotReported' at the moment (as error code -1 with 1 error). So those workflows are showing up with this tool. Again, Paola can answer whether or not that's how she's intending to use it.
Allie, Dan:
As you could notice, the action to be taken includes the assignment of the ACDC, if needed.
If we make a plain ACDC, or a modified ACDC, we need to assign it.
When we clone a workflow or create an ACDC, the required parameters are:
For the assignment the parameters would be:
We haven't discussed the sites where we are going to assign the ACDCs. I have only used two options:
I had a checklist for sites in the past, but I think Jean-Roch said it was a little much. I would definitely agree if we had separate lists for each step...
It'll probably be less cluttered if I do a drop down list though and allow adding, like for the reasons. Does that sound reasonable? About how many sites at a time do you think you'd use? (That will help me guess what the best way to display it would be.)
That's the way I am doing it today. I select the sites for every task, look at this mess: https://cmst2.web.cern.ch/cmst2/unified/report/pdmvserv_task_JME-PhaseIFall16GS-00003__v1_T_170127_092523_4740 My actions would be
I can select as many sites as I want. But, it would be useful if by default the sites from the ACDC documents would be selected. What do you think? Too messy?
Okay, I added the site list back in for each individual step. To make it easier to read, I also made the sites listed in the recovery docs bold faced. (In order for it to look nice, I needed to reload the page to get the new style sheet. Depends on your browser...) #21
@areinsvo Watch out for the site list. If only one site is checked, the parameter returned is a string. If more than one is checked, it's a list. At least, that's my impression after a quick test. Check the variable type first.
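One way to guard against the string-versus-list behavior is to normalize the value before using it. This is a sketch in Python 3 (the server in this thread runs Python 2.7, where the check would be against `basestring`); the helper name is hypothetical and the site names are only examples.

```python
def normalize_sites(value):
    """Always return a list of site names.

    Handles the three cases seen from the form: nothing checked (None),
    one site checked (a single string), several checked (a list).
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


print(normalize_sites("T2_CH_CERN"))
print(normalize_sites(["T2_CH_CERN", "T1_US_FNAL"]))
```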
Dan, look at this https://vocms0113.cern.ch:80/seeworkflow/?workflow=pdmvserv_task_SMP-RunIISummer15wmLHEGS-00086__v1_T_170204_181432_946 AttributeError: 'WorkflowInfo' object has no attribute 'get_workflow_params'
Fixed. I'm sorry about that. Thanks for catching it. I need to sit down and write unit tests for these pages.
Dan, when I try to https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_EXO-RunIISummer15wmLHEGS-04239__v1_T_170114_004935_4431 I got: SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:590)
Okay, I fixed that, sorry. I'll need to come up with a way to automatically regenerate proxies.
Dan, all Unified monitoring pages/json were moved, now we need to point at https://vocms049.cern.ch/unified/
@vlimant When I try to access the .json files on vocms049, I get a 403 (forbidden) error. I'm guessing your directories /var/www/html and /var/www/html/unified are not +x for the apache user. They are owned by root:zh and vlimant:zh, respectively, both with permissions flag 774.
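To spell out why mode 774 would produce a 403: a user who is neither the owner nor in the group falls under the "other" bits, and 774 gives "other" read permission but not execute, which is what is needed to traverse a directory. The sketch below only decodes the mode bits; it assumes the apache user is not a member of the `zh` group, which is a guess.

```python
import stat


def others_can_traverse(mode):
    """True if the 'other' class has the execute (search) bit set."""
    return bool(mode & stat.S_IXOTH)


dir_mode = stat.S_IFDIR | 0o774
print(stat.filemode(dir_mode))            # drwxrwxr--
print(others_can_traverse(dir_mode))      # False
print(others_can_traverse(stat.S_IFDIR | 0o775))  # True
```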
Okay, I have the SSO cookies working with #25. Sorry that took so long. all_errors.json is empty at the moment, but the site is back up.
Hi @areinsvo, I think we have the perfect workflow to take some actions with our tool. https://vocms0113.cern.ch:80/seeworkflow/?workflow=pdmvserv_task_EXO-RunIISummer15GS-03460__v1_T_170223_220648_3124 It contains reading problems, submit failures and unreported issues at T0, Florida and RWTH. The only choice we got is to try a plain ACDC round. I already sent the action https://vocms0113.cern.ch:80/getaction.
Thanks Paola. I think I just submitted the ACDCs on the two tasks in EXO-RunIISummer15GS-03460: areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170228_185246_6339 areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170228_185238_4716
Please let me know if something isn't right here.
Still on my to do list:
1) Implement xrootd setting (@prozober or @vlimant, what setting do I have to change in the schema for this? I can find it if I look harder, but if you know it that's probably easier)
2) Check for unreported errors on my end. If there are unreported errors for a task, do not change splitting or memory setting (i.e. "plain ACDC")
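A rough sketch of the second item, assuming the task's error codes are available as a collection of integers and that 'NotReported' shows up as error code -1, as mentioned earlier in this thread. The parameter key names `"memory"` and `"splitting"` are placeholders, not the real keys in the action JSON.

```python
NOT_REPORTED = -1  # 'NotReported' is listed as error code -1 in this thread


def effective_parameters(params, error_codes, tunable=("memory", "splitting")):
    """Return the parameters to apply to a task's ACDC.

    If the task has unreported errors, drop the tunable settings so the
    result is a plain ACDC; otherwise return the parameters unchanged.
    """
    if NOT_REPORTED not in error_codes:
        return dict(params)
    return {key: value for key, value in params.items() if key not in tunable}


requested = {"memory": "increase", "splitting": "same", "xrootd": "enabled"}
print(effective_parameters(requested, [-1, 8021]))
```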
It was missing last week because I changed and broke something. At the moment, 'NotReported' is back to error code -1 as I commented two weeks ago. Still not ideal, I think...
I asked @prozober if I also need to check for unreported errors in recoveror, and I took her response to mean yes, so that recoveror doesn't try to do anything illegal/dumb if there are unreported errors present. She can correct me if I misunderstood.
The other option would be to leave it up to the WTC to submit the right parameters in the case of unreported errors, and have recoveror not worry about it.
I am sorry I did not make myself clear. The WTC must decide what action to take for every unreported error. But of course, Unified must be aware of the unreported errors for every task.
The ACDCs look OK. Now, for the assignment:
Daniel, we need to include another parameter besides xrootd, secondary_xrootd. I usually do not modify that value, but it is better to take it into account. secondary_xrootd will behave as xrootd, but @areinsvo the parameter we need to change in the schema would be "TrustPUSitelists".
Thanks Paola! For this particular workflow, you didn't put anything in xrootd, so xrootd just wasn't present in the action JSON (so I don't change anything). I think that works fine. I'll edit the code to properly change TrustSitelists if the xrootd parameter is set.
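As a sketch of that edit: the mapping from `xrootd` to "TrustSitelists" and from `secondary_xrootd` to "TrustPUSitelists" follows the comments above, and the option values 'enabled' and 'disabled' are the ones used elsewhere in this thread. Writing booleans into the schema is an assumption, as is the function name.

```python
XROOTD_TO_SCHEMA = {
    "xrootd": "TrustSitelists",
    "secondary_xrootd": "TrustPUSitelists",
}


def apply_xrootd(schema, params):
    """Return a copy of `schema` with the trust flags set from `params`.

    Options that are absent from `params` leave the schema untouched.
    """
    updated = dict(schema)
    for option, schema_key in XROOTD_TO_SCHEMA.items():
        value = params.get(option)
        if value is None:
            continue  # not set in the action: change nothing
        if value not in ("enabled", "disabled"):
            raise ValueError("Unexpected %s value: %r" % (option, value))
        updated[schema_key] = (value == "enabled")  # boolean is an assumption
    return updated


print(apply_xrootd({"Memory": 2000}, {"xrootd": "enabled"}))
```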
I guess this shows my lack of knowledge about the details - how are ACDCs normally assigned? Do we want the same modified recoveror script to automatically create and assign the ACDCs? (I thought the script was already doing this, but it sets the status to assignment-approved, not assigned)
the current recoveror does both create and assign indeed, and I think we need to have it this way
Ah, I see that now. I didn't copy that part of the code over correctly for some reason. I'll fix the code ASAP. In the meantime, should we force the assignment of the two ACDCs I created earlier but didn't assign? Or delete those ACDCs and create them from scratch and assign them later today when my script is fixed?
you can reject the two acdc and remake-assign on that workflow
I rejected the two ACDCs and the script is ready to be tested, except for one thing: once the new ACDCs are created and assigned, how do I change the status of the workflow from assistance-manual to assistance-recovering? The normal recoveror uses assignSession, but that's what I'm trying to decouple from in this script (since this will be run manually, not as part of the normal Unified cycle). @vlimant?
If AC/DC are found, checkor will change the status accordingly. The synchronization between setting an action and running it needs to be thought through. If everything is run under 049, then it can stay with changing the status in the db directly; I don't see an issue with that.
What do you mean by "it can stay with changing the status in the db directly"? You mean the new recoveror (it really needs a new name...) can/should change the status directly?
For the test today, is it okay to just let checkor handle changing the status after I create and assign the ACDCs?
@prozober I'm catching up now. To answer your earlier question about unassigned xrootd, if you don't assign a value, it won't show up on the parameters at all. The best thing for @areinsvo to do in that case is something like:
    use_xrootd = response[prepID]['Parameters'].get('xrootd')
    if use_xrootd is None:
        # Not set in the action, so get the default value of xrootd
        use_xrootd = some_function_call(params)  # params: whatever is needed to get the recovery docs
    # More pythonic/less clear would be:
    # use_xrootd = response[prepID]['Parameters'].get('xrootd', some_function_call(params))
    if use_xrootd == 'enabled':
        pass  # Using enabled option
    elif use_xrootd == 'disabled':
        pass  # Using disabled option
    else:
        raise ValueError('Unexpected xrootd value: %r' % use_xrootd)  # Error handling
I also added the secondary option. I'll push at least that to the server if I don't get anything else working by the end of today.
@vlimant @prozober @areinsvo @mcremone
Since I don't really know everything the workflow team or Unified needs, feel free to make any comments or pull requests. We can also track the testing progress here.
From Jean-Roch:
although we have a unified wired to testbed
https://cmst2.web.cern.ch/cmst2/unified-testbed/
it might be simpler to have it wired to production and run this in "commissioning mode".
I have the feeling, looking at the example actions, that there are too many parameters passed down. Many of them should not be needed for recovery & clone (proc version, sites, lfn, ...) since all these can be taken from the original workflow and such. Let's tune this to what is actually needed.
We should modify the recoveror module to read the action json, and be able to operate it by hand. The way I see it for a fast integration is:
Viewing Actions
In order to view these actions from inside the CERN network, one can look at https://vocms0113.cern.ch:80/getaction. This shows actions submitted today. You can also pass a parameter "days" to look farther back. For example looking at https://vocms0113.cern.ch:80/getaction?days=20 will show some old testing actions.
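For scripts that need these URLs, a small helper can build them. This only constructs the address with the `days` parameter described above; it does not fetch anything, since access requires being inside the CERN network. The function name is hypothetical.

```python
from urllib.parse import urlencode

GETACTION_URL = "https://vocms0113.cern.ch:80/getaction"


def getaction_url(days=None):
    """URL for actions submitted today, or over the last `days` days."""
    if days is None:
        return GETACTION_URL
    return GETACTION_URL + "?" + urlencode({"days": int(days)})


print(getaction_url())
print(getaction_url(20))
```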
Changing Parameters
To make it easier for everyone to track and comment on parameters, they are generated with these variables here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/static/js/addreason.js#L145-L184 The variable params results in a "decrease" / "same" / "increase" table; texts and bools produce plain text fields and "true/false" fields, respectively; and the opts variable results in more general radio buttons.
The site list is generated here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/templates/workflowtables.html#L19-L27 The form field is then made here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/static/js/addreason.js#L231-L236