Create day-2 operations

arthurdm commented 4 years ago

As a value-add of the OL Operator (versus the Appsody Operator) we should investigate the use of specialized day-2 operations. Here are some examples. The names / kind may change, but there should be enough information below to start prototypes.

ActionLibertyTraceStart

input from user:

pod id (there could be multiple pods if replicas > 1, so we need to know which one)
size of PV (we will connect a PV to /logs). Optional, since it could already be bound
traceSpecification
maxFileSize
maxFiles

what the operator does:

Important: the memory of the container needs to be increased, otherwise the JVM can hit an OOM depending on the trace. Maybe this ought to be a user input.

*  oc exec -it <pod>
*  set the trace. Example:

 echo '<server><logging  traceSpecification="com.ibm.ws.webcontainer*=all:com.ibm.wsspi.webcontainer*=all:HTTPChannel=all:GenericBNF=all:HTTPDispatcher=all" traceFileName="trace.log" maxFileSize="20" maxFiles="10" traceFormat="BASIC" /></server>' >  /config/configDropins/overrides/<pod>_trace.xml
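A minimal sketch of how the drop-in above could be rendered from the user inputs, written in Python purely for illustration. The function names `render_trace_dropin` and `dropin_path` are hypothetical; the defaults of 20 and 10 and the fixed `traceFileName` / `traceFormat` values are copied from the example above, not from an actual operator implementation.

```python
from xml.sax.saxutils import quoteattr


def render_trace_dropin(trace_specification, max_file_size=20, max_files=10):
    """Render a Liberty config drop-in that turns trace on (hypothetical helper)."""
    attributes = {
        "traceSpecification": trace_specification,
        "traceFileName": "trace.log",
        "maxFileSize": str(max_file_size),
        "maxFiles": str(max_files),
        "traceFormat": "BASIC",
    }
    # quoteattr escapes each value and wraps it in quotes, so the output stays valid XML.
    rendered = " ".join(
        f"{name}={quoteattr(value)}" for name, value in attributes.items()
    )
    return f"<server><logging {rendered} /></server>"


def dropin_path(pod_name):
    """Path of the per-pod override file used in the example above."""
    return f"/config/configDropins/overrides/{pod_name}_trace.xml"
```

The operator would write the rendered string to `dropin_path(<pod>)` inside the container. Building the XML from a dictionary rather than by string concatenation keeps user-supplied trace specifications from breaking the file.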

The customer would then use the app, which generates trace, and stop the trace once enough has been gathered, using the action below:

ActionLibertyTraceStop

input from user:

pod id

what the operator does:

*  oc exec -it <pod>
*  rm /config/configDropins/overrides/<pod>_trace.xml
    *  optionally, tar the files (zip is not available in ubi8). The files are found inside /logs
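The stop steps could be expressed as the argument lists the operator would run against the pod. This is a sketch only: `trace_stop_commands` is a hypothetical name, the archive location under `/tmp` is an assumption, and the `oc exec <pod> -- <command>` form is used without `-it` because no terminal is needed.

```python
def trace_stop_commands(pod_name, archive=False):
    """Return the `oc exec` argument lists for stopping trace on one pod (sketch)."""
    dropin = f"/config/configDropins/overrides/{pod_name}_trace.xml"
    # Removing the drop-in is what switches the trace off again.
    commands = [["oc", "exec", pod_name, "--", "rm", dropin]]
    if archive:
        # Assumption: write the archive outside /logs so tar does not
        # try to include its own output file.
        target = f"/tmp/{pod_name}_trace.tar"
        commands.append(
            ["oc", "exec", pod_name, "--", "tar", "-cf", target, "-C", "/logs", "."]
        )
    return commands
```

Returning argument lists instead of a shell string avoids quoting problems if a pod name ever contains unexpected characters.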

ActionLibertyJVMDump

input from user:

pod id
size of PV (we will connect a PV to /opt/ol/wlp/output/defaultServer). Optional, since it could already be bound
dumpType: defaults to heap, but can be overridden to system or heap,system.

what the operator does:

*  oc exec -it <pod>
*  server dump --include=<dumpType>
*  file is then available at /opt/ol/wlp/output/defaultServer
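A small sketch of how the dumpType input might be validated before the command above is built. The helper name `build_dump_command` is hypothetical, and the accepted set is limited to the two types named in this proposal, which is an assumption rather than the full list the `server dump` command may support.

```python
KNOWN_DUMP_TYPES = ("heap", "system")  # only the types named in this proposal


def build_dump_command(dump_type="heap"):
    """Validate dumpType and return the dump command as an argument list."""
    requested = [part.strip() for part in dump_type.split(",") if part.strip()]
    if not requested:
        raise ValueError("dumpType must name at least one dump type")
    unknown = [part for part in requested if part not in KNOWN_DUMP_TYPES]
    if unknown:
        raise ValueError(f"unsupported dumpType value(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the order the user asked for.
    unique = list(dict.fromkeys(requested))
    return ["server", "dump", f"--include={','.join(unique)}"]
```

Rejecting unknown values up front lets the operator report a clear failure in the CR status instead of surfacing an error from inside the container.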

arturdzm commented 4 years ago

I would suggest having a single pair of Task / TaskRun CRDs, inside which we could have different definitions. For example, readinessProbe in Kubernetes can have a completely different set of values depending on the kind of probe.

arthurdm commented 4 years ago

There are probably pros and cons both ways.

At first a single LibertyOperation CRD seems attractive because it's simple to use / discover. But as we scale the number of operations we have (10+), there may be some operations that are not applicable or available in certain cluster environments, and it would then be hard to tell the user the conditions under which the embedded actions are applicable.

Multiple CRDs make it more complex to know which CR to compose, but allow for better specialization and environment-dependent installation / availability.

I think we could perhaps merge the two trace related operations into a single CRD, so we take a hybrid approach here.

kind: LibertyTraceOperation
traceEnabled: true  | false
traceSpecification: ...

arturdzm commented 4 years ago

That is exactly what we were discussing with @leochr on Friday. One of my proposals was:

Kind: LibertyAction
spec:
  podName: 
  otherCommonFields...
  action:
    trace:
        enable: true
        traceSpecification: 
        maxFileSize:
        maxFiles:
    serverDump:
        - heap
        - core
status:

operator-sdk makes it almost impossible to listen to multiple CRDs in a single controller, so it would require a controller per new CRD. A single CRD is a simpler overall solution.

The Action/ActionRun however might be useful if we wanted to have automatic action discoverability for tools.

arthurdm commented 4 years ago

I don't recommend we mix stateful actions (such as trace) with stateless / one-time actions (such as server dump). It creates confusion in the usage scenarios.

It's ok to have multiple controllers for these - actually, it becomes even more pluggable in terms of having controllers / CRDs that are environment specific - for example, a CRD that binds to a particular cloud provider (AWS, IBM, Google) storage, etc.

donbourne commented 4 years ago

Another common need for support teams is to get the logs from when the server started. At server startup, information about the Liberty and Java versions, along with any startup-time issues, is logged.

Perhaps getting that would be just a matter of getting oc logs for the pod. Could there be an action for that as well?

leochr commented 4 years ago

Thanks, Don, we'll give some thought to the startup scenario.

leochr commented 4 years ago

Update: I've got a prototype for trace operation working. Next, we need to add error handling and report the status of the operation as well as optimize/clean-up the code.

Each day-2 operation will report its status (started/completed/failed). Such information will be held inside the status field of the CR. Some of that information can be output for oc get openlibertytrace my-app-trace.

All events of the day-2 operation will be logged (e.g. "Enabled trace for pod abc"... "Stopped trace for pod abc"). Those events will be shown when oc describe openlibertytrace my-app-trace is run.

We can also add annotations/labels into the respective resources that the day2 operation is processing. For example, when the trace is enabled, we can add an annotation to the pod itself to reflect just that (e.g. trace.openliberty.io/status or trace.openliberty.io/enabled). Such information can be used by kAppNav to show the appropriate options (e.g. stop trace) in its console.
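To make the reporting idea concrete, here is a sketch of what the operator might record for one trace operation. Everything in it is illustrative: the `OperationStatus` class, its field names and the `pod_annotations` helper are assumptions, and the annotation key is one of the two candidates mentioned above.

```python
PHASES = ("started", "completed", "failed")
TRACE_ANNOTATION = "trace.openliberty.io/enabled"  # candidate key from the comment above


class OperationStatus:
    """In-memory stand-in for the status field of a day-2 operation CR."""

    def __init__(self):
        self.phase = "started"
        self.events = []

    def record(self, phase, message):
        """Move to a new phase and keep the message as an event."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.phase = phase
        # In a real operator these messages would be emitted as Kubernetes events.
        self.events.append(message)


def pod_annotations(trace_enabled):
    """Annotation a console could read to decide which action to offer."""
    return {TRACE_ANNOTATION: "true" if trace_enabled else "false"}
```

A console such as kAppNav would only need the annotation to choose between offering "start trace" and "stop trace", while the phase and events remain on the CR for `oc get` and `oc describe`.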

arthurdm commented 4 years ago

Thanks for the input @leochr - Once we have the status field and event logging implemented in the prototype it's probably worth posting them here to provide better visualization of the path we're taking.

That's an interesting thought about the kAppNav integration. One question for @cvignola is whether the kAppNav dashboard has the ability to drill down into an individual pod for a replicated microservice (e.g. a microservice portfolio with 3 replica sets) - since the day-2 operations are for a pod, not necessarily the entire replica set.

donbourne commented 4 years ago

@arthurdm I believe the link that kAppNav generates for Kibana includes all of the pods in the deployment -- but the Kibana dashboards certainly have the ability to narrow in to just see the logs from one pod.

cvignola commented 4 years ago

@arthurdm So yes, kAppNav has the ability to show individual pods if you add Pod to the componentKind list. We also have a podlist function we use at Deployment scope, which could be used to populate a pick list.

cvignola commented 4 years ago

@arthurdm @donbourne Yes, the query kAppNav generates for the Kibana dashboard URL for Liberty enumerates all pods belonging to the Deployment. Then, like Don said previously, the Kibana dashboard enables you to move around and narrow your view to a specific pod.

cvignola commented 4 years ago

@arthurdm @leochr @arturdzm As I shared with Arthur via slack, we have an opportunity for industry leadership if we solve a problem facing d2ops. The problem is maturity: d2ops don't have it.

Sorely lacking is the ability for d2ops to be discovered and introspected. It must be possible for higher order tools to be created that raise the abstraction level beyond a yaml interface.

Specifically, a tool (e.g. a UI) should be able to:

1) discover installed d2ops
2) be notified when d2ops are installed/uninstalled
3) be able to know which Kind the d2op applies to
4) be able to initiate a d2op against a specific instance
5) be able to determine the input parameters, types, optionality, and defaults
6) be able to process user-supplied input against provided validation rules
7) be able to specify optional "layout hints" for a UI
8) be able to determine when a d2op has completed
9) be able to know whether a d2op completed successfully or in failure
10) be able to find and access a 'd2op log' if applicable to the d2op to reveal details of its operation

cvignola commented 4 years ago

Toward addressing those requirements ...

1 and 2 are solvable using a controller that queries/listens for CRDs.
3 could be solved via convention: an operator could deploy its d2op Kinds. e.g.
```
kind: Liberty
spec: 
    d2ops: 
         - kind: LibertyAction
```

4 might need a more generic convention than, say, podName - e.g.


kind: LibertyAction
spec: 
      optarget: 
           - name: <instance name>
             kind: <instance kind>

5 could be done at least in part using CRD's OpenAPI spec. But I am not sure OpenAPI provides a way to define defaults and optionality.

6 and 7 would require the introduction of additional metadata, which could be added by convention:


kind: LibertyAction
spec: 
     interaction: 
           - parameter:  <parameter name from openapi spec>
             optional: true | false
             default: <default value>
             validation-rule: <regular expression?>
             layout-hint:   <I'm still thinking about this one ...>
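As an illustration of how a tool might consume such interaction metadata, the sketch below fills in defaults and applies the validation rule to user-supplied values. It assumes the rule is a regular expression that must match the whole value (the comment above leaves this open), and `apply_interaction` is a hypothetical name.

```python
import re


def apply_interaction(interaction, user_input):
    """Return validated parameters; raise ValueError on missing or invalid input."""
    resolved = {}
    for entry in interaction:
        name = entry["parameter"]
        if name in user_input:
            value = user_input[name]
        elif "default" in entry:
            value = entry["default"]
        elif entry.get("optional", False):
            continue  # optional and no default: leave the parameter unset
        else:
            raise ValueError(f"missing required parameter: {name}")
        rule = entry.get("validation-rule")
        # Assumption: validation-rule is a regex that must match the whole value.
        if rule is not None and re.fullmatch(rule, str(value)) is None:
            raise ValueError(f"invalid value for {name}: {value!r}")
        resolved[name] = value
    return resolved
```

A UI could run this before creating the CR, so that obviously bad input is rejected without a round trip to the cluster.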

8 and 9 would require some convention on the status field - e.g.


kind: LibertyAction
spec: 
status: 
   completion: <time stamp>
   success:  true | false
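With a status convention like this, a tool could classify any d2op CR without knowing its Kind. A minimal sketch, assuming the CR has already been parsed into a dictionary and that the `completion` and `success` fields are as proposed above; `operation_outcome` is a hypothetical name.

```python
def operation_outcome(resource):
    """Classify a d2op CR as 'running', 'succeeded' or 'failed' from its status."""
    status = resource.get("status") or {}  # an empty `status:` parses as None
    if not status.get("completion"):
        return "running"  # no completion time stamp yet
    return "succeeded" if status.get("success") is True else "failed"
```

Treating a missing `success` value as a failure is a deliberately conservative choice in this sketch; a real convention would need to say what an absent field means.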

10 I don't have a good idea for this yet, except maybe make it possible for a d2op that has a job log requirement to have the option of being run as a job.

cvignola commented 4 years ago

@arthurdm @arturdzm @leochr Do any of you guys know where/if CRD OpenAPI is documented? I have found examples of it being used, but no documentation for the schema anywhere.

leochr commented 4 years ago

@cvignola Some information is documented here: https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation and https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.0.md#schemaObject

arthurdm commented 4 years ago

thanks for the feedback @cvignola

For item 3 I propose that we add an annotation to OpenLibertyApplication's CRD which lists the day2operations it supports. Eg:

metadata:
  name: openlibertyapplications.openliberty.io
  annotations:
    day2operations: OpenLibertyTrace, OpenLibertyDump

For item 4 we could similarly add annotations to the CRDs of the actions, this way the user (working with a CR) doesn't have to specify it.
```
metadata:
  name: openlibertytraces.openliberty.io
  annotations:
    targetKinds: Pod
```
For items 5 & 6 we can add these things to the OAS3 Schema of the CRDs.

I believe a good goal to have is: bake as much "tools helper" information / metadata / schema as we can into the CRDs, and keep the CR (for users) short and optimized.
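A sketch of how a tool could use these annotations for discovery. It works on CRD manifests that have already been loaded into dictionaries, so it needs no cluster access. The annotation keys `day2operations` and `targetKinds` come from the proposal above, while the function names are hypothetical.

```python
def parse_list_annotation(crd, key):
    """Split a comma-separated annotation on a CRD manifest into a list of names."""
    annotations = (crd.get("metadata") or {}).get("annotations") or {}
    return [item.strip() for item in annotations.get(key, "").split(",") if item.strip()]


def operations_for_kind(app_crd, operation_crds, target_kind):
    """List advertised operations whose own CRD says they apply to target_kind."""
    advertised = parse_list_annotation(app_crd, "day2operations")
    matches = []
    for crd in operation_crds:
        kind = crd["spec"]["names"]["kind"]  # standard location of the Kind in a CRD
        if kind in advertised and target_kind in parse_list_annotation(crd, "targetKinds"):
            matches.append(kind)
    return matches
```

Because both lists live on the CRDs, a user's CR stays short: it never has to repeat which Kind the operation targets.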

cvignola commented 4 years ago

@arthurdm I concur with your points in https://github.com/OpenLiberty/open-liberty-operator/issues/47#issuecomment-559875437

leochr commented 4 years ago

Delivered dump and trace day-2 operations. Documentation is here (including the operation discovery mechanisms discussed above): https://github.com/OpenLiberty/open-liberty-operator/blob/master/doc/user-guide.md#day-2-operations
