jstrachan opened 8 years ago
As discussed over IRC - just trying to brainstorm the env and runtime we can use for this:
Env: fabric8 - with Elasticsearch plus Grafana, and Logstash for grokking
We could start with a simple project that has its repo on GitHub; the app gets deployed on fabric8, and when people start using it, exceptions show up in the logs -- we then grok them and raise a ticket on the project in GitHub.
A couple of queries to start with:
Right now our fabric8 platform does not have an option to set up centralised logging - how do we do that?
Do we first need some mechanism like Helm charts or a fabric8-devops module that will set up centralised logging? (One thing I hear from my customers is that they have a hard time setting up centralised logging, so we should do something along those lines if we don't have it already.)
More queries to come .. ;)
@kameshsampath fabric8 has a package called `logging` which you can run via the `Run...` button on any of the `Runtime` views; it installs Elasticsearch, Kibana and Fluentd to capture all logs of all Docker containers into Elasticsearch. The pipelines will also capture build, deploy and approve events too.
There's also a helm chart of the same thing if you prefer: http://fabric8.io/helm/
Or if you want to use the `kubectl` or `oc` tools, you can apply the yaml/json directly via `apply`:
http://fabric8.io/manifests/kubernetes.html
http://fabric8.io/manifests/openshift.html
The actual versions + kubernetes/openshift yaml/json and helm charts are all here: http://central.maven.org/maven2/io/fabric8/devops/packages/logging/
So once you have Elasticsearch running, you can then start writing a little microservice to browse the logs in Elasticsearch looking for patterns (e.g. Java stack traces or the text ".Exception" or whatever). You could even use Elasticsearch to checkpoint where you are in the database, so that if your pod restarts it can query on startup where it got to; so if you have petabytes of data, you checkpoint every few GB in case your pod is restarted.
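To make that concrete, here's a minimal sketch of such a scan. It assumes Elasticsearch is reachable on localhost:9200 and that fluentd writes container logs into `logstash-*` indices with `log` and `@timestamp` fields (adjust host, index pattern and field names to your actual mapping); it uses only the plain JDK HTTP client rather than any particular Elasticsearch client library:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExceptionScanner {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // "lastSeen" would normally be loaded from a checkpoint document;
        // here it is just a hard-coded starting point.
        String lastSeen = "2016-01-01T00:00:00Z";

        // Search for log lines that look like Java exceptions, oldest-first in batches of 50,
        // only newer than the last checkpoint so a restarted pod can resume where it left off.
        String query = "{"
            + "\"size\": 50,"
            + "\"sort\": [{\"@timestamp\": \"asc\"}],"
            + "\"query\": {\"bool\": {\"must\": ["
            + "  {\"match\": {\"log\": \"Exception\"}},"
            + "  {\"range\": {\"@timestamp\": {\"gt\": \"" + lastSeen + "\"}}}"
            + "]}}}";

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/logstash-*/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

        // In the real microservice you would parse the hits, extract the stack traces,
        // write an Errors document per exception and update the checkpoint.
        System.out.println(response.body());
    }
}
```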
@jstrachan - thanks, I just did that - though I did not deploy the `logging` package, I deployed the ES and fluentd from fabric8-devops to my fabric8 env. Though I did not do Kibana - let me do that as well as a package.
Right now I am trying to use my existing application to send logs to `fluentd`, but I'm confused: do we need to add those log-driver appenders in our app?
As a first step, I am planning to ensure that my app logs get into Elasticsearch - though right now I don't have petabytes ;) just some KBs/MBs for now - I will get back to the checkpoint stuff once we have our basic microservice working, querying, and getting us results.
Next I will develop the microservice that can query ES to get docs for some pattern we define, and make the microservice configurable via ConfigMap.
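For the ConfigMap part, a minimal sketch of how the pattern list could be read - both the environment variable name and the mount path below are hypothetical (a ConfigMap can be exposed to the pod either way):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class BugHunterConfig {

    public static List<String> loadPatterns() throws Exception {
        // Prefer the env var form (valueFrom/envFrom in the Deployment yaml)...
        String fromEnv = System.getenv("BUG_HUNTER_PATTERNS");
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return List.of(fromEnv.split(","));
        }
        // ...and fall back to a file mounted from the ConfigMap volume.
        Path mounted = Paths.get("/etc/bug-hunter/patterns");
        if (Files.exists(mounted)) {
            return Files.readAllLines(mounted);
        }
        // Default pattern if nothing is configured.
        return List.of("Exception");
    }
}
```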
Since you guys are at the F2F I'm not sure you will be on IRC in case we want to discuss something, but for now let me get the above-mentioned things working.
I'm not at the Fuse F2F so will be on IRC ;)
There is no need to change your app to have its logs captured - to be a good logging citizen on kubernetes just log to stdout - that's it. Fluentd then captures all logs for all containers on each host (fluentd runs as a DaemonSet so there is a fluentd pod per host)
So your microservice should be able to monitor the logs of all fabric8 microservices straight away! ;)
ok great! that's something I want to get clarified.. btw are our latest f8-devops snapshots in Sonatype? I'm getting some issue while building our kibana devops.. let's discuss more when you are on IRC
I'd just run `logging` via the `Run...` menu in the fabric8 console TBH; then it'll use the last released package.
created this repo https://github.com/kameshsampath/fabric8-bug-hunter which has the PoC code.
`multiline` log entries are not yet handled, hence Java stack traces are shown as multiple docs during query. Ref:
https://bugzilla.redhat.com/show_bug.cgi?id=1294168
https://github.com/openshift/origin-aggregated-logging/issues/28
A simple approach to processing the exceptions is to collect them and store them in a separate Errors index of Elasticsearch, and also checkpoint periodically where bug-hunter gets to (timestamps) in the log index, so that on restart it can start from its last checkpoint and keep processing. We need to make sure that bug-hunter can handle re-processing the same data without generating new errors for the same logs though (so maybe ensure we add the `_id` value for each exception into the Error object), so that we don't raise false positives.
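A minimal sketch of keeping the checkpoint in Elasticsearch itself; the index name `bughunter-checkpoint` and the fixed document id `latest` are hypothetical choices, not taken from the repo. On startup, bug-hunter would read this document back and resume querying logs newer than the stored timestamp:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Checkpoint {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String URL = "http://localhost:9200/bughunter-checkpoint/state/latest";

    /** Store the timestamp of the last processed log document. */
    public static void save(String lastTimestamp) throws Exception {
        HttpRequest put = HttpRequest.newBuilder(URI.create(URL))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(
                "{\"lastTimestamp\": \"" + lastTimestamp + "\"}"))
            .build();
        HTTP.send(put, HttpResponse.BodyHandlers.ofString());
    }

    /** Read the checkpoint back (returns the raw JSON; parse as needed). */
    public static String load() throws Exception {
        HttpRequest get = HttpRequest.newBuilder(URI.create(URL)).GET().build();
        return HTTP.send(get, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```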
To make it easier to visualise things over time we may want to use the time that the error happened as part of the key of the index; e.g. we could have a daily Errors index which counts how many of each kind of error occur (along with a link to the actual error log `_id`) for nice reporting.
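A minimal sketch of that idea, under the same local-Elasticsearch assumption as above; the index/type layout `errors-YYYY.MM.DD/error` is hypothetical (on ES 7+ the type segment would be `_doc`). Using the log document's own `_id` as the error document id makes the write idempotent, so re-processing the same logs after a restart does not raise duplicate errors:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class ErrorIndexer {

    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void indexError(String logId, String exceptionClass, String timestamp)
            throws Exception {
        // Daily index so reporting/visualisation can count errors per day.
        String day = LocalDate.parse(timestamp.substring(0, 10))
                .format(DateTimeFormatter.ofPattern("yyyy.MM.dd"));

        String doc = "{"
            + "\"logId\": \"" + logId + "\","          // link back to the raw log document
            + "\"exception\": \"" + exceptionClass + "\","
            + "\"@timestamp\": \"" + timestamp + "\"}";

        // PUT with an explicit id: writing the same error twice just overwrites the doc.
        HttpRequest put = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/errors-" + day + "/error/" + logId))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(doc))
            .build();

        HttpResponse<String> resp = HTTP.send(put, HttpResponse.BodyHandlers.ofString());
        System.out.println("indexed error " + logId + " -> " + resp.statusCode());
    }
}
```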
ok @jstrachan - let's push the info back to ES with our custom JSON format (built around our model). Wondering if we can reuse anything from collector-utils... will check for it
@jstrachan - can you please review this https://github.com/kameshsampath/fabric8-bug-hunter when you have time.
`SCM Revision`, `SCM Branch` added, with placeholders for `issue-tracker-url`, `project-url` etc. WIP: `bughunter`.
@jstrachan - check this new WIP branch - https://github.com/kameshsampath/fabric8-bug-hunter/tree/more-rx; I have moved everything to a Vert.x service-based approach - the service is now available over the Vert.x event bus, so applications that need it can reuse it.
This now extracts the data successfully; next I will be saving it to the `bughunter` index. Please ping me on IRC to discuss further.
Will improve it further with an SPI-based approach later.
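As an illustration of the event-bus approach mentioned above, a minimal sketch of a verticle exposing the service; the address `bughunter.scan` and the request/reply shape are made up for the example, not taken from the repo:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;

public class BugHunterVerticle extends AbstractVerticle {

    @Override
    public void start() {
        // Other verticles/applications can send a JSON request to this address
        // and get back the extracted exception data as a JSON reply.
        vertx.eventBus().<JsonObject>consumer("bughunter.scan", message -> {
            String pattern = message.body().getString("pattern", "Exception");
            // ... query Elasticsearch for the pattern and build the result ...
            JsonObject result = new JsonObject().put("pattern", pattern).put("hits", 0);
            message.reply(result);
        });
    }

    public static void main(String[] args) {
        Vertx.vertx().deployVerticle(new BugHunterVerticle());
    }
}
```

A caller would then send a `JsonObject` to `bughunter.scan` with a reply handler and receive the result, without depending on the service's implementation.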
@jstrachan - saving of the extracted data to the `bughunter` index is done
https://gist.github.com/kameshsampath/62c011ec70e5ec844b3992277e323036
the master https://github.com/kameshsampath/fabric8-bug-hunter has the latest changes.
WIP
@kameshsampath have you looked at fluent-plugin-detect-exceptions? Tagging exceptions (or annotating with some other metadata) as bugs might make finding them in ElasticSearch easier.
So this looks awesome http://squash.io
Turns exceptions in apps into issues, using git blame to try to figure out who broke it.
Though I wonder if, rather than adding client libraries to all apps, we could just analyse the centralised logs instead to find exceptions? Then the same exception in many logs over many pods could be associated with the same underlying issue.
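One way to do that association: hash the exception class plus the top few stack frames into a fingerprint and use it as the grouping key. A minimal sketch; the choice of five frames and SHA-1 is arbitrary, not part of any existing code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class ExceptionFingerprint {

    public static String fingerprint(String exceptionClass, List<String> stackFrames) throws Exception {
        StringBuilder key = new StringBuilder(exceptionClass);
        // Only the top frames: details deeper in the trace change too often
        // between versions to make a stable grouping key.
        stackFrames.stream().limit(5).forEach(frame -> key.append('\n').append(frame.trim()));

        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(key.toString().getBytes(StandardCharsets.UTF_8));

        // Hex-encode the digest so it can be used as a document id or issue key.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```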
Also rather than making a custom workflow UI, could we just raise issues?
Hopefully we can start associating versions/deployments with issues fixed so that if an exception comes back we could reopen an old issue?