jstrachan commented 8 years ago

Turns exceptions in apps into issues; using git blame to try figure out who broke it.

Though I wonder if rather than adding client libraries to all apps; we just analyse the centralised logs instead to find exceptions? Then the same exception in many logs over many pods can be associated with the same underlying issue?

Also rather than making a custom workflow UI, could we just raise issues?

Hopefully we can start associating versions/deployments with issues fixed so that if an exception comes back we could reopen an old log?

kameshsampath commented 7 years ago

as discussed over IRC -

we can make the raising issues optional, trigger only alerts
way to query stack traces <--> issues

kameshsampath commented 7 years ago

Just trying to brainstorm the env and runtime we can use for this:

Env: fabric8 - with elasticsearch with Grafana , logstash for grokking

we could start with a simple project - that has its repo on github, the app gets deployed on fabric8, and when people start using it they find some exceptions on the logs -- then we grok it and raise the ticket on the project in github

Couple of queries to start with :

right now our fabric8 platform does not have option to setup centralised logs, how to do that ?
Do we first need to have some mechanism like helm charts or fabric8-devops module that will setup a centralised logging ? ( one thing i got from my customers is that they have big time in setting the centralised logging - so we should so something on that line if we don't have one )

More queries to come .. ;)

jstrachan commented 7 years ago

@kameshsampath fabric8 has a package called logging which you can run via the Run... button on any of the Runtime views that installs elasticsearch, kibana and fluentd to capture all logs of all docker containers into elasticsearch; then pipelines will also capture build, deploy and approve events too.

There's also a helm chart of the same thing too if you prefer: http://fabric8.io/helm/

or if you want to use the kubectl or oc tools you can apply the yaml/json directly too via apply: http://fabric8.io/manifests/kubernetes.html http://fabric8.io/manifests/openshift.html

The actual versions + kubernetes/openshift yaml/json and helm charts are all here: http://central.maven.org/maven2/io/fabric8/devops/packages/logging/

So once you have elasticsearch running, you can then start writing a little microservice to browse the logs in Elasticsearch looking for patterns (e.g. java stack traces or the text ".Exception" or whatever). You could even use Elasticsearch to checkpoint where you are in the elasticsearch database so if your pod restarts, it can query on startup where it got to; so if you have peta-bytes of data, you checkpoint every few Gb in case your pod is restarted

kameshsampath commented 7 years ago

@jstrachan - thanks, i just did that - though i did not deploy the package logging, deployed the ES and fluentd from fabric8-devops to my fabric8 env. Though i did not do Kibana - let me do that as well as a package

Right now I am trying to use my existing application to send logs to fluentd, confused on do we need to add those log-driver appenders in our app ?

As a first step: i am planning to ensure that my app logs get into elastic search - though right now i don't have peta-bytes ;) .. some KBs/MBs for now - will get back to this checpoint stuff once we have our basic microservice working and query and getting us the results.

Next I will develop the microservice that can query the ES to get docs for some pattern we define and add make the microservice configurable via ConfigMap.

Since you guys are in F2F not sure you will be on IRC, in case we want to discuss something, but for now let me get the above mentioned things working.

jstrachan commented 7 years ago

I'm not at the Fuse F2F so will be on IRC ;)

There is no need to change your app to have its logs captured - to be a good logging citizen on kubernetes just log to stdout - that's it. Fluentd then captures all logs for all containers on each host (fluentd runs as a DaemonSet so there is a fluentd pod per host)

So your microservice should be able to monitor the logs of all fabric8 microservices straight away! ;)

kameshsampath commented 7 years ago

ok great! thats something i want to get clarified .. btw is our latest f8-devops snapshots in sonatype ?? I get some issue while building our kibana devops .. lets discuss more when you are in IRC

jstrachan commented 7 years ago

I'd just run logging via the Run... menu in the fabric8 console TBH; then it'll use the last released package

kameshsampath commented 7 years ago

created this repo https://github.com/kameshsampath/fabric8-bug-hunter which has the PoC code.

Whats complete:

Query and get results for a pattern (Exception) on particular namespace, app group - Query Configurable via ConfigMap
Scheduled run of the hunt - configurable via ConfigMap

TODO

What to do with results ? Strategy on persistence
How to avoid repeated exceptions not searched and processed again
How to integrate the results with Pipeline - UI?
Other ideas that might come up based on design of above mentioned points

Issues

Fluentd image that is deployed does not have formatting for multiline hence Java stack traces are shown as multiple docs during Query Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1294168 https://github.com/openshift/origin-aggregated-logging/issues/28

jstrachan commented 7 years ago

A simple approach to processing the exceptions is to collect them and store them in a separate Errors index of elasticsearch and also checkpoint periodically where bug-hunter gets to (timestamps) in the log index; so that on restart it can start from its last checkpoint and keep processing. We need to make sure that bug-hunter can handle re-processing the same data without generating new errors for the same logs though (so maybe ensuring we add the _id value for each exception into the Error object; so that we don't raise false positives.

To make it easier to visualise things over time we maybe want to use the time that the error happened as part of the key of the index; e.g. so we could have a daily Errors index which counts how many of each kind of errors occur (along with a link to the actual error log _id) for nice reporting.

kameshsampath commented 7 years ago

ok @jstrachan - lets push the info back on a to es with our custom json format (built around our model). wondering if we can reuse any thing from collector-utils... will check for it

kameshsampath commented 7 years ago

@jstrachan - can you please review this https://github.com/kameshsampath/fabric8-bug-hunter when you have time.

Got basic data model in place
Hits converted into custom datamodel
K8s application medata like SCM Revision, SCM Branch added , with place holders for issue-tracker-url, project-url etc.,

WIP

saving this extracted data to ES in seperate index called bughunter
avoid extracting and indexing same data again
adding checkpoint ( need some more details on this)
place where to hook and get this data. Grafana ???
integration with pipeline to propose sharing options like gist, add as issue to issue-tracker
ability to ignore resources - need to build some kind of intelligence to the system which can ignore the certain exceptions/patterns
how to build the count logic

kameshsampath commented 7 years ago

@jstrachan - check this new WIP branch - https://github.com/kameshsampath/fabric8-bug-hunter/tree/more-rx, i have moved everything to vert.x service based approach - now this service is available over vert.x event bus - so applications that might need them can reuse.

This now has extracted data successfully, how will be saving them to bughunter index. Please ping me in IRC to discuss further

Will improve it further for a SPI based approach later.

kameshsampath commented 7 years ago

@jstrachan - saving of the extracted data to bughunter index is done https://gist.github.com/kameshsampath/62c011ec70e5ec844b3992277e323036

the master https://github.com/kameshsampath/fabric8-bug-hunter has the latest changes.

WIP

how to build the count logic
avoid extracting and indexing same data again
adding checkpoint ( need some more details on this)
place where to hook and get this data. Grafana ???
integration with pipeline to propose sharing options like gist, add as issue to issue-tracker
ability to ignore resources - need to build some kind of intelligence to the system which can ignore the certain exceptions/patterns

StevenACoffman commented 6 years ago

@kameshsampath have you looked at fluent-plugin-detect-exceptions? Tagging exceptions (or annotating with some other metadata) as bugs might make finding them in ElasticSearch easier.

fabric8io / fabric8

Create a squash.io like tool to detect exceptions in logs and associate them with issues & to try deduce the git blame #6161

Whats complete:

TODO

Issues