KTH / devops-course

Repository of the DevOps course at KTH Royal Institute of Technology DD2482

Monitoring, tracing, observability in DevOps #8

Open monperrus opened 6 years ago

monperrus commented 6 years ago
monperrus commented 5 years ago

See also Icinga (thanks to @henriklb for the suggestion)

monperrus commented 5 years ago

Log analysis @Eclipse https://projects.eclipse.org/projects/tools.tracecompass

MatsJonsson commented 5 years ago

We've found Istio ( https://istio.io/ ) to be increasingly useful in this context. KubeSpy ( https://github.com/pulumi/kubespy ) is an excellent tool for troubleshooting and diagnosing Kubernetes deployments.

lsc commented 5 years ago
MatsJonsson commented 5 years ago

+1 for Prometheus

bittermandel commented 5 years ago

Sentry for Error Reporting. https://sentry.io/welcome/

monperrus commented 5 years ago

(from https://github.com/KTH/devops-course/issues/16#issue-371440053)

monperrus commented 5 years ago

See also Runtime application self-protection https://github.com/KTH/devops-course/issues/18#issuecomment-435888119

monperrus commented 5 years ago

Analytics

monperrus commented 5 years ago

Tools and Benchmarks for Automated Log Parsing. http://arxiv.org/abs/1811.03509
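
To make the idea of log parsing concrete: the goal is to turn free-text log lines into a small set of templates with the variable parts masked out. Below is a deliberately naive sketch of that step, using regular-expression masking. It is not one of the parsers benchmarked in the paper; the `MASKS` rules, the function names and the sample log lines are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules: each pattern describes a token shape
# that we assume to be a variable rather than part of the template.
MASKS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IPv4 addresses
    re.compile(r"\b0x[0-9a-fA-F]+\b"),           # hexadecimal values
    re.compile(r"\b\d+\b"),                      # plain integers
]


def to_template(message: str) -> str:
    """Replace variable-looking tokens with the <*> wildcard."""
    for pattern in MASKS:
        message = pattern.sub("<*>", message)
    return message


def count_templates(lines):
    """Group log lines by their template and count each group."""
    return Counter(to_template(line) for line in lines)


# Invented sample log lines.
logs = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 87 ms",
    "Disk usage at 91 percent",
]
templates = count_templates(logs)
for template, count in templates.items():
    print(count, template)
```

Fixed masking rules like these break down quickly on real logs, which is presumably why dedicated parsers and benchmarks exist.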

monperrus commented 5 years ago

Does the Fault Reside in a Stack Trace? Assisting Crash Localization by Predicting Crashing Fault Residence https://www.sciencedirect.com/science/article/pii/S0164121218302401

monperrus commented 5 years ago

Having good dashboards is essential in DevOps, see Kibana, etc.

monperrus commented 5 years ago

Made in Alibaba: https://github.com/alibaba/Sentinel

monperrus commented 5 years ago

JVM Profiler Sending Metrics to Kafka (https://kafka.apache.org/), Console Output or Custom Reporter https://github.com/uber-common/jvm-profiler

monperrus commented 5 years ago

https://github.com/madflojo/automatron

monperrus commented 5 years ago

https://github.com/apache/incubator-skywalking

monperrus commented 5 years ago

Time-series database to store monitoring data https://en.wikipedia.org/wiki/Time_series_database
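
As a rough illustration of what such a database does at its core: it appends timestamped samples to named series and answers range queries over time. The sketch below is a toy in-memory version; the class `TinyTimeSeriesStore` and its methods are hypothetical names, and a real time-series database adds persistence, compression, labels and retention on top.

```python
import bisect
from collections import defaultdict


class TinyTimeSeriesStore:
    """In-memory store: one sorted list of (timestamp, value) per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, timestamp, value):
        # insort keeps each series ordered even if samples arrive out of order.
        bisect.insort(self._series[name], (timestamp, value))

    def query(self, name, start, end):
        """Return all samples with start <= timestamp <= end."""
        points = self._series.get(name, [])
        lo = bisect.bisect_left(points, (start, float("-inf")))
        hi = bisect.bisect_right(points, (end, float("inf")))
        return points[lo:hi]


store = TinyTimeSeriesStore()
store.append("cpu_load", 10, 0.5)
store.append("cpu_load", 30, 0.9)
store.append("cpu_load", 20, 0.7)  # arrives late, still stored in order
recent = store.query("cpu_load", 15, 30)
print(recent)
```

Keeping each series sorted by timestamp is what makes the range query a pair of binary searches instead of a full scan.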

monperrus commented 5 years ago

Prometheus - Monitoring system & time series database https://prometheus.io/

monperrus commented 5 years ago

Netflix Zuul is a gateway service that provides dynamic routing, monitoring, resiliency, security, and more. https://github.com/Netflix/zuul

monperrus commented 5 years ago

OpenTracing https://opentracing.io/

monperrus commented 5 years ago

Nagios https://en.wikipedia.org/wiki/Nagios

monperrus commented 5 years ago

Sensu is a free and open source monitoring tool that handles cloud environments. Sensu allows you to monitor servers, services, application health, and business KPIs. https://xebialabs.com/technology/sensu/

bbaudry commented 5 years ago

Provenance analysis tools

monperrus commented 5 years ago

Framework for instruction-level tracing and analysis of program executions http://static.usenix.org/event/vee06/full_papers/p154-bhansali.pdf

monperrus commented 5 years ago

DevOps Metrics https://queue.acm.org/detail.cfm?id=3182626

monperrus commented 5 years ago

Dapper, a large-scale distributed systems tracing infrastructure at Google http://research.google.com/pubs/pub36356.html
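
The central data model in the Dapper paper is the trace tree: every unit of work is a span, and each span carries a trace id, its own span id and the id of its parent span. A minimal sketch of that bookkeeping follows. The `Tracer` and `Span` classes are hypothetical and single-threaded, and they leave out timestamps, annotations, sampling and cross-process propagation, so this is not how Dapper or any real tracer is implemented.

```python
import itertools
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional

_ids = itertools.count(1)  # simple id source; real tracers use random ids


@dataclass
class Span:
    trace_id: int
    span_id: int
    parent_id: Optional[int]
    name: str


class Tracer:
    """Tracks the currently open spans on a stack to derive parent links."""

    def __init__(self):
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            # A child inherits the trace id; a root span starts a new trace.
            trace_id=parent.trace_id if parent else next(_ids),
            span_id=next(_ids),
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(span)
        try:
            yield span
        finally:
            self._stack.pop()
            self.finished.append(span)


tracer = Tracer()
with tracer.span("handle_request") as root:
    with tracer.span("query_database") as child:
        pass  # the traced work would happen here

for s in tracer.finished:
    print(s)
```

With the parent ids recorded, a backend can rebuild the whole call tree of a request from the flat list of finished spans.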

monperrus commented 5 years ago

Chaos Engineering & Observability https://www.infoq.com/news/2019/03/chaos-engineering-observability

monperrus commented 5 years ago

Humio: All of your data: logs, metrics, traces. Search, analyze and visualize instantly. Live system observability. https://humio.com/

monperrus commented 5 years ago

The OpenTracing project https://opentracing.io/

monperrus commented 5 years ago

Papers:

veggiemonk commented 5 years ago

I cannot recommend Ben Sigelman enough

https://www.infoq.com/presentations/google-microservices

Ex-Google; he founded his company based on what he learned there. Must watch.

monperrus commented 5 years ago

Honeycomb is a tool for introspecting and interrogating your production systems. https://www.honeycomb.io/

monperrus commented 5 years ago

LightStep answers questions and diagnoses anomalies at scale, spanning mobile, monoliths, and microservices https://lightstep.com/

monperrus commented 5 years ago

Datadog: https://www.datadoghq.com/

monperrus commented 5 years ago

Article: New distributed tracing API completes the feedback loop https://www.theserverside.com/feature/New-distributed-tracing-API-completes-the-feedback-loop

bbaudry commented 5 years ago

Flame graphs and perf-top for JVMs inside Docker containers http://www.batey.info/docker-jvm-flamegraphs.html

monperrus commented 5 years ago

Synthetic Kubernetes cluster monitoring with Kuberhealthy https://opensource.com/article/19/4/kuberhealthy

monperrus commented 5 years ago

Course notes on monitoring: https://www.monperrus.net/martin/monitoring.pdf

monperrus commented 5 years ago

Kiali project, observability for the Istio service mesh (thx @DokID) https://github.com/kiali/kiali

bbaudry commented 4 years ago

OpenMetrics, for transmitting metrics at scale https://openmetrics.io/

monperrus commented 4 years ago

Learning Chaos Engineering and Chaos toolkit on katacoda: https://www.katacoda.com/chaostoolkit

monperrus commented 4 years ago

Contemporary Software Monitoring: A Systematic Literature Review https://arxiv.org/abs/1912.05878

gluckzhang commented 4 years ago

A curated list of Chaos Engineering resources. https://github.com/dastergon/awesome-chaos-engineering/

gluckzhang commented 4 years ago

Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.

https://www.gartner.com/smarterwithgartner/the-io-leaders-guide-to-chaos-engineering/

monperrus commented 3 years ago

Contemporary Software Monitoring: A Systematic Mapping Study. http://arxiv.org/pdf/1912.05878

monperrus commented 2 years ago

Cilium - eBPF-based Networking, Observability, and Security. Cilium's control plane is highly optimized, running in Kubernetes clusters of up to 5K nodes and 100K pods. https://cilium.io/

monperrus commented 2 years ago

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. Can be used for monitoring events. Can be bridged with MQTT. https://aws.amazon.com/kinesis/data-streams/

monperrus commented 2 years ago

Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.

Can be used to feed Prometheus.

https://micrometer.io/
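
The "SLF4J, but for metrics" idea can be sketched independently of Micrometer: application code talks to one small interface, and the monitoring backend behind it can be swapped. The Python sketch below only illustrates that facade pattern. Micrometer itself is a JVM library, and `MeterRegistry`, `InMemoryRegistry`, `PrintingRegistry` and `handle_order` as written here are hypothetical names, not Micrometer's API.

```python
from abc import ABC, abstractmethod


class MeterRegistry(ABC):
    """Backend-neutral interface that application code depends on."""

    @abstractmethod
    def increment(self, name: str, amount: float = 1.0) -> None:
        ...


class InMemoryRegistry(MeterRegistry):
    """Backend that keeps counters in a dict, e.g. for tests."""

    def __init__(self):
        self.counters = {}

    def increment(self, name, amount=1.0):
        self.counters[name] = self.counters.get(name, 0.0) + amount


class PrintingRegistry(MeterRegistry):
    """Backend that just prints; stands in for a real monitoring system."""

    def increment(self, name, amount=1.0):
        print(f"counter {name} += {amount}")


def handle_order(registry: MeterRegistry) -> None:
    # The application only knows the facade, never the concrete backend.
    registry.increment("orders_processed")


registry = InMemoryRegistry()
handle_order(registry)
handle_order(registry)
print(registry.counters)

handle_order(PrintingRegistry())  # same application code, different backend
```

Switching monitoring systems then means changing which registry is constructed, not touching the instrumented code.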

gluckzhang commented 2 years ago

Prometheus client libraries (including both official ones and many third-party ones) can be found here: https://prometheus.io/docs/instrumenting/clientlibs/
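
For intuition about what these client libraries produce: a Prometheus target serves a plain-text page with `# HELP` and `# TYPE` comment lines followed by one sample per line. The sketch below renders such a page for a single counter using only the standard library. `render_counter`, the metric name and the sample values are invented for illustration, label values are not escaped, and a real service should use one of the client libraries instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed not to need escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Invented sample data.
page = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"),): 3, (("method", "POST"),): 1},
)
print(page, end="")
```

A server would return this text from its metrics endpoint, and Prometheus would scrape it at a regular interval.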

monperrus commented 2 years ago

Paper: "Enjoy your observability: an industrial survey of microservice tracing and analysis" http://link.springer.com/10.1007/s10664-021-10063-9