monperrus opened this issue 6 years ago (status: Open)
Log analysis @Eclipse https://projects.eclipse.org/projects/tools.tracecompass
We've found Istio (https://istio.io/) to be increasingly useful in this context. KubeSpy (https://github.com/pulumi/kubespy) is an excellent tool for troubleshooting and diagnosing Kubernetes deployments.
+1 for Prometheus
Sentry for Error Reporting. https://sentry.io/welcome/
See also Runtime application self-protection https://github.com/KTH/devops-course/issues/18#issuecomment-435888119
Analytics
Tools and Benchmarks for Automated Log Parsing. http://arxiv.org/abs/1811.03509
Does the Fault Reside in a Stack Trace? Assisting Crash Localization by Predicting Crashing Fault Residence https://www.sciencedirect.com/science/article/pii/S0164121218302401
Good dashboards are essential in DevOps; see Kibana, for example.
Made in Alibaba: https://github.com/alibaba/Sentinel
JVM Profiler Sending Metrics to Kafka (https://kafka.apache.org/), Console Output or Custom Reporter https://github.com/uber-common/jvm-profiler
Time-series database to store monitoring data https://en.wikipedia.org/wiki/Time_series_database
Prometheus - Monitoring system & time series database https://prometheus.io/
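To make the time-series idea concrete, here is a minimal sketch of what such a store does at its core: keep timestamped samples per series and answer range queries. The class name `TinyTimeSeriesStore` and its methods are hypothetical, chosen for illustration only; real systems such as Prometheus add compression, labels, retention, and a query language on top.

```python
import bisect
from collections import defaultdict


class TinyTimeSeriesStore:
    """Toy in-memory store (hypothetical): one sorted list of (timestamp, value) per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, timestamp, value):
        # insort keeps samples ordered by timestamp even if they arrive out of order.
        bisect.insort(self._series[name], (timestamp, value))

    def query_range(self, name, start, end):
        """Return all samples of a series with start <= timestamp <= end."""
        samples = self._series.get(name, [])
        lo = bisect.bisect_left(samples, (start, float("-inf")))
        hi = bisect.bisect_right(samples, (end, float("inf")))
        return samples[lo:hi]


store = TinyTimeSeriesStore()
store.append("cpu_usage", 30, 0.7)
store.append("cpu_usage", 10, 0.5)  # late arrival, still sorted correctly
store.append("cpu_usage", 20, 0.9)
print(store.query_range("cpu_usage", 10, 20))
```

Keeping each series sorted by time is the design choice that makes range queries cheap (two binary searches), which is the access pattern dashboards and alerting rely on.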
Netflix Zuul is a gateway service that provides dynamic routing, monitoring, resiliency, security, and more. https://github.com/Netflix/zuul
OpenTracing https://opentracing.io/
Sensu is a free and open source monitoring tool that handles cloud environments. Sensu allows you to monitor servers, services, application health, and business KPIs. https://xebialabs.com/technology/sensu/
Provenance analysis tools
Framework for instruction-level tracing and analysis of program executions http://static.usenix.org/event/vee06/full_papers/p154-bhansali.pdf
DevOps Metrics https://queue.acm.org/detail.cfm?id=3182626
Dapper, a large-scale distributed systems tracing infrastructure at Google http://research.google.com/pubs/pub36356.html
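For readers new to distributed tracing, the core data model described in the Dapper paper is a tree of spans: every span carries the trace id of the request it belongs to and the id of its parent span. The following is a rough sketch of that model only; the `Span` class and `new_trace` helper are hypothetical names and not the API of Dapper, OpenTracing, or any other tracer.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed operation inside a trace (illustrative, not a real tracer API)."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def child(self, name):
        # A child shares the trace id and points back to this span as its parent.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.end = time.monotonic()


def new_trace(name):
    """Start a root span with a fresh trace id and no parent."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


root = new_trace("GET /checkout")
db = root.child("db.query")
db.finish()
root.finish()
print(db.trace_id == root.trace_id, db.parent_id == root.span_id)
```

In a real deployment the trace id and parent span id are propagated across process boundaries (for example in request headers), so that spans recorded on different machines can be reassembled into one tree.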
Chaos Engineering & Observability https://www.infoq.com/news/2019/03/chaos-engineering-observability
Humio: All of your data: logs, metrics, traces. Search, analyze and visualize instantly. Live system observability. https://humio.com/
The OpenTracing project https://opentracing.io/
Papers:
I cannot recommend Ben Sigelman enough:
https://www.infoq.com/presentations/google-microservices
Ex-Google; he founded his company based on what he learned there. Must watch.
Honeycomb is a tool for introspecting and interrogating your production systems. https://www.honeycomb.io/
LightStep answers questions and diagnoses anomalies at scale, spanning mobile, monoliths, and microservices https://lightstep.com/
Datadog: https://www.datadoghq.com/
Article: New distributed tracing API completes the feedback loop https://www.theserverside.com/feature/New-distributed-tracing-API-completes-the-feedback-loop
Flame graphs and perf-top for JVMs inside Docker containers http://www.batey.info/docker-jvm-flamegraphs.html
Synthetic Kubernetes cluster monitoring with Kuberhealthy https://opensource.com/article/19/4/kuberhealthy
Course notes on monitoring: https://www.monperrus.net/martin/monitoring.pdf
Kiali project, observability for the Istio service mesh (thx @DokID) https://github.com/kiali/kiali
OpenMetrics: transmitting metrics at scale https://openmetrics.io/
Learning Chaos Engineering and Chaos toolkit on katacoda: https://www.katacoda.com/chaostoolkit
Contemporary Software Monitoring: A Systematic Literature Review https://arxiv.org/abs/1912.05878
A curated list of Chaos Engineering resources. https://github.com/dastergon/awesome-chaos-engineering/
Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.
https://www.gartner.com/smarterwithgartner/the-io-leaders-guide-to-chaos-engineering/
Contemporary Software Monitoring: A Systematic Mapping Study. http://arxiv.org/pdf/1912.05878
Cilium - eBPF-based Networking, Observability, and Security. Cilium's control plane is highly optimized, running in Kubernetes clusters of up to 5K nodes and 100K pods. https://cilium.io/
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. Can be used for monitoring events. Can be bridged with MQTT. https://aws.amazon.com/kinesis/data-streams/
Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.
Can be used to feed Prometheus.
Prometheus client libraries (including both official ones and many third-party ones) can be found here: https://prometheus.io/docs/instrumenting/clientlibs/
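As a rough illustration of what a metrics client ultimately exposes for Prometheus to scrape, the sketch below implements a tiny counter and renders it in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one line per label set). The `Counter` class here is hypothetical and written with the standard library only; it is not the API of Micrometer or of any official client library, and it skips details such as label-value escaping. For real services, use one of the client libraries linked above.

```python
class Counter:
    """Hypothetical minimal counter; NOT the API of a real Prometheus client."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted tuple of (label, value) pairs -> float

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        """Render in the Prometheus text exposition format (no label escaping)."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = "{" + label_str + "}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("http_requests_total", "Total HTTP requests.")
requests_total.inc(method="get")
requests_total.inc(2, method="get")
requests_total.inc(method="post")
print(requests_total.render(), end="")
```

Application code only calls `inc`; the rendering step is the part a facade like Micrometer can swap out per backend, which is what avoids vendor lock-in.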
Paper: "Enjoy your observability: an industrial survey of microservice tracing and analysis" http://link.springer.com/10.1007/s10664-021-10063-9