apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0
23.9k stars 6.52k forks source link

[Feature] Support AWS service monitoring #9883

Closed wu-sheng closed 1 year ago

wu-sheng commented 2 years ago

Search before asking

Description

In the recent several months, mostly due to @pg-yang @yswdqz hard work in these two iterations. We are now having more and more infrastructure monitoring, e.g. Linux, Kubernetes, Database, Gateway, MQ server(TODO, such as Kafka), Scheduler(TODO, such as Airflow)

Let's begin to do some research for monitoring cloud infra. As we know, this would be long-term work, this issue should be a catalog to list all potential components/services

List to be completed

Use case

We wouldn't trace the cloud infra from the code level, this is unrealistic. But most of good cloud services expose metrics through Prometheus, OpenTelemetry, or some cloud APIs. We could leverage those and integrate these monitoring in SkyWalking.

Making all metrics in one place side by side with their application performance data, and alerting through the same system.

@pg-yang told me, he would work on this after virtual MQ.

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

pg-yang commented 2 years ago

Yes

pg-yang commented 2 years ago

I updated the task list to add EKS and put it first as it's supported directly by https://github.com/aws-observability/aws-otel-collector, which is more clear and easy than others. Let's begin with EKS

wu-sheng commented 1 year ago

@pg-yang I noticed this, https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscloudwatchreceiver/README.md

I think we could use this for reading metrics of S3 and DynamoDB.

pg-yang commented 1 year ago

Let's pick up S3 as the second task

wu-sheng commented 1 year ago

@pg-yang, @yswdqz told me, he has interests to follow this too. I think you could share what you have about workflow, data formats and important docs here. And mostly, your next PR(aw-otel-receiver) would be widely used. We should follow the dataflow path you found and share.

pg-yang commented 1 year ago

Very happy that work with @yswdqz. I will share the information about aw-otel-receiver tonight. 🔢

pg-yang commented 1 year ago

From what I know

  1. Amazon CloudWatch could collect metrics from most AWS services(e.g. S3, API Gateway).
  2. CloudWatch metrics stream is a feature of CloudWatch, it could stream CloudWatch metrics to a destination of your choice. Actually by my test, the only option of destination is Amazon Kinesis Data Firehose
  3. Amazon Kinesis Data Firehose could send data to HTTP endpoint(refer to Choose HTTP Endpoint), and the HTTP endpoint should follow the specification

So we could develop a module(let's name it aws-firehose-receiver temporarily). aws-firehose-receiver will follow the specification to receive HTTP request from Amazon Kinesis Data Firehose, and convert the HTTP request to OTLP request, and then invoke org.apache.skywalking.oap.server.receiver.otel.otlp.OpenTelemetryMetricHandler#export for reusing code in otel-receiver-plugin.

Here is the data flow of S3(others are similar)

graph LR;
S3 --> CloudWatch --> CloudWatch-stream --> Amazon-Kinesis-Data-Firehose  -->  aws-firehose-receiver --> OpenTelemetryMetricHandler

Here is some doc maybe you want to read

  1. CloudWatch Metric Streams – Send AWS Metrics to Partners and to Your Apps in Real Time
  2. S3 metrics with Cloud Watch
  3. API Gateway metrics with Cloud Watch
  4. DynamoDB metrics with Cloud Watch

Here is S3 metrics sample data in JSON format

{"metric_stream_name":"QuickFull-QwkHX6","account_id":"008835616606","region":"ap-northeast-1","namespace":"AWS/S3","metric_name":"HeadRequests","dimensions":{"BucketName":"metricstreams-quickfull-qwkhx6-cyp2xuov","FilterId":"test-"},"timestamp":1673092920000,"value":{"max":1.0,"min":1.0,"sum":1.0,"count":1.0},"unit":"Count"}
{"metric_stream_name":"QuickFull-QwkHX6","account_id":"008835616606","region":"ap-northeast-1","namespace":"AWS/S3","metric_name":"4xxErrors","dimensions":{"BucketName":"metricstreams-quickfull-qwkhx6-cyp2xuov","FilterId":"test-"},"timestamp":1673092920000,"value":{"max":0.0,"min":0.0,"sum":0.0,"count":1.0},"unit":"Count"}

Notice, CloudWatch metrics stream support JSON, OpenTelemetry format as output, maybe OpenTelemetry format is more convenient to convert HTTP request to OTLP request.

yswdqz commented 1 year ago

Thank you, I will take a try !

wu-sheng commented 1 year ago

@yswdqz Please update the issue once you decided the service.

yswdqz commented 1 year ago

@yswdqz Please update the issue once you decided the service.

OK, I want to choose Amazon DynamoDB. I have updated issue.

yswdqz commented 1 year ago

@pg-yang About aws-firehose-receiver, can I be of any help?

pg-yang commented 1 year ago

About aws-firehose-receiver, can I be of any help?

The HTTP Specifications mentations Currently, only port 443 is supported for HTTP endpoint data delivery.. So we need to support HTTPS for OAP HTTPServer. and confirm if it accepts self-signed certificate

wu-sheng commented 1 year ago

Most cloud services should require TLS as always because TLS(as a part of Zero Trust) is in rigid demand today.

pg-yang commented 1 year ago

Yep, @yswdqz could enhance OAP HTTP server to support HTTPS, It's not conflicting with aws-firehose-receiver

yswdqz commented 1 year ago

Got it.

yswdqz commented 1 year ago

@pg-yang I just need to change org.apache.skywalking.oap.server.library.server.http.HTTPServer to support https?

wu-sheng commented 1 year ago

I think so, this is step one. awsfirehose to otlp to MAL is super easy.

yswdqz commented 1 year ago

@wu-sheng Which port should we open as https port? Should it be defined by the user, or should it be located on the same port as http?

wu-sheng commented 1 year ago

Does this http server listener for awsreceiver only?(I assume so)? Then it should be a new one.

pg-yang commented 1 year ago

By my test, the self-signed certificate will encounter errors such as:

{
    "deliveryStreamARN": "",
    "destination": "35........",
    "deliveryStreamVersionId": 1,
    "message": "Unable to complete an SSL Handshake with the endpoint due to invalid certification path. Contact the owner of the endpoint to resolve this issue.",
    "errorCode": "HttpEndpoint.SSLHandshakeCertificatePathFailure"
}

I will register a domain and public certificate to test firehouse. It also is required for testing AWS monitoring.

kezhenxu94 commented 1 year ago

By my test, the self-signed certificate will encounter errors such as:


{

    "deliveryStreamARN": "",

    "destination": "35........",

    "deliveryStreamVersionId": 1,

    "message": "Unable to complete an SSL Handshake with the endpoint due to invalid certification path. Contact the owner of the endpoint to resolve this issue.",

    "errorCode": "HttpEndpoint.SSLHandshakeCertificatePathFailure"

}

I will register a domain and public certificate to test firehouse. It also is required for testing AWS monitoring.

FWIW, ZeroSSL can issue certificates for an IP address, if you can run the OAP on a machine with a public IP, you can also test it with firehouse.

wu-sheng commented 1 year ago

That is good. We should be able to have public static IP on EC2.

pg-yang commented 1 year ago
        final byte[] decode = Base64.getDecoder().decode(s);
        final byte[] decode2 = Base64.getDecoder().decode(s2);
        final ResourceMetrics resourceMetrics = ResourceMetrics.parseFrom(Arrays.copyOfRange(decode, 5, decode.length));
        final ResourceMetrics resourceMetrics2 = ResourceMetrics.parseFrom(Arrays.copyOfRange(decode2, 5, decode2.length));

Hello, above code could work, but I don't know why. s and s2 encoded by base64. As mentioned in
OpenTelemetry 0.7.0 format: Each data structure starts with a header with an UnsignedVarInt32 indicating the record length in bytes. UnsignedVarInt32 is 4 bytes, right? But the code works correctly when I skip 5 byte, skipping 4 byte will encounter an exception.

pg-yang commented 1 year ago

OK, vary int 32. 🔢 If one knows the best way to parse var int, please tell me.

wu-sheng commented 1 year ago

The process doc should be https://developers.google.com/protocol-buffers/docs/encoding#varints

Maybe this helps. https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/util/VarInt.java#L88 Notice, we can't copy codes.

wu-sheng commented 1 year ago

Maybe this helps. https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/util/VarInt.java#L88 Notice, we can't copy codes.

Can't, I meam, if we need to copy, we need to update the license in the root about mentioning this copy.

wu-sheng commented 1 year ago

For 9.4, we should scope the support list to the above four. If we have time to add more, let's open new issues to track them separately. As I feel, with @pg-yang 's recent works, we should make the infrastructure(firehose receiver) ready, then others should mostly be about AWS setup, MAL, and UI dashboard setups.

pg-yang commented 1 year ago

FYI @yswdqz AWS metrics name contain special strings such /, .. org.apache.skywalking.oap.meter.analyzer.prometheus.PrometheusMetricConverter#escapedName only convert . to _. we should convert / as well.

Here is a demo of AWS s3 metrics name: amazonaws.com/AWS/S3/BytesDownloaded.

yswdqz commented 1 year ago

Got it, I will try to fix it.

wu-sheng commented 1 year ago

Why don't we change PrometheusMetricConverter#escapedMetricsNameCache directly to support / as well? I think / is illegal for MAL no matter the sources of metrics.

And @hanahmily @kezhenxu94 We don't have Prometheus fetcher anymore, what do you suggest for a new name of PrometheusMetricConverter? This class name seems unreasonable in the codes.

yswdqz commented 1 year ago

Why don't we change PrometheusMetricConverter#escapedMetricsNameCache directly to support / as well? I think / is illegal for MAL no matter the sources of metrics.

I also think so. I'm going to fix this, too.

kezhenxu94 commented 1 year ago

And @hanahmily @kezhenxu94 We don't have Prometheus fetcher anymore, what do you suggest for a new name of PrometheusMetricConverter? This class name seems unreasonable in the codes.

It looks reasonable to me because I explain PrometheusMetricConverter as "a converter to convert Prometheus metrics format". Besides, we have many other classes/package named with prometheus 👇 all these are referring to the Prometheus data format, not Prometheus fetcher in my mind.

import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Counter;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Gauge;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Histogram;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Metric;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Summary;