Closed by wu-sheng 1 year ago
Yes
I updated the task list to add EKS and put it first, as it's supported directly by https://github.com/aws-observability/aws-otel-collector, which is clearer and easier than the others. Let's begin with EKS.
@pg-yang I noticed this, https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscloudwatchreceiver/README.md
I think we could use this for reading metrics of S3 and DynamoDB.
Let's pick up S3 as the second task
@pg-yang, @yswdqz told me he is interested in following this too. I think you could share what you have about the workflow, data formats, and important docs here. Most importantly, your next PR (aws-otel-receiver) would be widely used. We should follow the data flow path you found and share it.
Very happy to work with @yswdqz. I will share the information about the aws-otel-receiver tonight.
From what I know, we could develop a module (let's name it aws-firehose-receiver temporarily). aws-firehose-receiver will follow the specification to receive HTTP requests from Amazon Kinesis Data Firehose, convert them to OTLP requests, and then invoke org.apache.skywalking.oap.server.receiver.otel.otlp.OpenTelemetryMetricHandler#export to reuse the code in otel-receiver-plugin.
Here is the data flow of S3 (others are similar):
graph LR;
S3 --> CloudWatch --> CloudWatch-stream --> Amazon-Kinesis-Data-Firehose --> aws-firehose-receiver --> OpenTelemetryMetricHandler
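To make the receiver side of this flow concrete, here is a minimal, hypothetical sketch of such an HTTP endpoint using only the JDK's built-in server. This is not the actual OAP implementation (which would reuse SkyWalking's own HTTP server); the path, requestId handling, and the TODO for record decoding are placeholder assumptions.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of the proposed aws-firehose-receiver endpoint: accept the Firehose
// HTTP delivery POST and acknowledge it. Real code would decode each record
// (base64 payload) and hand it to OpenTelemetryMetricHandler#export.
public class FirehoseEndpointSketch {
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/aws/firehose/metrics", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            // TODO: parse the JSON body, base64-decode each records[i].data,
            // convert to an OTLP request, and forward it to the OTel handler.
            String requestId = "unknown"; // should be echoed from the request JSON
            String response = "{\"requestId\":\"" + requestId + "\",\"timestamp\":"
                    + System.currentTimeMillis() + "}";
            byte[] out = response.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, out.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(out);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer s = start(0); // port 0 = ephemeral, for local experimentation
        System.out.println("listening on " + s.getAddress().getPort());
        s.stop(0);
    }
}
```

Firehose expects a 200 response echoing the delivery's requestId; anything else causes it to retry and eventually report delivery failure.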
Here are some docs you may want to read.
Here is S3 metric sample data in JSON format:
{"metric_stream_name":"QuickFull-QwkHX6","account_id":"008835616606","region":"ap-northeast-1","namespace":"AWS/S3","metric_name":"HeadRequests","dimensions":{"BucketName":"metricstreams-quickfull-qwkhx6-cyp2xuov","FilterId":"test-"},"timestamp":1673092920000,"value":{"max":1.0,"min":1.0,"sum":1.0,"count":1.0},"unit":"Count"}
{"metric_stream_name":"QuickFull-QwkHX6","account_id":"008835616606","region":"ap-northeast-1","namespace":"AWS/S3","metric_name":"4xxErrors","dimensions":{"BucketName":"metricstreams-quickfull-qwkhx6-cyp2xuov","FilterId":"test-"},"timestamp":1673092920000,"value":{"max":0.0,"min":0.0,"sum":0.0,"count":1.0},"unit":"Count"}
Note: CloudWatch metric streams support JSON and OpenTelemetry formats as output; the OpenTelemetry format may be more convenient for converting the HTTP request to an OTLP request.
Thank you, I will give it a try!
@yswdqz Please update the issue once you decided the service.
OK, I want to choose Amazon DynamoDB. I have updated the issue.
@pg-yang About aws-firehose-receiver, can I be of any help?
The HTTP specification mentions: "Currently, only port 443 is supported for HTTP endpoint data delivery." So we need to support HTTPS for the OAP HTTPServer, and confirm whether it accepts a self-signed certificate.
Most cloud services should require TLS as a rule, because TLS (as part of Zero Trust) is in strong demand today.
Yep, @yswdqz could enhance the OAP HTTP server to support HTTPS. It doesn't conflict with aws-firehose-receiver.
Got it.
@pg-yang Do I just need to change org.apache.skywalking.oap.server.library.server.http.HTTPServer to support HTTPS?
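For reference, the JDK itself ships an HTTPS-capable server, so the moving parts can be sketched without the SkyWalking stack: a keystore holding the server certificate and key, a KeyManagerFactory, and an SSLContext. This is not the actual OAP HTTPServer change (SkyWalking's server is a different stack); the class and file path below are assumptions for illustration only.

```java
import com.sun.net.httpserver.HttpsConfigurator;
import com.sun.net.httpserver.HttpsServer;
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

public class HttpsSketch {
    // Build an SSLContext from a PKCS#12 keystore holding the server cert + key.
    static SSLContext sslContextFrom(String p12Path, char[] password) throws Exception {
        KeyStore ks = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream(p12Path)) {
            ks.load(in, password);
        }
        KeyManagerFactory kmf =
                KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(ks, password);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(kmf.getKeyManagers(), null, null);
        return ctx;
    }

    static HttpsServer start(SSLContext ctx, int port) throws Exception {
        HttpsServer server = HttpsServer.create(new InetSocketAddress(port), 0);
        server.setHttpsConfigurator(new HttpsConfigurator(ctx));
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        // Default context only demonstrates the wiring; a real deployment must
        // load a context from a keystore so the server has key material.
        HttpsServer s = start(SSLContext.getDefault(), 0);
        System.out.println("HTTPS listener bound on port " + s.getAddress().getPort());
        s.stop(0);
    }
}
```

Note that, as discussed below, Firehose rejects self-signed certificates, so the keystore must hold a publicly trusted certificate chain.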
I think so; this is step one. Firehose to OTLP to MAL is super easy.
@wu-sheng Which port should we open as the HTTPS port? Should it be defined by the user, or should it share the same port as HTTP?
Does this HTTP server listen for the AWS receiver only? (I assume so.) Then it should be a new one.
From my test, a self-signed certificate will encounter errors such as:
{
"deliveryStreamARN": "",
"destination": "35........",
"deliveryStreamVersionId": 1,
"message": "Unable to complete an SSL Handshake with the endpoint due to invalid certification path. Contact the owner of the endpoint to resolve this issue.",
"errorCode": "HttpEndpoint.SSLHandshakeCertificatePathFailure"
}
I will register a domain and a public certificate to test Firehose. It is also required for testing AWS monitoring.
FWIW, ZeroSSL can issue certificates for an IP address; if you can run the OAP on a machine with a public IP, you can also test it with Firehose.
That is good. We should be able to have a public static IP on EC2.
final byte[] decode = Base64.getDecoder().decode(s);
final byte[] decode2 = Base64.getDecoder().decode(s2);
final ResourceMetrics resourceMetrics = ResourceMetrics.parseFrom(Arrays.copyOfRange(decode, 5, decode.length));
final ResourceMetrics resourceMetrics2 = ResourceMetrics.parseFrom(Arrays.copyOfRange(decode2, 5, decode2.length));
Hello, the code above works, but I don't know why. s and s2 are base64-encoded. As mentioned in the docs:
OpenTelemetry 0.7.0 format: Each data structure starts with a header with an UnsignedVarInt32 indicating the record length in bytes.
An UnsignedVarInt32 is 4 bytes, right? But the code only works when I skip 5 bytes; skipping 4 bytes raises an exception.
OK, it's a varint32: a variable-length encoding (1 to 5 bytes), not a fixed 4 bytes. If anyone knows the best way to parse a varint, please tell me.
The encoding doc should be https://developers.google.com/protocol-buffers/docs/encoding#varints
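Following that encoding doc, a clean-room varint32 reader is only a few lines. This is a JDK-only sketch (the stream-based signature and error policy are my own choices, not from any of the linked code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class VarInt32 {
    // Reads a protobuf-style unsigned varint32: 7 payload bits per byte,
    // least-significant group first, high bit set on every byte except the last.
    public static int read(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new IOException("truncated varint");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // high bit clear: this was the last byte
            }
        }
        throw new IOException("varint too long for 32 bits");
    }

    public static void main(String[] args) throws IOException {
        // 300 = 0b10_0101100 encodes as 0xAC 0x02
        InputStream in = new ByteArrayInputStream(new byte[]{(byte) 0xAC, 0x02});
        System.out.println(read(in)); // 300
    }
}
```

Reading the length this way, instead of skipping a hard-coded byte count, makes the record parsing robust regardless of how many bytes the length header happens to occupy.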
Maybe this helps: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/util/VarInt.java#L88 Note: we can't copy that code.
By "can't", I mean that if we need to copy it, we need to update the license in the root to mention this copy.
For 9.4, we should scope the support list to the above four. If we have time to add more, let's open new issues to track them separately. As I see it, with @pg-yang's recent work, we should get the infrastructure (firehose receiver) ready; then the rest should mostly be about AWS setup, MAL, and UI dashboard setups.
FYI @yswdqz, AWS metric names contain special characters such as "/" and ".". org.apache.skywalking.oap.meter.analyzer.prometheus.PrometheusMetricConverter#escapedName only converts "." to "_"; we should convert "/" as well.
Here is a sample AWS S3 metric name: amazonaws.com/AWS/S3/BytesDownloaded.
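A tiny sketch of the extended escaping (the class and method names below are hypothetical; the real change would go into the existing converter):

```java
public class MetricNameEscaper {
    // Extends the idea of PrometheusMetricConverter#escapedName: MAL metric
    // names cannot contain '.' or '/', so map both to '_'. The character class
    // [./] matches a literal dot or slash.
    public static String escape(String name) {
        return name.replaceAll("[./]", "_");
    }

    public static void main(String[] args) {
        System.out.println(escape("amazonaws.com/AWS/S3/BytesDownloaded"));
        // amazonaws_com_AWS_S3_BytesDownloaded
    }
}
```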
Got it, I will try to fix it.
Why don't we change PrometheusMetricConverter#escapedMetricsNameCache directly to support "/" as well? I think "/" is illegal for MAL no matter the source of the metrics.
And @hanahmily @kezhenxu94, we don't have the Prometheus fetcher anymore; what do you suggest as a new name for PrometheusMetricConverter? The class name seems unreasonable in the code.
I also think so. I'm going to fix this, too.
It looks reasonable to me because I read PrometheusMetricConverter as "a converter for the Prometheus metrics format". Besides, we have many other classes/packages named with prometheus 👇 all of these refer to the Prometheus data format, not the Prometheus fetcher, in my mind.
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Counter;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Gauge;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Histogram;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Metric;
import org.apache.skywalking.oap.server.library.util.prometheus.metrics.Summary;
Description
In the last several months, mostly thanks to @pg-yang's and @yswdqz's hard work across these two iterations, we now have more and more infrastructure monitoring, e.g. Linux, Kubernetes, databases, gateways, MQ servers (TODO, such as Kafka), schedulers (TODO, such as Airflow).
Let's begin to do some research on monitoring cloud infra. As we know, this will be long-term work, so this issue should be a catalog listing all potential components/services.
List to be completed
Use case
We wouldn't trace the cloud infra at the code level; that is unrealistic. But most good cloud services expose metrics through Prometheus, OpenTelemetry, or cloud APIs. We could leverage those and integrate this monitoring into SkyWalking.
This puts all metrics in one place, side by side with application performance data, with alerting through the same system.
@pg-yang told me he would work on this after the virtual MQ.