elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations

[AWS Usage] Overlapping documents when enabling TSDB - no more dimensions available #6783

Open constanca-m opened 1 year ago

constanca-m commented 1 year ago

Currently there is no way to distinguish between some documents from AWS Usage. If we enable TSDB with the dimensions as they are set now, they will not be enough and we will end up losing data. However, there are no keyword fields available to differentiate between these sets of documents. Example:

Document 1

```json
{
  "_index": ".ds-metrics-aws.usage-default-2023.06.29-000001",
  "_id": "VaIZB4kBLpMqNjezszQ9",
  "_version": 1,
  "_score": 0,
  "_source": {
    "cloud": { "provider": "aws", "region": "sa-east-1", "account": { "name": "elastic-observability", "id": "627286350134" } },
    "agent": { "name": "kind-control-plane", "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "type": "metricbeat", "ephemeral_id": "e63bc826-7b49-4d1d-85f0-40340b77461d", "version": "8.8.0" },
    "@timestamp": "2023-06-29T12:20:00.000Z",
    "ecs": { "version": "8.0.0" },
    "data_stream": { "namespace": "default", "type": "metrics", "dataset": "aws.usage" },
    "service": { "type": "aws" },
    "elastic_agent": { "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "version": "8.8.0", "snapshot": true },
    "host": {
      "hostname": "kind-control-plane",
      "os": { "kernel": "5.15.49-linuxkit", "codename": "focal", "name": "Ubuntu", "type": "linux", "family": "debian", "version": "20.04.6 LTS (Focal Fossa)", "platform": "ubuntu" },
      "containerized": false,
      "ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ],
      "name": "kind-control-plane",
      "id": "0aab3a64904042bdb1c956d6fe2fa4f1",
      "mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ],
      "architecture": "x86_64"
    },
    "metricset": { "period": 60000, "name": "cloudwatch" },
    "aws": {
      "usage": { "metrics": { "CallCount": { "sum": 28 } } },
      "cloudwatch": { "namespace": "AWS/Usage" },
      "dimensions": { "Type": "API", "Resource": "ListMetrics", "Service": "CloudWatch", "Class": "None" }
    },
    "event": { "duration": 9649888084, "agent_id_status": "verified", "ingested": "2023-06-29T12:21:12Z", "module": "aws", "dataset": "aws.usage" }
  },
  "fields": {
    "elastic_agent.version": [ "8.8.0" ],
    "host.os.name.text": [ "Ubuntu" ],
    "host.hostname": [ "kind-control-plane" ],
    "host.mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ],
    "service.type": [ "aws" ],
    "host.ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ],
    "agent.type": [ "metricbeat" ],
    "aws.dimensions.Class": [ "None" ],
    "event.module": [ "aws" ],
    "host.os.version": [ "20.04.6 LTS (Focal Fossa)" ],
    "host.os.kernel": [ "5.15.49-linuxkit" ],
    "host.os.name": [ "Ubuntu" ],
    "aws.cloudwatch.namespace": [ "AWS/Usage" ],
    "agent.name": [ "kind-control-plane" ],
    "elastic_agent.snapshot": [ true ],
    "host.name": [ "kind-control-plane" ],
    "event.agent_id_status": [ "verified" ],
    "aws.dimensions.Service": [ "CloudWatch" ],
    "host.id": [ "0aab3a64904042bdb1c956d6fe2fa4f1" ],
    "aws.usage.metrics.CallCount.sum": [ 28 ],
    "cloud.region": [ "sa-east-1" ],
    "host.os.type": [ "linux" ],
    "cloud.account.name": [ "elastic-observability" ],
    "elastic_agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ],
    "data_stream.namespace": [ "default" ],
    "metricset.period": [ 60000 ],
    "aws.dimensions.Type": [ "API" ],
    "host.os.codename": [ "focal" ],
    "data_stream.type": [ "metrics" ],
    "event.duration": [ 9649888084 ],
    "host.architecture": [ "x86_64" ],
    "metricset.name": [ "cloudwatch" ],
    "cloud.provider": [ "aws" ],
    "event.ingested": [ "2023-06-29T12:21:12.000Z" ],
    "@timestamp": [ "2023-06-29T12:20:00.000Z" ],
    "agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ],
    "host.containerized": [ false ],
    "ecs.version": [ "8.0.0" ],
    "host.os.platform": [ "ubuntu" ],
    "cloud.account.id": [ "627286350134" ],
    "data_stream.dataset": [ "aws.usage" ],
    "agent.ephemeral_id": [ "e63bc826-7b49-4d1d-85f0-40340b77461d" ],
    "agent.version": [ "8.8.0" ],
    "aws.dimensions.Resource": [ "ListMetrics" ],
    "host.os.family": [ "debian" ],
    "event.dataset": [ "aws.usage" ]
  }
}
```
Document 2

```json
{
  "_index": ".ds-metrics-aws.usage-default-2023.06.29-000001",
  "_id": "aaIaB4kBLpMqNjezKDWL",
  "_version": 1,
  "_score": 0,
  "_source": {
    "cloud": { "provider": "aws", "region": "sa-east-1", "account": { "name": "elastic-observability", "id": "627286350134" } },
    "agent": { "name": "kind-control-plane", "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "type": "metricbeat", "ephemeral_id": "e63bc826-7b49-4d1d-85f0-40340b77461d", "version": "8.8.0" },
    "@timestamp": "2023-06-29T12:20:00.000Z",
    "ecs": { "version": "8.0.0" },
    "service": { "type": "aws" },
    "data_stream": { "namespace": "default", "type": "metrics", "dataset": "aws.usage" },
    "host": {
      "hostname": "kind-control-plane",
      "os": { "kernel": "5.15.49-linuxkit", "codename": "focal", "name": "Ubuntu", "type": "linux", "family": "debian", "version": "20.04.6 LTS (Focal Fossa)", "platform": "ubuntu" },
      "ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ],
      "containerized": false,
      "name": "kind-control-plane",
      "id": "0aab3a64904042bdb1c956d6fe2fa4f1",
      "mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ],
      "architecture": "x86_64"
    },
    "elastic_agent": { "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "version": "8.8.0", "snapshot": true },
    "metricset": { "period": 60000, "name": "cloudwatch" },
    "aws": {
      "usage": { "metrics": { "CallCount": { "sum": 40 } } },
      "cloudwatch": { "namespace": "AWS/Usage" },
      "dimensions": { "Type": "API", "Resource": "ListMetrics", "Service": "CloudWatch", "Class": "None" }
    },
    "event": { "duration": 9720431083, "agent_id_status": "verified", "ingested": "2023-06-29T12:21:42Z", "module": "aws", "dataset": "aws.usage" }
  },
  "fields": {
    "elastic_agent.version": [ "8.8.0" ],
    "host.os.name.text": [ "Ubuntu" ],
    "host.hostname": [ "kind-control-plane" ],
    "host.mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ],
    "service.type": [ "aws" ],
    "host.ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ],
    "agent.type": [ "metricbeat" ],
    "aws.dimensions.Class": [ "None" ],
    "event.module": [ "aws" ],
    "host.os.version": [ "20.04.6 LTS (Focal Fossa)" ],
    "host.os.kernel": [ "5.15.49-linuxkit" ],
    "host.os.name": [ "Ubuntu" ],
    "aws.cloudwatch.namespace": [ "AWS/Usage" ],
    "agent.name": [ "kind-control-plane" ],
    "elastic_agent.snapshot": [ true ],
    "host.name": [ "kind-control-plane" ],
    "event.agent_id_status": [ "verified" ],
    "aws.dimensions.Service": [ "CloudWatch" ],
    "host.id": [ "0aab3a64904042bdb1c956d6fe2fa4f1" ],
    "aws.usage.metrics.CallCount.sum": [ 40 ],
    "cloud.region": [ "sa-east-1" ],
    "host.os.type": [ "linux" ],
    "cloud.account.name": [ "elastic-observability" ],
    "elastic_agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ],
    "data_stream.namespace": [ "default" ],
    "metricset.period": [ 60000 ],
    "aws.dimensions.Type": [ "API" ],
    "host.os.codename": [ "focal" ],
    "data_stream.type": [ "metrics" ],
    "event.duration": [ 9720431083 ],
    "host.architecture": [ "x86_64" ],
    "metricset.name": [ "cloudwatch" ],
    "cloud.provider": [ "aws" ],
    "event.ingested": [ "2023-06-29T12:21:42.000Z" ],
    "@timestamp": [ "2023-06-29T12:20:00.000Z" ],
    "agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ],
    "host.containerized": [ false ],
    "ecs.version": [ "8.0.0" ],
    "host.os.platform": [ "ubuntu" ],
    "cloud.account.id": [ "627286350134" ],
    "data_stream.dataset": [ "aws.usage" ],
    "agent.ephemeral_id": [ "e63bc826-7b49-4d1d-85f0-40340b77461d" ],
    "agent.version": [ "8.8.0" ],
    "aws.dimensions.Resource": [ "ListMetrics" ],
    "host.os.family": [ "debian" ],
    "event.dataset": [ "aws.usage" ]
  }
}
```

This issue might be hard to reproduce. When testing, I got the output "Out of 40000 documents from the index .ds-metrics-aws.usage-default-2023.06.29-000001, 429 of them were discarded.", which means that this is happening with only about 1% of the documents.
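To make the overlap concrete, here is a minimal Python sketch of the kind of check described above: two documents collide when they share the same `@timestamp` and the same value for every dimension. The `DIMENSION_FIELDS` list and the helper names are assumptions for illustration only; the real dimension set lives in the integration's field definitions, and this is not the actual test tooling.

```python
from collections import Counter

# Hypothetical dimension set, for illustration only. The real set is whatever
# the aws.usage data stream declares as dimensions in its field definitions.
DIMENSION_FIELDS = (
    "cloud.account.id",
    "cloud.region",
    "agent.id",
    "aws.dimensions.Type",
    "aws.dimensions.Resource",
    "aws.dimensions.Service",
    "aws.dimensions.Class",
)


def series_key(fields):
    """Return (timestamp, dimension values...) from a hit's flattened `fields`."""
    def first(name):
        values = fields.get(name)
        return values[0] if values else None

    return (first("@timestamp"),) + tuple(first(name) for name in DIMENSION_FIELDS)


def count_overlaps(hits):
    """Count documents that share timestamp and dimensions with an earlier one."""
    counts = Counter(series_key(hit["fields"]) for hit in hits)
    return sum(n - 1 for n in counts.values())


# Trimmed versions of Document 1 and Document 2 from this issue.
shared = {
    "@timestamp": ["2023-06-29T12:20:00.000Z"],
    "cloud.account.id": ["627286350134"],
    "cloud.region": ["sa-east-1"],
    "agent.id": ["178edbcb-2132-497d-b6da-e8c7d8095a90"],
    "aws.dimensions.Type": ["API"],
    "aws.dimensions.Resource": ["ListMetrics"],
    "aws.dimensions.Service": ["CloudWatch"],
    "aws.dimensions.Class": ["None"],
}
doc_1 = {"fields": {**shared, "aws.usage.metrics.CallCount.sum": [28]}}
doc_2 = {"fields": {**shared, "aws.usage.metrics.CallCount.sum": [40]}}

print(count_overlaps([doc_1, doc_2]))
```

With the two example documents the metric value differs, but the key built from timestamp and dimensions is identical, so one of them counts as an overlap.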

tetianakravchenko commented 1 year ago

I think I could reproduce a similar issue, but with the s3_request data_stream. The only 2 fields that differ between the 2 documents are event.duration and event.ingested (in addition to the _id field):

Screenshot 2023-07-05 at 22 10 29

From my understanding, those fields are ingested by the beats.

It could be a different issue, because for the usage data_stream there is one more field, aws.usage.metrics.CallCount.sum, that has a different value in the 2 documents.
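As an illustration of the comparison described here, a small Python sketch that lists which flattened fields differ between two hits. The helper name `differing_fields` is hypothetical, and the sample values are a trimmed subset of Document 1 and Document 2 from the issue description (the usage data stream, not s3_request).

```python
def differing_fields(fields_a, fields_b):
    """Return the sorted names of fields whose values differ between two hits."""
    keys = set(fields_a) | set(fields_b)
    return sorted(key for key in keys if fields_a.get(key) != fields_b.get(key))


# Trimmed subset of the flattened `fields` of Document 1 and Document 2.
doc_1_fields = {
    "@timestamp": ["2023-06-29T12:20:00.000Z"],
    "agent.ephemeral_id": ["e63bc826-7b49-4d1d-85f0-40340b77461d"],
    "aws.dimensions.Resource": ["ListMetrics"],
    "aws.usage.metrics.CallCount.sum": [28],
    "event.duration": [9649888084],
    "event.ingested": ["2023-06-29T12:21:12.000Z"],
}
doc_2_fields = {
    "@timestamp": ["2023-06-29T12:20:00.000Z"],
    "agent.ephemeral_id": ["e63bc826-7b49-4d1d-85f0-40340b77461d"],
    "aws.dimensions.Resource": ["ListMetrics"],
    "aws.usage.metrics.CallCount.sum": [40],
    "event.duration": [9720431083],
    "event.ingested": ["2023-06-29T12:21:42.000Z"],
}

for name in differing_fields(doc_1_fields, doc_2_fields):
    print(name, doc_1_fields.get(name), doc_2_fields.get(name))
```

For the usage documents this reports the metric field in addition to event.duration and event.ingested, which is the extra difference mentioned above.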

tommyers-elastic commented 1 year ago

been running the usage integration all day at 1hr collection periods and also not seen any dups. under what circumstances have you all seen this happen?

i'm conflicted about what to do about this, because if we decide to go ahead with TSDB for these datasets, it would be very unlikely we would see this issue again (the issue is masked). but i have a feeling that for the cases we have seen here, the duplicate values are acting like cumulative counters over the collection period - i.e. the newer value is the one we should take, and the opposite happens with TSDB (the first value is the one we take).
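a rough sketch of what that would mean for the two example documents, assuming the theory is right. this only models "keep the first write" versus "keep the latest write" per timestamp and dimension set in plain Python; the function name and tuple layout are made up for illustration and none of this is Elasticsearch code.

```python
def keep_one_per_series(documents, policy):
    """Keep one value per series key, using a "first" or "last" wins policy.

    `documents` is an iterable of (series_key, ingested, value) tuples.
    """
    if policy not in ("first", "last"):
        raise ValueError("policy must be 'first' or 'last'")
    kept = {}
    # Process in ingestion order; ISO-8601 UTC strings sort chronologically.
    for series, _ingested, value in sorted(documents, key=lambda d: d[1]):
        if policy == "first" and series in kept:
            continue  # a later duplicate is dropped
        kept[series] = value  # "last": a later duplicate overwrites
    return kept


# Same timestamp and dimensions, different CallCount.sum (values from the issue).
series = ("2023-06-29T12:20:00.000Z", "CloudWatch", "ListMetrics")
documents = [
    (series, "2023-06-29T12:21:12Z", 28),
    (series, "2023-06-29T12:21:42Z", 40),
]

print(keep_one_per_series(documents, "first"))
print(keep_one_per_series(documents, "last"))
```

if the values really are cumulative over the collection period, "first" keeps 28 and loses the more complete 40.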

i think figuring out how to repro this, and determining if the above theory is correct regarding cumulative values should be the next step. although it doesn't seem like a particularly impactful issue, metric accuracy is important for us, and we should do what we can to figure it out.

constanca-m commented 1 year ago

> been running the usage integration all day at 1hr collection periods and also not seen any dups. under what circumstances have you all seen this happen?

I honestly have no idea. I left AWS Usage running and it just appeared after some time with documents overlapping. I had many, many documents at that time. Sometimes I run the TSDB test for 40k documents and I do not see any overlap, so I have no idea what could be causing this.

How are you checking if you can reproduce this, @tommyers-elastic? My suggestion would be to leave AWS Usage running for 1 day and test on all documents. I am sure you would end up seeing the overlap.

tetianakravchenko commented 1 year ago

It seems there are a few (potential) issues. First use case: when elastic-agent is restarted, the same document is added again (there were not many changes, so the metrics are the same; tested with the S3 data_streams), but the _id, agent.ephemeral_id, event.duration and event.ingested are different.

How to reproduce

  1. start the stack using `elastic-package-0.83.2 stack up -d --version 8.8.0 -vv`
  2. add the aws integration to the Elastic-Agent (elastic-package) policy
  3. check that you got some data
  4. run `docker restart elastic-package-stack-elastic-agent-1`
  5. verify that data was added again with the same timestamp as before

@constanca-m can you please check if you can reproduce it with usage data_stream?

For this case, it is not clear why the documents are added with the same timestamp.

Second use case: the one where _id, event.duration and event.ingested are different, but agent.ephemeral_id is the same. I am still trying to reproduce it.

constanca-m commented 1 year ago

I don't think it is necessary to go that far. I was testing using Elastic Cloud and the documents I had in the aws.usage data stream. I checked the overwritten documents, and indeed, some of them differ only in those fields:

image

And another example in the same data stream:

image

So the document is the same, but it is weird that some metric changes sometimes:

image

I believe this last case is even harder to find. From the set of 10 documents, I think only one had a change of value on a metric.

I think these documents are all the same, in which case this is exactly what TSDB is for: discarding identical documents to save storage space.

tetianakravchenko commented 1 year ago

> I think these documents are all the same, in which case this is exactly what TSDB is for: discarding identical documents to save storage space.

With TSDB enabled, those duplicated documents will be silently dropped, but they are still generated, processed on the Beats side and sent to Elasticsearch, which is not optimal.

kaiyan-sheng commented 10 months ago

One thing we can try is to calculate a document ID based on the unique identifiers of that document. Right now we don't specify the ID, so when metricbeat/agent restarts, two documents will be sent to ES with different IDs but the same metrics.
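A minimal sketch of the idea, assuming the unique identifiers are the timestamp plus the dimension values. The function name `document_id` and the choice of SHA-256 over canonical JSON are assumptions for illustration; this is not how metricbeat or the agent builds IDs today.

```python
import hashlib
import json


def document_id(timestamp, dimensions):
    """Derive a deterministic ID from the timestamp and dimension values.

    Hypothetical scheme: SHA-256 over a canonical JSON encoding, so the same
    identifiers always produce the same ID regardless of key order.
    """
    payload = json.dumps(
        {"@timestamp": timestamp, "dimensions": dimensions},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Identifiers taken from the example documents in this issue.
dimensions = {
    "cloud.account.id": "627286350134",
    "cloud.region": "sa-east-1",
    "aws.dimensions.Type": "API",
    "aws.dimensions.Resource": "ListMetrics",
    "aws.dimensions.Service": "CloudWatch",
    "aws.dimensions.Class": "None",
}

before_restart = document_id("2023-06-29T12:20:00.000Z", dimensions)
after_restart = document_id("2023-06-29T12:20:00.000Z", dict(reversed(list(dimensions.items()))))
print(before_restart == after_restart)
```

With an ID like this, a document re-sent after a restart would target the same ID as the original instead of creating a second document. Fields such as agent.ephemeral_id, event.duration and event.ingested are deliberately left out of the hash, since they change across restarts.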