Open constanca-m opened 1 year ago
I think I could reproduce a similar issue, but with the `s3_request` data_stream. The only 2 fields that differ between the 2 documents are `event.duration` and `event.ingested` (in addition to the `_id` field):
From my understanding, those fields are ingested by the beats.
It could be a different issue, though, because for the `usage` data_stream there is one more field, `aws.usage.metrics.CallCount.sum`, that has a different value in the 2 documents.
been running the usage integration all day at 1hr collection periods and also not seen any dups. under what circumstances have you all seen this happen?
i'm conflicted about what to do about this, because if we decide to go ahead with TSDB for these datasets, it would be very unlikely we would see this issue again (the issue is masked). but i have a feeling that for the cases we have seen here, the duplicate values are acting like cumulative counters over the collection period - i.e. the newer value is the one we should take, and the opposite happens with TSDB (the first value is the one we take).
i think figuring out how to repro this, and determining if the above theory is correct regarding cumulative values should be the next step. although it doesn't seem like a particularly impactful issue, metric accuracy is important for us, and we should do what we can to figure it out.
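To make the two behaviours concrete, here is a minimal Python sketch of the theory above. It is only a toy model, not Elasticsearch's TSDB implementation: `series_key` and `dedupe` are hypothetical helpers, the document shape is simplified, and the two `CallCount.sum` values (28 and 40) are the ones seen in a pair of overlapping aws.usage documents reported in this thread.

```python
def series_key(doc):
    """Identity of a data point: timestamp plus its dimensions (simplified)."""
    return (doc["@timestamp"], tuple(sorted(doc["dimensions"].items())))


def dedupe(docs, keep="first"):
    """Keep one document per key: the first one seen, or the last one seen."""
    kept = {}
    for doc in docs:
        key = series_key(doc)
        if keep == "last" or key not in kept:
            kept[key] = doc
    return list(kept.values())


# Two documents with the same timestamp and dimensions, in arrival order.
docs = [
    {"@timestamp": "2023-06-29T12:20:00.000Z",
     "dimensions": {"Service": "CloudWatch", "Resource": "ListMetrics"},
     "CallCount.sum": 28},
    {"@timestamp": "2023-06-29T12:20:00.000Z",
     "dimensions": {"Service": "CloudWatch", "Resource": "ListMetrics"},
     "CallCount.sum": 40},
]

first = dedupe(docs, keep="first")  # what the thread says happens with TSDB
last = dedupe(docs, keep="last")    # what we would want if the value is cumulative
```

If the cumulative theory is right, `keep="last"` gives the more accurate value and `keep="first"` under-reports it.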
> been running the usage integration all day at 1hr collection periods and also not seen any dups. under what circumstances have you all seen this happen?
I honestly have no idea. I left AWS Usage running and it just appeared after some time with documents overlapping. I had many, many documents at that time. Sometimes I run the TSDB test for 40k documents and I do not see any overlap, so I have no idea what could be causing this.
How are you checking if you can reproduce this @tommyers-elastic ? My suggestion would be to leave AWS Usage for 1 day and test on all documents, I am sure you would end up seeing the overlap.
It seems there are a few (potential) issues:
first use case - when restarting elastic-agent, the same document is added again (there were not many changes, so the metrics are the same; tested with the S3 data_streams), but the `_id`, `agent.ephemeral_id`, `event.duration` and `event.ingested` are different.
How to reproduce
elastic-package-0.83.2 stack up -d --version 8.8.0 -vv
Elastic-Agent (elastic-package) policy
docker restart elastic-package-stack-elastic-agent-1
@constanca-m can you please check if you can reproduce it with usage data_stream?
for this case it is not clear why the documents are added with the same timestamp?
second use case - the one where `_id`, `event.duration` and `event.ingested` are different, but `agent.ephemeral_id` is the same - trying to reproduce
I don't think it is necessary to go that far. I was testing using Elastic Cloud and the documents I had on aws.usage data stream. I checked the overwritten documents, and indeed, some of them only have those fields as a difference:
And another example in the same data stream:
So the document is the same, but it is weird that some metric changes sometimes:
I believe this last case is even harder to find. From the set of 10 documents, I think only one had a change of value on a metric.
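For anyone who wants to check a pair of overlapping documents without comparing them by eye, here is a small Python sketch. `flatten` and `differing_fields` are hypothetical helpers, and the two inputs are heavily trimmed versions of the aws.usage documents quoted later in this thread (only a few fields are kept).

```python
def flatten(obj, prefix=""):
    """Turn nested dicts into a flat {"a.b.c": value} mapping."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def differing_fields(doc_a, doc_b):
    """Return the sorted field paths whose values differ between two documents."""
    flat_a, flat_b = flatten(doc_a), flatten(doc_b)
    return sorted(k for k in flat_a.keys() | flat_b.keys()
                  if flat_a.get(k) != flat_b.get(k))


# Trimmed _source of two overlapping documents (same timestamp and dimensions).
doc_1 = {
    "@timestamp": "2023-06-29T12:20:00.000Z",
    "agent": {"ephemeral_id": "e63bc826-7b49-4d1d-85f0-40340b77461d"},
    "aws": {"usage": {"metrics": {"CallCount": {"sum": 28}}}},
    "event": {"duration": 9649888084, "ingested": "2023-06-29T12:21:12Z"},
}
doc_2 = {
    "@timestamp": "2023-06-29T12:20:00.000Z",
    "agent": {"ephemeral_id": "e63bc826-7b49-4d1d-85f0-40340b77461d"},
    "aws": {"usage": {"metrics": {"CallCount": {"sum": 40}}}},
    "event": {"duration": 9720431083, "ingested": "2023-06-29T12:21:42Z"},
}

diff = differing_fields(doc_1, doc_2)
print(diff)
```

The `_id` is not part of `_source`, so it would have to be compared separately.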
I think these documents are all the same, which in that case, it is exactly what TSDB is for: discard the same document to save storage space.
> I think these documents are all the same, which in that case, it is exactly what TSDB is for: discard the same document to save storage space.
with tsdb enabled those duplicated documents will be silently dropped, but they are still generated, processed on the beats side and sent to elasticsearch, which is not optimal.
One thing we can try is to calculate a document ID based on the unique identifiers of that document. Right now we don't specify the ID so when metricbeat/agent restarts, two documents will be sent to ES with diff ID but same metrics.
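A rough Python sketch of that idea, for illustration only. `deterministic_id` is a hypothetical helper, not an existing Metricbeat function, and the choice of SHA-256 truncated to 20 characters is an assumption. Which fields count as "unique identifiers" is exactly the open question in this thread; the sketch simply uses the timestamp and the CloudWatch dimensions.

```python
import hashlib
import json


def deterministic_id(timestamp, dimensions):
    """Derive a stable ID from the timestamp and the dimensions.

    Assumption: these fields together identify a data point. Volatile fields
    such as event.duration, event.ingested and agent.ephemeral_id are
    deliberately left out so that a restart produces the same ID.
    """
    payload = json.dumps(
        {"@timestamp": timestamp, "dimensions": dimensions},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:20]


id_before_restart = deterministic_id(
    "2023-06-29T12:20:00.000Z",
    {"Type": "API", "Resource": "ListMetrics", "Service": "CloudWatch", "Class": "None"},
)
id_after_restart = deterministic_id(
    "2023-06-29T12:20:00.000Z",
    {"Class": "None", "Service": "CloudWatch", "Resource": "ListMetrics", "Type": "API"},
)
id_next_period = deterministic_id(
    "2023-06-29T12:21:00.000Z",
    {"Type": "API", "Resource": "ListMetrics", "Service": "CloudWatch", "Class": "None"},
)
```

With an ID like this, the second copy of a document would collide with the first instead of being stored as a new document. Note that it would also collide when the metric value differs, so it does not by itself answer which value to keep.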
Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as `Stale` to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!
Currently there is no way to distinguish between some documents from AWS Usage. If we enable TSDB with the dimensions as they are set now, they will not be enough and we will end up losing data. However, there are no keyword fields available to differentiate between these sets of documents. Example:
Document 1
```json
{ "_index": ".ds-metrics-aws.usage-default-2023.06.29-000001", "_id": "VaIZB4kBLpMqNjezszQ9", "_version": 1, "_score": 0, "_source": { "cloud": { "provider": "aws", "region": "sa-east-1", "account": { "name": "elastic-observability", "id": "627286350134" } }, "agent": { "name": "kind-control-plane", "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "type": "metricbeat", "ephemeral_id": "e63bc826-7b49-4d1d-85f0-40340b77461d", "version": "8.8.0" }, "@timestamp": "2023-06-29T12:20:00.000Z", "ecs": { "version": "8.0.0" }, "data_stream": { "namespace": "default", "type": "metrics", "dataset": "aws.usage" }, "service": { "type": "aws" }, "elastic_agent": { "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "version": "8.8.0", "snapshot": true }, "host": { "hostname": "kind-control-plane", "os": { "kernel": "5.15.49-linuxkit", "codename": "focal", "name": "Ubuntu", "type": "linux", "family": "debian", "version": "20.04.6 LTS (Focal Fossa)", "platform": "ubuntu" }, "containerized": false, "ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ], "name": "kind-control-plane", "id": "0aab3a64904042bdb1c956d6fe2fa4f1", "mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ], "architecture": "x86_64" }, "metricset": { "period": 60000, "name": "cloudwatch" }, "aws": { "usage": { "metrics": { "CallCount": { "sum": 28 } } }, "cloudwatch": { "namespace": "AWS/Usage" }, "dimensions": { "Type": "API", "Resource": "ListMetrics", "Service": "CloudWatch", "Class": "None" } }, "event": { "duration": 9649888084, "agent_id_status": "verified", "ingested": "2023-06-29T12:21:12Z", "module": "aws", "dataset": "aws.usage" } }, "fields": { "elastic_agent.version": [ "8.8.0" ], "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "kind-control-plane" ], "host.mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ], "service.type": [ "aws" ], "host.ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ], "agent.type": [ "metricbeat" ], "aws.dimensions.Class": [ "None" ], "event.module": [ "aws" ], "host.os.version": [ "20.04.6 LTS (Focal Fossa)" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "host.os.name": [ "Ubuntu" ], "aws.cloudwatch.namespace": [ "AWS/Usage" ], "agent.name": [ "kind-control-plane" ], "elastic_agent.snapshot": [ true ], "host.name": [ "kind-control-plane" ], "event.agent_id_status": [ "verified" ], "aws.dimensions.Service": [ "CloudWatch" ], "host.id": [ "0aab3a64904042bdb1c956d6fe2fa4f1" ], "aws.usage.metrics.CallCount.sum": [ 28 ], "cloud.region": [ "sa-east-1" ], "host.os.type": [ "linux" ], "cloud.account.name": [ "elastic-observability" ], "elastic_agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ], "data_stream.namespace": [ "default" ], "metricset.period": [ 60000 ], "aws.dimensions.Type": [ "API" ], "host.os.codename": [ "focal" ], "data_stream.type": [ "metrics" ], "event.duration": [ 9649888084 ], "host.architecture": [ "x86_64" ], "metricset.name": [ "cloudwatch" ], "cloud.provider": [ "aws" ], "event.ingested": [ "2023-06-29T12:21:12.000Z" ], "@timestamp": [ "2023-06-29T12:20:00.000Z" ], "agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ], "host.containerized": [ false ], "ecs.version": [ "8.0.0" ], "host.os.platform": [ "ubuntu" ], "cloud.account.id": [ "627286350134" ], "data_stream.dataset": [ "aws.usage" ], "agent.ephemeral_id": [ "e63bc826-7b49-4d1d-85f0-40340b77461d" ], "agent.version": [ "8.8.0" ], "aws.dimensions.Resource": [ "ListMetrics" ], "host.os.family": [ "debian" ], "event.dataset": [ "aws.usage" ] } }
```

Document 2
```json
{ "_index": ".ds-metrics-aws.usage-default-2023.06.29-000001", "_id": "aaIaB4kBLpMqNjezKDWL", "_version": 1, "_score": 0, "_source": { "cloud": { "provider": "aws", "region": "sa-east-1", "account": { "name": "elastic-observability", "id": "627286350134" } }, "agent": { "name": "kind-control-plane", "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "type": "metricbeat", "ephemeral_id": "e63bc826-7b49-4d1d-85f0-40340b77461d", "version": "8.8.0" }, "@timestamp": "2023-06-29T12:20:00.000Z", "ecs": { "version": "8.0.0" }, "service": { "type": "aws" }, "data_stream": { "namespace": "default", "type": "metrics", "dataset": "aws.usage" }, "host": { "hostname": "kind-control-plane", "os": { "kernel": "5.15.49-linuxkit", "codename": "focal", "name": "Ubuntu", "type": "linux", "family": "debian", "version": "20.04.6 LTS (Focal Fossa)", "platform": "ubuntu" }, "ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ], "containerized": false, "name": "kind-control-plane", "id": "0aab3a64904042bdb1c956d6fe2fa4f1", "mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ], "architecture": "x86_64" }, "elastic_agent": { "id": "178edbcb-2132-497d-b6da-e8c7d8095a90", "version": "8.8.0", "snapshot": true }, "metricset": { "period": 60000, "name": "cloudwatch" }, "aws": { "usage": { "metrics": { "CallCount": { "sum": 40 } } }, "cloudwatch": { "namespace": "AWS/Usage" }, "dimensions": { "Type": "API", "Resource": "ListMetrics", "Service": "CloudWatch", "Class": "None" } }, "event": { "duration": 9720431083, "agent_id_status": "verified", "ingested": "2023-06-29T12:21:42Z", "module": "aws", "dataset": "aws.usage" } }, "fields": { "elastic_agent.version": [ "8.8.0" ], "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "kind-control-plane" ], "host.mac": [ "02-42-AC-12-00-02", "02-42-AC-13-00-04", "06-DD-17-EE-41-97", "22-F1-EB-33-1A-13", "66-56-4C-AB-83-C0" ], "service.type": [ "aws" ], "host.ip": [ "10.244.0.1", "10.244.0.1", "10.244.0.1", "172.18.0.2", "fc00:f853:ccd:e793::2", "fe80::42:acff:fe12:2", "172.19.0.4" ], "agent.type": [ "metricbeat" ], "aws.dimensions.Class": [ "None" ], "event.module": [ "aws" ], "host.os.version": [ "20.04.6 LTS (Focal Fossa)" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "host.os.name": [ "Ubuntu" ], "aws.cloudwatch.namespace": [ "AWS/Usage" ], "agent.name": [ "kind-control-plane" ], "elastic_agent.snapshot": [ true ], "host.name": [ "kind-control-plane" ], "event.agent_id_status": [ "verified" ], "aws.dimensions.Service": [ "CloudWatch" ], "host.id": [ "0aab3a64904042bdb1c956d6fe2fa4f1" ], "aws.usage.metrics.CallCount.sum": [ 40 ], "cloud.region": [ "sa-east-1" ], "host.os.type": [ "linux" ], "cloud.account.name": [ "elastic-observability" ], "elastic_agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ], "data_stream.namespace": [ "default" ], "metricset.period": [ 60000 ], "aws.dimensions.Type": [ "API" ], "host.os.codename": [ "focal" ], "data_stream.type": [ "metrics" ], "event.duration": [ 9720431083 ], "host.architecture": [ "x86_64" ], "metricset.name": [ "cloudwatch" ], "cloud.provider": [ "aws" ], "event.ingested": [ "2023-06-29T12:21:42.000Z" ], "@timestamp": [ "2023-06-29T12:20:00.000Z" ], "agent.id": [ "178edbcb-2132-497d-b6da-e8c7d8095a90" ], "host.containerized": [ false ], "ecs.version": [ "8.0.0" ], "host.os.platform": [ "ubuntu" ], "cloud.account.id": [ "627286350134" ], "data_stream.dataset": [ "aws.usage" ], "agent.ephemeral_id": [ "e63bc826-7b49-4d1d-85f0-40340b77461d" ], "agent.version": [ "8.8.0" ], "aws.dimensions.Resource": [ "ListMetrics" ], "host.os.family": [ "debian" ], "event.dataset": [ "aws.usage" ] } }
```

This issue might be hard to reproduce. When testing, I got the output:
`Out of 40000 documents from the index .ds-metrics-aws.usage-default-2023.06.29-000001, 429 of them were discarded.`
This means that it is happening with only about 1% of the documents.
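For reference, a count like the one above can be approximated offline with a few lines of Python. This is only a sketch under assumptions: `count_discarded` is a hypothetical helper, the documents are assumed to be already flattened to dotted field names, and the list of dimension fields is illustrative rather than the actual TSDB dimension configuration of the package.

```python
from collections import Counter


def count_discarded(docs, dimension_fields):
    """Count documents that share timestamp + dimensions with an earlier one."""
    seen = Counter()
    for doc in docs:
        key = (doc["@timestamp"],) + tuple(doc.get(f) for f in dimension_fields)
        seen[key] += 1
    # For every key, all documents after the first would be dropped.
    return sum(n - 1 for n in seen.values())


# Illustrative dimension set (an assumption, not the package definition).
DIMENSIONS = ["aws.dimensions.Service", "aws.dimensions.Resource",
              "aws.dimensions.Type", "aws.dimensions.Class", "cloud.region"]

base = {"aws.dimensions.Service": "CloudWatch", "aws.dimensions.Resource": "ListMetrics",
        "aws.dimensions.Type": "API", "aws.dimensions.Class": "None",
        "cloud.region": "sa-east-1"}
docs = [
    {**base, "@timestamp": "2023-06-29T12:20:00.000Z", "aws.usage.metrics.CallCount.sum": 28},
    {**base, "@timestamp": "2023-06-29T12:20:00.000Z", "aws.usage.metrics.CallCount.sum": 40},
    {**base, "@timestamp": "2023-06-29T12:21:00.000Z", "aws.usage.metrics.CallCount.sum": 12},
]

discarded = count_discarded(docs, DIMENSIONS)

# Share of discarded documents in the run reported above.
discarded_pct = 100 * 429 / 40000
```

The third document and its `CallCount.sum` value are made up for the example; only the first two mirror the documents shown above.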