[ECS] [TSDB] Centralisation of Dimension Fields

agithomas commented 1 year ago

Scope

Identification of common fields in package that can be moved to ecs as dimensions.
Aggregation of fields that related to service integration and cloud native , if needed.
Identification of ecs repo changes and ecs testing.
Creation of PR

agithomas commented 1 year ago

From the context of packages having ownership of service integration team, there exist certain fields that are part of every package and they are potential candidates of becoming dimension fields.

service.address
host.ip or host.name (Expect duplicates if the host.ip is not public and is a part of a subnet)
agent.id

agithomas commented 1 year ago

When metrics are collected from a resource running in cloud or in a container, below mentioned fields are potential candidates of becoming dimension fields

cloud.instance.id or cloud.instance.name
cloud.provider : This is considering multi-cloud / hybrid infra deployment
cloud.project.id : One organisation can have multiple projects. This is to support multi-regional deployment
container.id : container.name may not be apt here. There may exist pods having same container names running in same host.
host.name: Expect duplicates if the host.ip is not public and is a part of a subnet
agent.id

Should subnet / network name be include ? TBD

ruflin commented 1 year ago

For apps running in k8s, we do wee something like k8s.cluster.name or similar?
Lets assume we specify all the fields above in ECS as dimension fields. In the k8s case, the cloud provider fields would likely not be used. Do this also count then to the max 16 fields even if there are no values? And how does this look, if the fields are set in dynamic templates? (@P1llus )

lalit-satapathy commented 1 year ago

CC @tommyers-elastic @gizas @felixbarny for any suggestions/comments on the TSDB ECS dimension fields. Once we close on these ECS fields, we can raise a PR for the same.

felixbarny commented 1 year ago

@kruskall has tried to define the dimensions for APM (https://github.com/elastic/apm-server/pull/9730) but quickly hit the dimension limit (https://github.com/elastic/elasticsearch/issues/93564).

ruflin commented 1 year ago

I like the direction https://github.com/elastic/elasticsearch/issues/93564 is taking. An initial shortcut might be to just increase the limit to 32 which could already help.

Looking at https://github.com/elastic/apm-server/pull/9730, it seems there is some overlap with the dimensions proposed here but there are also quite a few dimensions which I would argue are potentially unique to apm data. It would be nice if all the default dimensions can be set in ECS (or a common base) but are only used / applied when there is actual data. This goes back to my question: Is the limit reached when a field for the dimension is there or if it is in the mapping itself. If mapping is already enough, will it help to have it in the dynamic template?

For each default dimension define, I would like to see us have a note on why it is a dimension. There went a lot of thought into which dimensions to pick and it should be persisted and shared.

agithomas commented 1 year ago

Can we consider the above mentioned list for Service Integration and for Cloud (https://github.com/elastic/integrations/issues/5193#issuecomment-1422401976). We may have to expect new common fields added to the above list as we test new scenario.

We know, in such cases, explicit dimension field mapping can be done to avoid being blocked.

Is the above list good enough to work towards preparing RFC in ECS ?

agithomas commented 1 year ago

For apps running in k8s, we do wee something like k8s.cluster.name or similar?

I think, having host.ip and container.id will cover the criteria of unique identification of document. Don't you think so?

ruflin commented 1 year ago

Is the above list good enough to work towards preparing RFC in ECS ?

I assume so, you know best :-)

I think, having host.ip and container.id will cover the criteria of unique identification of document. Don't you think so?

Is a container.id globally unique? I assume the chances are low enough for conflicts to have this work.

agithomas commented 1 year ago

Is a container.id globally unique? I assume the chances are low enough for conflicts to have this work.

There are no references i could find that says a container.id repeats in a cluster.

I understand the scenario you are referring to - the scenario where multiple k8s cluster are provisioned. Cluster name is a good logical segregation in such cases.

Reference : https://cloud.google.com/stackdriver/docs/solutions/gke/observing

It may then be checked

Are the ways to get the cluster name in GKE / EKS / Self managed installation different or same?
Are these information currently captured ? If yes, which ecs field is currently used ?

These may be the questions that can be asked to the owners of GKE / EKS integration team.

agithomas commented 1 year ago

It may then be checked

Are the ways to get the cluster name in GKE / EKS / Self managed installation different or same?

Are these information currently captured ? If yes, which ecs field is currently used ?

These may be the questions that can be asked to the owners of GKE / EKS integration team.

My preference would be not to modify the ecs dimension fields based on an application or cluster technology / deployment architecture. Every technology such aaws.kinesis, gke , eks may have its own identifier a unique resource it is monitoring. In akw.kinesis it is aws.dimensions.streamName.

As part of integration enhacement to use TSDB, it is expected these unique fields that represent a resource is identified carefully as dimension field in the integration.

agithomas commented 1 year ago

data_stream.namespace is an important one too. Adding it to the confirmed list.

ruflin commented 1 year ago

Can you share some details why data_stream.namespace is an important dimension? If this value changes, the data goes into a different index.

agithomas commented 1 year ago

Can you share some details why data_stream.namespace is an important dimension? If this value changes, the data goes into a different index.

As you rightly said, this is unnecessary.

lalit-satapathy commented 1 year ago

@agithomas,

Can we summarise the final list of ECS fields which are dimensions below and one-line description for each, providing the rationale?

felixbarny commented 1 year ago

When the dimension limit is removed (https://github.com/elastic/elasticsearch/issues/93564), can we just make any non-metric (keyword?) field a dimension by default. I don't see the value of spending our time on finding out what good dimension fields are. Other TSDBs only support two types of fields: metrics and dimensions. Can we just operate under the same mental model?

lalit-satapathy commented 1 year ago

@felixbarny,

Its a good point, in particular, it's not very clear what is the difference between not-dimension meta fields which are keywords vs. dimension fields. TSDB documents will primarily contain a combination of metric fields, dimension fields and meta fields. From a document query point of view, assuming dimension fields and meta fields behave the same.

Currently missing any specific details, which links number of dimension fields vs. TSDB size/performance. I am assuming there is a relationship. Hoping someone from ES team, can provide more details on this.

In the mean time, we can just continue to annotate dimensions, as is this is the ask for TSDB enablement.

felixbarny commented 1 year ago

@martijnvg could you give us some guidance on the impact of having a lot of dimensions, assuming the _tsid is a hash and there are no size restrictions. See also https://github.com/elastic/elasticsearch/issues/93564#issuecomment-1437147904.

What if any negative consequences do we need to expect if we declare too many dimensions or when making all non-metric fields a dimension by default?

Note that this is the default in other TSDBs so if there are negative consequences in ES when treating non-metric fields as dimensions by default, I'd be curious to have your thoughts on whether they're tolerable, and if not, what we could do to minimize the impact so that we can work with ES like with any other TSDB.

martijnvg commented 1 year ago

@felixbarny I need to think more about this.

We might end up with a default dynamic mapping in where every keyword field or every non-metric field (everything except for counter, gauge, or histogram) in order to support dynamic user-defined metrics. (from https://github.com/elastic/elasticsearch/issues/93564#issuecomment-1437147904.)

How are keyword labels modelled in this model?

felixbarny commented 1 year ago

Keyword labels would be mapped as a dimension. By default, everything except actual metrics would be a dimension.

agithomas commented 1 year ago

While we discuss the limitations and the possible future enhancements, i would like to freeze the ecs fields which must be marked as dimension fields.

host.ip
service.address
agent.id
cloud.project.id
cloud.instance.id
cloud.provider
container.id

ruflin commented 1 year ago

Lets separate immediate changes from future plans. TSDB is to be released soonish and we want to adopt it in integrations to also make sure it all works as expected. This is where we need the list from @agithomas . These are all ECS fields and if we add it to ECS, all integrations will have these dimensions by default as soon as ECS is updated. Everything using ECS will have a field annotated as dimension from there on, but as long as TSDB is not enabled, it wont have any effect. @agithomas List LGTM

Then there is the mid term and long term and I agree with @felixbarny , ideally we should not have to think about dimensions at all but this will likely not happen immediately. I suggest to keep the "no dimension" discussion in the Elasticsearch issue.

agithomas commented 1 year ago

host.ip

service.address

agent.id

cloud.project.id

cloud.instance.id

cloud.provider

container.id

Based on the recommendations, host.ip will be replaced by host.name.

The new list will be

host.name
service.address
agent.id
cloud.project.id
cloud.instance.id
cloud.provider
container.id

agithomas commented 1 year ago

@lalit-satapathy ,can you please help by approving , if there are no further queries?

lalit-satapathy commented 1 year ago

@agithomas,

Lets update the TSDB migration document to change from host.ip to host.name

lalit-satapathy commented 1 year ago

Let's fix the already done packages (example: oracle) to update from host.ip to host.name, where ever applicable.

agithomas commented 1 year ago

Let's fix the already done packages (example: oracle) to update from host.ip to host.name, where ever applicable.

PR Link : https://github.com/elastic/integrations/pull/5697

agithomas commented 1 year ago

@agithomas,

Lets update the TSDB migration document to change from host.ip to host.name

PR Link : https://github.com/elastic/integrations/pull/5706

agithomas commented 1 year ago

Reopening the issue based on the values observed from recent testing across multiple cloud providers.

agithomas commented 1 year ago

The comment below is ONLY from the context of service integration packages.

Existence of values for various `cloud.*` fields across different cloud providers	Field	AWS	Azure
`cloud.project.id`	No	No	Yes
`cloud.account.id`	Yes	No	Yes
`cloud.region`	Yes	Yes	No
`cloud.availability_zone`	Yes	No	Yes

Based on these information, following is the recommendation

Use cloud.account.id instead of cloud.project.id as dimension field

Open points

Should
- cloud.availability_zone ( or )
- cloud.region + cloud.availability_zone be added as dimension fields ?
Does azure not having values for cloud.availability_zone be an issue?

agithomas commented 1 year ago

cloud.instance.id and it's dependency on cloud.region and cloud.availability_zone is as below

Field	AWS	Azure	GCP
`cloud.instance.id`	`cloud.region`	No Dependency	`cloud.availability_zone`

agithomas commented 1 year ago

It is identified that the values of cloud.region not appearing for GCP and cloud.account.id not appearing for Azure mentioned here is because of a bug. A PR will be created to fix this problem.

tetianakravchenko commented 1 year ago

It is identified that the values of cloud.region not appearing for GCP and cloud.account.id not appearing for Azure mentioned https://github.com/elastic/integrations/issues/5193#issuecomment-1531467427 is because of a bug. A PR will be created to fix this problem.

PR for Azure - https://github.com/elastic/beats/pull/35302 PR for GCP - https://github.com/elastic/beats/pull/35300

tetianakravchenko commented 1 year ago

as discussed with @agithomas:

It is recommended to include all those fields below to be on the safe side. But package developers can choose to pick a subset out of the recommended list after analyzing possible impacts.
For now this list is not enforced and not set as a default dimensions list. In the future it might be changed on the ECS side.

For packages that can be deployed on cloud/on-prem/k8s (examples - MongoDB, Nginx)

Field name	Explanation/reasoning
host.name	It is a host where the agent is running. Mainly used for cases when integration is installed on-prem. Note: we are not using host.id since it might be not unique
service.address	This field is present in case we provide a concrete target for scraping, like IP:PORT of some service (like mongodb), in some cases - it might be not needed to use this field
container.id	For now it is mainly used for: docker/containerd/kubernetes packages, this field is empty for other integrations. Container.id in this case is an id of the monitored container
cloud.account.id (new)	id used to identify different entities in a multi-tenant environment.
cloud.provider	To avoid minimal chance that account.id might be the same for different providers
cloud.region (new)	For services that are region specific
cloud.availabilityzone (new)_	For services that are zone specific
cloud.instance.id	host.name (can be defined manually by customer) is not unique enough. for Azure - instance.id is globally unique, AWS - region, GCP - availability zone. Technically for azure for example it would be enough to define cloud.instance.id only, but since it should be unified we include all fields: cloud.region/zone
agent.id	For cases when 2 data shipers are monitoring the same resource
cloud.project.id (deleted)

For Cloud-only Integration Packages / Managed Services Packages ( examples - AWS S3 )

Field name
cloud.account.id	required	cloud id used to identify different entities in a multi-tenant environment
cloud.region	required	For managed services that are region specific
cloud.availability_zone	required	For managed services that are zone specific
cloud.provided	Not needed	Because package specific fields already covers it
agent.id	Required	For cases when 2 data shipers are monitoring the same resource

agithomas commented 1 year ago

The above recommended list is based on based on the understanding we presently have on

dimensions in managed services running in cloud
dimensions a service / product need when running in on-prem, public-cloud in monolithic ,microservice manner.
based on the available fields in ECS.

The above list may be needed when more fields are added to the ECS & used. For example - details of the subnet (for on-prem infrastructure).

The above list will be used to prepare the RFC-1 of RFC-0

@ruflin , @felixbarny , @martijnvg

Kindly help by reviewing the new list mentioned here

ruflin commented 1 year ago

I stumbled over the following line.

agent.id: For cases when 2 data shipers are monitoring the same resource

If 2 agents are monitoring the same resource, shouldn't it be the same time serie? Can you provide an example on where this happens, this likely clarifies things.

agithomas commented 1 year ago

If 2 agents are monitoring the same resource, shouldn't it be the same time serie?

We can have one policy deployed on any number of agents. This permits two agents monitoring same resource. This may be done intentionally or accidentally by the customer.

Case 1: If intentionally, it is important that agent.id should be part of a dimension field so that data can be recorded as separate timeseries. A valid usecase i can think here is - a standalone elastic-agent may be running on single node monitoring several infra assets. The admin on understanding a problem related to disk or over-utilisation choose to migrate to a different system. As part of cut-over, during maintenance window, it is important that the user verifies data received from new agent is consistent . Without including agent.id, the data in ES from new agent will be recorded in staggered manner.

Case 2: If agent policy is installed accidentally on more than on agents, is elasticsearch expected to do the de-duplication making use of dimension field constraint (not a feature) of timeseries database ? We think, It may be best that a datastore is a true representation of data received from the upstream system, in this case integration packages.

agithomas commented 1 year ago

@ruflin , I have mentioned here , the reason why the agent.id must be included.

Do you think these usecases and scenario are valid to include agent.id? Or, should we consider it as exceptions and save a few bytes of _tsid field by removing agent.id ?

ruflin commented 1 year ago

At the moment, I would rather opt for too many then too few dimensions so I'm good with the approach.

botelastic[bot] commented 2 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

elastic / integrations

[ECS] [TSDB] Centralisation of Dimension Fields #5193