elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Ingest Manager] Add constant_keyword value to data_stream.name in template #66551

Open ruflin opened 4 years ago

ruflin commented 4 years ago

The constant_keyword mapping allows setting its value in the template in advance. In templates like logs-nginx.access-* we know in advance that data_stream.type: logs and data_stream.dataset: nginx.access. Because of this, we should set these values in the template to also enforce them.
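For illustration, a minimal sketch of what such a template could look like (template name and index pattern are examples, not the actual templates shipped by the registry):

```
PUT _index_template/logs-nginx.access
{
  "index_patterns": ["logs-nginx.access-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "data_stream": {
          "properties": {
            "type":    { "type": "constant_keyword", "value": "logs" },
            "dataset": { "type": "constant_keyword", "value": "nginx.access" }
          }
        }
      }
    }
  }
}
```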

This issue is to discuss where adding this should happen. At the moment the stream fields are added to each package/dataset on the registry side, but I'm not convinced that is the right place. We could also force package creators to add it and validate for it, or add it on the Kibana side.

For the base templates the values are added here: https://github.com/elastic/package-registry/pull/459

ruflin commented 4 years ago

@mtojek @skh @jfsiii Would be interesting to hear your thoughts on this from the perspective of the parts you are working on.

@neptunian Enforcing these fields can have an interesting side effect: when a mapping is applied, we already know in advance that there will be a conflict, for example because someone ingests logs-foo-default into logs-bar-default (for whatever reason), so install must be treated like an upgrade.

elasticmachine commented 4 years ago

Pinging @elastic/ingest-management (Team:Ingest Management)

neptunian commented 4 years ago

Since we already have other settings being set in getBaseTemplate(), there in Kibana seems like a good place for it.

@neptunian Enforcing these fields can have an interesting side effect: when a mapping is applied, we already know in advance that there will be a conflict, for example because someone ingests logs-foo-default into logs-bar-default (for whatever reason), so install must be treated like an upgrade.

I might be misunderstanding, but I'm not sure how this use case would come about. We already know it's an upgrade based on whether they have a version installed or not.

ruflin commented 4 years ago

@neptunian I'm talking about the case where no package was installed yet but data is already in the target index.

neptunian commented 4 years ago

Perhaps we should create a list of these types of edge cases to build, test, and QA for. Currently, if a package is not already installed, we don't try to update an index mapping.

ruflin commented 4 years ago

++ on creating this list. It will also be useful for Elasticsearch to test data stream behaviour and have it as part of their tests too (FYI @leehinman). @neptunian Could you start an issue around this with the edge cases we know so far?

ruflin commented 4 years ago

Updated the issue with the most recent renaming to dataset fields and unassigned myself.

ph commented 4 years ago

@ruflin @neptunian Is this still an issue?

ruflin commented 4 years ago

Yes, this is still open. But it is something that could also happen in the package: https://github.com/elastic/integrations/issues/226 I'm undecided on which is the better place. If we do it in the package, we should probably also enforce it in the package spec @ycombinator

ruflin commented 11 months ago

I would like to restart the discussion around this topic, but make it broader: not only data streams belonging to an integration should have the field set, but everything that goes into the data stream naming scheme. This ensures no wrong values end up in a data stream and prefiltering on a dataset always works. Here are a few examples that fail / are possible today.

Value not set: empty result at first

POST logs-test-bar/_doc
{
  "@timestamp": "2023-10-05T13:12:00",
  "message": "My message"
}

# No document is returned
GET /logs-*-*/_search
{
  "query": {
    "term": {
      "data_stream.dataset": "test"
    }
  }
}

POST logs-test-bar/_doc
{
  "@timestamp": "2023-10-05T13:12:00",
  "message": "My message",
  "data_stream.dataset": "test"
}

# 2 documents are returned: the second document set the constant_keyword
# value on the backing index, so the first document now matches as well
GET /logs-*-*/_search
{
  "query": {
    "term": {
      "data_stream.dataset": "test"
    }
  }
}

If a rollover had been triggered between the two requests above, only 1 document would have been returned.
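For reference, a rollover can be triggered manually like this. The constant_keyword value is fixed per backing index, so after a rollover the new write index starts without a value until the first document sets it, and the term query then only matches documents in the index whose value matches:

```
POST logs-test-bar/_rollover
```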

Broken example

In the scenario below, a malicious document sets the wrong value and prevents all future data from being ingested:

POST logs-test-bar/_doc
{
  "@timestamp": "2023-10-05T13:12:00",
  "message": "My message",
  "data_stream.dataset": "foo"
}

# Returns results from the wrong index
GET /logs-*-*/_search
{
  "query": {
    "term": {
      "data_stream.dataset": "foo"
    }
  }
}

# Ingestion breaks: the constant_keyword value on the index is
# already "foo", so this document is rejected
POST logs-test-bar/_doc
{
  "@timestamp": "2023-10-05T13:12:00",
  "message": "My message",
  "data_stream.dataset": "test"
}

Potential solution

Ideally, every template that gets created would have the constant_keyword value already set. This is not possible today. The alternative is to make sure each document has the fields set. My understanding is that the reroute processor already does something like this today when it is used. @felixbarny Could we have such logic on all data going into the data stream naming scheme?

felixbarny commented 11 months ago

My understanding is the reroute processor already does something like this today if it is used.

That's correct.

@felixbarny Could have such a logic on all data going into the data stream naming scheme?

Yes, we can add an empty reroute processor, which would make sure that the data_stream.* values are set correctly in the document. While we can do that for integrations and the built-in index templates, users could still create custom index templates without that default reroute processor. It's not bullet-proof, but I think it will cover most cases.
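As a sketch, such an empty reroute processor could live in a pipeline that templates reference as their default pipeline (pipeline name is hypothetical). With no options set, the reroute processor derives the target data stream from the data_stream.* fields and the naming scheme, and sets those fields consistently in the document:

```
PUT _ingest/pipeline/logs-default-reroute
{
  "processors": [
    { "reroute": {} }
  ]
}
```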

Do you have a sense of how often that issue actually happens in practice? While I agree it's an issue, does it make sense to focus on that right now?

ruflin commented 10 months ago

Do you have a sense of how often that issue actually happens in practice? While I agree it's an issue, does it make sense to focus on that right now?

The broken one I have only seen 2-3 times in production, and it was accidental. The part that concerns me much more is the queries that don't return results. We encourage users to efficiently filter down on dataset values because it is a constant_keyword, but the effect can be that users get no results.

There is a parallel discussion with @felixbarny @Kerry350 around how we efficiently generate the list of datasets (private link). Which API should we use for this? We could do a terms aggregation on logs-*-* for data_stream.dataset, but if the values are not set, the returned result is not correct.
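For reference, such a dataset listing could be generated with a terms aggregation like the following (the size value is illustrative). Documents where data_stream.dataset is neither set on the document nor as a constant_keyword value on the index simply do not show up in the buckets, which is exactly the problem described above:

```
GET logs-*-*/_search
{
  "size": 0,
  "aggs": {
    "datasets": {
      "terms": {
        "field": "data_stream.dataset",
        "size": 1000
      }
    }
  }
}
```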

Does this happen frequently? I expect the query problem to affect most logs data that is not shipped via Elastic Agent into the data stream naming scheme, as the data_stream.* fields are missing there.

ruflin commented 9 months ago

We need to fix this eventually in Elasticsearch. Before we get to that, we could also show some info around this on the dataset quality page. We would go through all the datasets, get the mappings, check if the value is set and, if yes, whether it is the correct value. In case a value is not set yet, we can offer users to "fix" it by adding it to the mapping and template. @yngrdyn

I wrote a very quick Python prototype to check the idea; no error checking or anything. This is to share the idea in more detail, not code meant for reuse:

from elasticsearch import Elasticsearch

client = Elasticsearch(
    'https://localhost:9200',
    verify_certs=False,
    basic_auth=("elastic", "changeme")
)

resp = client.indices.get_data_stream(name="logs-*-*")

for ds in resp['data_streams']:
    # The last index is the write index
    # https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-data-stream.html
    last_index = ds['indices'][-1]['index_name']
    data_stream_name = ds['name']
    data_stream_parts = data_stream_name.split("-", 2)

    # Get the mapping of the current write index
    mapping = client.indices.get_mapping(index=last_index)

    # Get the constant_keyword values and compare them to the data stream name parts
    # TODO: add a check whether the value is set at all; if not, this is an error
    # and we can offer the user to set the value
    properties = mapping[last_index]['mappings']['properties']['data_stream']['properties']
    dataset_value = properties['dataset'].get('value')
    namespace_value = properties['namespace'].get('value')

    if dataset_value != data_stream_parts[1]:
        # This is a big problem
        print(f"dataset mismatch in {data_stream_name}: {dataset_value}")

    if namespace_value != data_stream_parts[2]:
        # This is a big problem
        print(f"namespace mismatch in {data_stream_name}: {namespace_value}")
At the moment, this only checks the write index. In case the dataset value is missing on older indices, we should add it there too to make queries faster.
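The core check in the prototype can be isolated as a pure function, which makes the idea testable without a cluster (function name and return shape are my own, not part of the prototype):

```python
def check_constant_keyword_values(data_stream_name, data_stream_properties):
    """Compare the constant_keyword values in a mapping's data_stream
    properties against the type-dataset-namespace parts of the name.
    Returns a list of (field, expected, actual) mismatches; a missing
    value is reported as actual=None."""
    parts = data_stream_name.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"not a data stream naming scheme name: {data_stream_name}")

    mismatches = []
    for field, expected in (("type", parts[0]),
                            ("dataset", parts[1]),
                            ("namespace", parts[2])):
        # 'value' is absent when the constant_keyword value was never set
        actual = data_stream_properties.get(field, {}).get("value")
        if actual != expected:
            mismatches.append((field, expected, actual))
    return mismatches
```

The caller would pass in mapping[last_index]['mappings']['properties']['data_stream']['properties'] from the prototype above.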