elastic / kibana

[Fleet] Update `agent.version` mapping type in `.fleet-agents` index from `keyword` to `version` #168604

Open criamico opened 11 months ago

criamico commented 11 months ago

Blocked by https://github.com/elastic/elasticsearch/pull/109093

In Kibana versions >= 8.10.x, a bug was found in the response of range queries when the mapping is of `keyword` type.

Original ticket: https://github.com/elastic/kibana/issues/167139

The reason seems to be that version numbers sort correctly as keyword only until a version segment has more than one digit: the comparison is lexicographic, so for example "8.9.0" sorts after "8.11.0".
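
A minimal Dev Tools sketch (not from the original thread; the index and field names are illustrative) that reproduces the problem on a throwaway index, assuming the version field type is available:

PUT /version-sort-test
{
  "mappings": {
    "properties": {
      "as_keyword": { "type": "keyword" },
      "as_version": { "type": "version" }
    }
  }
}

POST /version-sort-test/_bulk?refresh=true
{ "index": {} }
{ "as_keyword": "8.9.0", "as_version": "8.9.0" }
{ "index": {} }
{ "as_keyword": "8.11.0", "as_version": "8.11.0" }

# keyword comparison is lexicographic: "8.11.0" < "8.9.0" because "1" < "9",
# so this range query returns no hits even though 8.11.0 is the newer release.
GET /version-sort-test/_search
{
  "query": { "range": { "as_keyword": { "gt": "8.9.0" } } }
}

# The same query against the version-typed field compares semantic versions
# and correctly returns the 8.11.0 document.
GET /version-sort-test/_search
{
  "query": { "range": { "as_version": { "gt": "8.9.0" } } }
}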

Suggested solutions

  1. Switching the field in question to “version”
  2. Switching the field in question to the “version” field type in newly created indices and fixing older indices using a runtime field (see this comment). This option seems a bit "hacky" and might also be slower.

Tasklist

Determine how updating this mapping would impact dashboards, visualizations, etc. when agent.version may be mapped as two different types in query results. This would be a breaking change and we should assess the possible impacts across the whole UI before proceeding with a fix.

elasticmachine commented 11 months ago

Pinging @elastic/fleet (Team:Fleet)

kpollich commented 9 months ago

Could we do something like this?

  1. Add a new mapping to the agent index e.g. agent.version_2 (a better name would be great) with the proper version mapping type
  2. Create a migration script to backfill the new mapping with the old values for agent.version

Or, we could update the mapping type on the existing field and run the same kind of migration script to update any existing documents in the .fleet-agents index during a stack upgrade.
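
For illustration, a backfill of existing documents could look roughly like the sketch below (the agent.version_2 name is just the placeholder from the suggestion above; also note that .fleet-agents is a system index, so the new field's mapping would have to come from the index descriptor on the Elasticsearch side rather than an ad-hoc mapping update):

POST /.fleet-agents/_update_by_query?conflicts=proceed
{
  "query": {
    "bool": {
      "must": { "exists": { "field": "agent.version" } },
      "must_not": { "exists": { "field": "agent.version_2" } }
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.agent.version_2 = ctx._source.agent.version;"
  }
}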

criamico commented 9 months ago

Add a new mapping to the agent index e.g. agent.version_2 (a better name would be great) with the proper version mapping type

This would be a good solution as it's not a breaking change. The only downside is that we already have some duplication: agent.version and local_metadata.elastic.agent.version hold the same info. We could just copy the content of either of those two fields to the new one and keep only the new one searchable. When https://github.com/elastic/kibana/issues/171425 is completed we'll be able to control the searchable mappings in an easier way.

Also, we could just call the new field version so that when users search for it they don't have to look at nested objects.

Or, we could update the mapping type on the existing field and run the same kind of migration script to update any existing documents in the .fleet-agents index during a stack upgrade.

Not sure if this option would make things slower as was initially mentioned.

MakoWish commented 8 months ago

#167139 is still an issue in 8.11.4, but it is closed and pointing here. The KB here says the fix is to "Upgrade Kibana to 8.10.4 or more recent", but we are on 8.11.4, and the issue is still present.

criamico commented 8 months ago

Hi @MakoWish, https://github.com/elastic/kibana/issues/167139 listed two different bugs. The first one (400 when searching for agent.version: "8.10.0") got fixed with https://github.com/elastic/kibana/pull/168329 and the fix is available since 8.10.4.

The second bug (related to incorrect versions ordering) will be fixed with this ticket, but unfortunately we still haven't got to it.

@jen-huang do you think this could be added to one of the next sprints?

MakoWish commented 8 months ago

Any plain-text searching fails, not just an agent.version search. We used to be able to search for "Computer1" to find a specific device, but that has now been failing for several months. We are on 8.11.4, and I would have expected this to be fixed by now. I have a support ticket open for this as well.

kpollich commented 4 months ago

I was hoping I'd be able to test the impact of changing the mapping type for this property on the fly, but it doesn't look like that's possible, e.g.

PUT /.fleet-agents/_mapping
{
  "properties": {
    "agent": {
      "version": {
        "type": "version"
      }
    }
  }
}

# Response
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason": """Cannot update mappings in [.fleet-agents-7]: system indices can only use mappings from their descriptors, but the mappings in the request [{"properties":{"agent":{"version":{"type":"version"}}}}] did not match those in the descriptor(s)"""
      }
    ],
    "type": "illegal_state_exception",
    "reason": """Cannot update mappings in [.fleet-agents-7]: system indices can only use mappings from their descriptors, but the mappings in the request [{"properties":{"agent":{"version":{"type":"version"}}}}] did not match those in the descriptor(s)"""
  },
  "status": 500
}

So, to test this we'd need to boot up both ES and Kibana from source, which is a bit time-consuming. The agent.version property is defined here in ES:

https://github.com/elastic/elasticsearch/blob/14263a78e88c0197cbccf1c94aafda32898b7c30/x-pack/plugin/core/template-resources/src/main/resources/fleet-agents.json#L28-L30

We need to test what happens with integration dashboards that rely on this property when its type is changed, as well as verify that KQL searches against this property behave as expected with respect to version comparisons.

I don't have the bandwidth to continue testing this right now, so I'll bring this along into our next sprint and have one of the engineers on the UI & Integrations team take another look.

jillguyonnet commented 4 months ago

Ongoing investigation summary

Reminder: current impact of having agent.version mapped to keyword type

This affects searching and sorting, as the version values are compared alphabetically. For instance, given multiple agents on versions 8.8.0, 8.9.0, 8.11.0 and 8.14.0:

  • agent.version < 8.14.0 will only return the agent on version 8.11.0
  • agent.version > 8.14.0 will return the agents on versions 8.8.0 and 8.9.0

The behaviour is identical for local_metadata.elastic.agent.version.

Approach 1: change agent.version mapping type to version

This seems to be a fairly involved process. As a reminder, the mapping type of an existing field can't be changed in place, so this is not as straightforward as updating the mapping definition. One possible solution could be reindexing the data into a new index with the updated mapping, which at a high level would likely involve bumping the current index version, introducing the new mapping with the changes, and then reindexing the data. This is a risky procedure that should be assessed in more detail. Note that @jfreden is working on a similar change and has kindly offered to answer questions (Slack thread). Edit: this work actually populates a new field based on an existing field, so it is a different scenario.

FWIW, I played a bit in the dev tools with reindexing agent data into a new index whose mapping is identical to .fleet-agents except that I changed the agent.version type to version. The reindexing happened with no issues and the data ordering is correct.
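
Roughly, such a dev tools experiment could look like the sketch below (the destination index name is made up and the mapping is trimmed to the relevant field; the real .fleet-agents mapping is much larger):

PUT /fleet-agents-version-test
{
  "mappings": {
    "properties": {
      "agent": {
        "properties": {
          "id": { "type": "keyword" },
          "version": { "type": "version" }
        }
      }
    }
  }
}

POST /_reindex
{
  "source": { "index": ".fleet-agents" },
  "dest": { "index": "fleet-agents-version-test" }
}

# Sorting and range queries on agent.version now follow semantic version order.
GET /fleet-agents-version-test/_search
{
  "sort": [ { "agent.version": "asc" } ]
}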

Approach 2: add new version field with version type and backfill it

Suggested in https://github.com/elastic/kibana/issues/168604#issuecomment-1831986672. I imagine this would still require reindexing, since we would need to populate the new field for existing documents, so the effort is probably similar to the previous approach. While it would have the advantage of safety, it might lead to confusing behaviour, where version orders differently than agent.version and local_metadata.elastic.agent.version.

Approach 3: add new runtime version field with version type

Suggested in https://github.com/elastic/kibana/issues/167139#issuecomment-1757403193. Probably easier to implement and low risk, but performance could be an issue and it would suffer from the same potentially confusing behaviour as the second approach. Edit: runtime fields don't support the version type, so this would require converting the keyword value into a type with the desired sorting, e.g. a long.

jlind23 commented 4 months ago

@jillguyonnet Not sure I understood the conditions below. Why does the same condition return different versions?

agent.version < 8.14.0 will only return the agent on version 8.11.0
agent.version < 8.14.0 will return the agents on version 8.8.0 and 8.9.0

jillguyonnet commented 4 months ago

Thanks @jlind23 - that was a typo, I corrected it. 🙂 It should read: agent.version > 8.14.0 will return the agents on version 8.8.0 and 8.9.0

jillguyonnet commented 4 months ago

I've also tested adding a sub-field to the existing agent.version and local_metadata.elastic.agent.version fields. While this is far easier and non-breaking, I'm not sure we'd want a new agent.version.version field, which would order differently than agent.version and local_metadata.elastic.agent.version.

Starting state: (screenshot)

Searching by agent.version > 8.8.1: wrong filtering because agent.version has type keyword: (screenshot)

Searching by agent.version.version > 8.8.1: correct filtering because agent.version.version has type version: (screenshot)
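
For reference, a multi-field mapping along the lines of what was tested could look like the sketch below, shown on a regular test index since the real change would have to land in the .fleet-agents system index descriptor (the index name is illustrative):

PUT /fleet-agents-subfield-test
{
  "mappings": {
    "properties": {
      "agent": {
        "properties": {
          "version": {
            "type": "keyword",
            "fields": {
              "version": { "type": "version" }
            }
          }
        }
      }
    }
  }
}

# agent.version keeps its keyword semantics, while the agent.version.version
# sub-field compares values as semantic versions.
GET /fleet-agents-subfield-test/_search
{
  "query": { "range": { "agent.version.version": { "gt": "8.8.1" } } }
}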

jlind23 commented 3 months ago

Approach 1 seems to be the cleanest one, but I understand that it is also the most complex. In that sense, approach 2 is probably the best one even if it implies adding a new field. We could document that there is a change in behaviour, and I hope this would avoid issues and SDHs being created.

@nimarezainia @kpollich any thoughts on this?

Oscurz commented 3 months ago

Can you put all my Elastic Kibana data from my Google Cloud service back how it was?


kpollich commented 3 months ago

Approach number 1 would be my preference here as I think the searching UX will be confusing regardless of how well we document this. Having the search "work" for both agent.version and agent.version.version with different results will not be immediately obvious within the product itself. If the risks/complexity with option 1 are mitigable I think it's worth pursuing as it feels like the best root cause fix here.

jillguyonnet commented 3 months ago

Approach number 1 would be my preference here as I think the searching UX will be confusing regardless of how well we document this. Having the search "work" for both agent.version and agent.version.version with different results will not be immediately obvious within the product itself. If the risks/complexity with option 1 are mitigable I think it's worth pursuing as it feels like the best root cause fix here.

I agree 👍 Ideally, I would like to get a better idea of the kind of Elasticsearch work required to migrate to a new index with the correct mapping, which I haven't succeeded in doing in my local setup so far, in order to better estimate the effort and properly assess the impact on Fleet and integration assets. FWIW, I think there would be ES work required for backfilling anyway (see e.g. https://github.com/elastic/elasticsearch/pull/108318).

An update to approach 3: I realised that runtime fields don't support the version type (I tested this locally). The suggested workaround involved a long-typed runtime field that converts the version keyword to a sortable number.
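
To illustrate that workaround, a search-time runtime field (defined in the request, so it doesn't touch the system index mapping) could pack the keyword value into a sortable long; the field name and the packing scheme below are assumptions, and pre-release suffixes are simply stripped:

GET /.fleet-agents/_search
{
  "runtime_mappings": {
    "agent.version_sortable": {
      "type": "long",
      "script": {
        "source": """
          if (doc['agent.version'].size() > 0) {
            String v = doc['agent.version'].value;
            // Strip a pre-release suffix, e.g. "8.11.0-SNAPSHOT" -> "8.11.0"
            int dash = v.indexOf('-');
            if (dash > 0) { v = v.substring(0, dash); }
            String[] parts = v.splitOnToken('.');
            long major = Long.parseLong(parts[0]);
            long minor = parts.length > 1 ? Long.parseLong(parts[1]) : 0L;
            long patch = parts.length > 2 ? Long.parseLong(parts[2]) : 0L;
            // Pack into a single long so numeric sorting matches version order.
            emit(major * 1000000 + minor * 1000 + patch);
          }
        """
      }
    }
  },
  "sort": [ { "agent.version_sortable": "asc" } ]
}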

Snapshot summary of the approaches discussed so far:

  1. Change mapping type from keyword to version for agent.version and local_metadata.elastic.agent.version
    • Pros: no new field, same UX, would essentially fix the sorting bug
    • Cons: risky, requires involved ES work (new index, reindexing existing data)
  2. Add a new top-level version field with version type
    • Pros: low risk, easy to add (maybe more involved work for backfilling)
    • Cons: new field that populates the agent version in a third place and behaves differently from the existing two, so guaranteed confusing UX. I wonder if a way to mitigate this could be to consider phasing out the existing agent.version and local_metadata.elastic.agent.version fields, or at least hide them in the Kibana search box.
  3. Add a runtime field
    • Pros: low risk, easy to add in theory
    • Cons: need to convert the version to a sortable type, i.e. long, performance concerns, same confusing UX concerns as 2.
  4. Add a sub-field with version type to the existing agent.version and local_metadata.elastic.agent.version fields
    • Pros: low risk, easy to add (backfilling to check)
    • Cons: searching for agent.version.version (which orders differently than agent.version) is terribly confusing UX

jlind23 commented 3 months ago

@jillguyonnet @kpollich now that you've worked on the investigation, what should the next step be?

jillguyonnet commented 3 months ago

Adding here the latest information I got from discussing potential Elasticsearch work with @jfreden who is working on a framework for facilitating system index migrations in https://github.com/elastic/elasticsearch/pull/109093.

https://github.com/elastic/elasticsearch/pull/109106 provides an example of how this framework could be used once merged. The guidelines are:

jillguyonnet commented 3 months ago

@jillguyonnet @kpollich now that you've worked on the investigation, what should the next step be?

Ideally, I would have liked to properly test the impact of changing the type of the existing fields. With @jfreden's change mentioned above, this could definitely be possible but not guaranteed to be quick.

Workaround options involving a new field actually sound worse to me than doing nothing, since they wouldn't fix the bug and would introduce confusing UX.

I would be in favour of considering the field type change approach, but perhaps delaying the next step until https://github.com/elastic/elasticsearch/pull/109093 is merged. It might make sense to break down this followup work as follows:

  • A second investigation focusing on performing the type change and reindex locally and test the impact on Fleet and integration assets
  • Assuming the investigation is satisfactory, do the actual system index migration work
  • Perform a QA check in Fleet UI and integration assets and verify that the sorting bug is resolved

kpollich commented 3 months ago

+1 on delaying follow-up work here until https://github.com/elastic/elasticsearch/pull/109093 is done.

It might make sense to break down this followup work as follows:

  • A second investigation focusing on performing the type change and reindex locally and test the impact on Fleet and integration assets
  • Assuming the investigation is satisfactory, do the actual system index migration work
  • Perform a QA check in Fleet UI and integration assets and verify that the sorting bug is resolved

This makes sense to me as our path forward. I'm going to take this off your plate, @jillguyonnet, and move this into a future sprint while the ES work is still pending. Thanks for your thorough investigation here!

kpollich commented 3 months ago

Updated the description of the issue since the investigation is done. This feels like the best way to maintain visibility on the investigation and path forward, rather than creating another issue for the implementation work 🙂

jfreden commented 3 months ago

Just to follow up from the ES side, the https://github.com/elastic/elasticsearch/pull/109093 PR was closed and we won't work on it now. See full context here. Happy to discuss more if you have any concerns or questions. Thanks!

kpollich commented 3 months ago

Thanks for the context, @jfreden.

SystemIndexMigrationExecutor is very similar, but is handled through an API and a Kibana UI when upgrading between major versions. As opposed to this approach, the intention of the new framework proposed in this PR is to keep it completely automated.

Is this something we could use to change this mapping type between major versions, or would that still not be supported? Wondering if we could add this to our list of 9.x breaking changes and punt on it until the next major.

jfreden commented 3 months ago

I think that you could do that. Both the security and watcher indices have done that in the past (6->7, 7->8). Not sure what the upgrade from 8->9 will look like, though, and I don't have experience with that.