Pinging @elastic/fleet (Team:Fleet)
Just wanted to share an experience I had with a 7.16 snapshot in Cloud staging. For quick demos, I typically add a system integration, choose the default config and run with it. This is the first time I've received this error. I'm sure it's intentional, but I'd imagine many folks will just click through when onboarding to Elastic for the first time.
There is already a package with the same name on this agent policy
is the error received. First, I don't believe we reference packages anywhere anymore. For a new user, this error doesn't make much sense. Second, if this is in fact primarily an issue with duplicate names or IDs, can the system integration that's automatically installed with Kibana simply have a different name than the default value? Not sure if this is a quick fix, but if so, it might make sense to update for 7.16 as I imagine this integration could be used more.
@alexfrancoeur this is a regression we captured in https://github.com/elastic/kibana/issues/116475. We have a PR close to landing that fixes the regressions, as well as guarantees globally unique policy names here: https://github.com/elastic/kibana/pull/115212
Awesome, thanks for the update @kpollich !
@mostlyjason @joshdover I went through the issue and comments in linked issue. Here is my analysis so far, let me know your thoughts.
Current state after Fleet setup:
- `Default policy`: has the `system` integration by default (more than 100 assets)
- `Default fleet server policy` / `Elastic Cloud agent policy`: has the `fleet-server` integration by default (no assets)
- `elastic-agent` package is installed, not assigned to an agent policy (6 assets)

Proposal:
Phase 1: `Default policy` with the `system` package and the `elastic-agent` package

Both on-prem and cloud:
- Do not create `Default policy` on setup
- `Add Agent` flyout: when no agent policies exist, add a link to the `Create agent policy` flyout
- `Add integration`: see improvements in design doc (New hosts/Existing hosts)
- `Create Agent policy` flyout: see improvements in design doc
- `elastic-agent` package: do not install by default, only with the first agent policy (see above)
Phase 2: `Default fleet server policy` with the `fleet-server` package

On-prem:
- Create `Default fleet server policy` by default

Cloud:
- Should we keep `Elastic Cloud agent policy` as is? Since Fleet Server is managed by cloud, does it make sense to delay the setup?
- Confirmed that the cloud flow shouldn't change; we should make sure to install the `fleet-server` integration on cloud by default (the logic might have to move, since currently this is done by Fleet startup)
Additional tasks:
EDIT: saw the UX design doc, commented a few things there. Overall the proposed design makes sense to me.
@juliaElastic I have some questions about your task list. I'll schedule a meeting to review it with Dmitry.
@mostlyjason I added my comment before seeing the designs; I've updated it now to reflect my understanding after reviewing them.
In general I'm supportive of the change to create the policy on demand. It is possible that some users will consider this a breaking change. Assuming someone had automation in place for setting up Kibana and assumed the policy was there by default, this would now break. What is our recommendation for these users?
For the cloud change: what happens exactly when the system package is removed and someone already has <8.1 running? Will the preconfigure API now remove the packages that are no longer listed / clean up, or just ignore them?
@ruflin For the first point, I think we can recommend these users to add a step to their automation to create the policies first (can be done with API, as I have done in our automation tests).
As for the cloud change, I think the change is not destructive; it will ignore the system package or default policy if one exists.
@joshdover @mostlyjason please correct me if I'm wrong.
we can recommend these users to add a step to their automation to create the policies first
The problem with this is that a user upgrading from 8.0 to 8.1 will assume no changes to the automation are needed. Ideally there is a migration path that works across a few versions meaning the user can do the change to the automation in 8.0 and the changes don't have to be made at the same time as the upgrade. Likely by already coding in the id this should be possible?
@ruflin I'm not quite following, could you rephrase?
Let's assume we have a user X. This user has created all the automation to enroll Elastic Agents in 8.0. Now the stack is upgraded to 8.1 and more Elastic Agents need enrolling, or a fresh cluster is set up. Suddenly the automation that X built is not working anymore. Instead, the user should get a message in 8.1 that some parts of the scripts should change, and only a few iterations later we can make the breaking change.
@ruflin Thanks, I understand now. So are you suggesting that we can't remove the default policies in 8.1, as it might break automation scripts? Would this mean the whole feature has to be delayed? I'm concerned, since this feature is part of our 8.1 priorities. What should we do @joshdover @mostlyjason @jen-huang ?
My main point is to make sure this has been thought through from a user upgrade / migration perspective.
Let's assume we have a user X. This user has created all the automation to enroll Elastic Agents in 8.0. Now the stack is upgraded to 8.1 and more Elastic Agents need enrolling, or a fresh cluster is set up. Suddenly the automation that X built is not working anymore.
Can you explain how this affects agent enrollment on upgraded clusters? I believe we are not deleting any agent policies, so users should be able to continue enrolling agents into the "default policy". We're not creating new default policies, but that shouldn't affect existing policies.
For the scenario where a new cluster is set up using automation, if they attempt to use the Fleet API to enroll agents, they will get an error saying no agent policy exists. However, our API is still marked as experimental. We don't even have official docs for our API, so I imagine it's not commonly used. It seems like a change we could document in our release notes. Am I overlooking a use case?
@juliaElastic Can you confirm that this change does not affect cloud deployments with stack version < 8.1? The preconfiguration is versioned along with the stack?
Let's assume we have a user X. This user has created all the automation to enroll Elastic Agents in 8.0. Now the stack is upgraded to 8.1 and more Elastic Agents need enrolling, or a fresh cluster is set up. Suddenly the automation that X built is not working anymore. Instead, the user should get a message in 8.1 that some parts of the scripts should change, and only a few iterations later we can make the breaking change.
Can you explain how this affects agent enrollment on upgraded clusters? I believe we are not deleting any agent policies, so users should be able to continue enrolling agents into the "default policy". We're not creating new default policies, but that shouldn't affect existing policies.
I think Jason is correct for the upgraded cluster scenario. We are not deleting any existing policies (default or otherwise), so any user automation should still work. There is the chance that their automation relies on the ability of Elastic Agent to pick up the "default policy" (without explicitly passing along a policy ID), but the flags that tell the agent which policy is the default are not being removed at this time, as discussed in https://github.com/elastic/beats/issues/29774.
@mostlyjason @jen-huang Thanks for your thoughts, you are right, existing policies are not being changed/removed, so on upgrade, we are not expecting any breaking change. Only for new deployments, users would need to add a step to create a policy before enrolling agents.
Cloud deployments with version < 8.1 will not be affected, the change will be conditional for >= 8.1 versions only: https://github.com/elastic/cloud-assets/pull/912/files
I agree to document this change in release notes.
Ok, it sounds like only fresh clusters will hit an automation issue here.
However, our API is still marked as experimental. We don't even have official docs for our API, so I imagine it's not commonly used.
I assume you're referring to the title here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/common/openapi/README.md I think this is something we should urgently fix; my take is that only the OpenAPI spec is experimental, not the APIs themselves. One of the core principles when we built Fleet is to always have everything based on APIs. Every feature that goes GA means the API must go GA too. We use the APIs for internal automation quite a bit and I expect users to do the same. If all our APIs are experimental, it would mean Fleet itself is experimental to me.
Thinking about this again, this will very likely break ECK (@david-kow ) and elastic-package (@mtojek @jsoriano ). Before this change goes in, please coordinate with these teams to make sure their services are not suddenly broken.
Update: Likely the e2e test suite (@mdelapenya ) and our test clusters (@kuisathaverat ) will also be affected by this.
One more dependency was raised: the Azure VM extension relies on the default policy being present: https://github.com/elastic/azure-vm-extension It has to be updated to create a default policy via the API if one does not exist. (@ravikesarwani )
Here is the minimal API call to create a policy (with system and agent monitoring):
POST kibana_host/api/fleet/agent_policies?sys_monitoring=true
kbn-xsrf: kibana
{"name":"Agent policy 1","namespace":"default","monitoring_enabled":["logs","metrics"]}
One more dependency was raised: the Azure VM extension relies on the default policy being present: elastic/azure-vm-extension It has to be updated to create a default policy via the API if one does not exist. (@ravikesarwani )
Here is the minimal API call to create a policy (with system and agent monitoring):
POST kibana_host/api/fleet/agent_policies?sys_monitoring=true
kbn-xsrf: kibana
{"name":"Agent policy 1","namespace":"default","monitoring_enabled":["logs","metrics"]}
This requires that the user used by the VM extension has access to the Kibana/Fleet API for creating an agent policy. This user can either be a superuser or, assuming https://github.com/elastic/kibana/pull/122347 makes it for 8.1 as well, the user can have Fleet privileges granted. I'm not sure what the current privilege requirements are today, but I hope this wouldn't be a problem.
I suspect it should work, since I'm guessing the extension already needs to make Kibana/Fleet API calls today to retrieve the default policy ID. I also think it would be best for this extension to not use a default policy or rely on any default data being set up by Fleet first. Instead, this extension should create its own agent policy with a well-known name or ID that is only used by this extension.
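To illustrate, a minimal sketch of that create-if-missing flow (Python for brevity; the policy id and name are hypothetical, and it assumes the create endpoint accepts an explicit id, as the preconfiguration examples later in this thread do):

import requests

KIBANA_HOST = "https://localhost:5601"
AUTH = ("elastic", "changeme")           # placeholder credentials
HEADERS = {"kbn-xsrf": "kibana"}
POLICY_ID = "azure-vm-extension-policy"  # hypothetical well-known id

def ensure_policy() -> str:
    """Return the id of the extension's policy, creating it if needed."""
    # Look the policy up by its well-known id first.
    resp = requests.get(
        f"{KIBANA_HOST}/api/fleet/agent_policies/{POLICY_ID}",
        headers=HEADERS, auth=AUTH,
    )
    if resp.ok:
        return resp.json()["item"]["id"]
    # Not found (or not yet created): create it with the well-known id.
    resp = requests.post(
        f"{KIBANA_HOST}/api/fleet/agent_policies",
        params={"sys_monitoring": "true"},
        headers=HEADERS, auth=AUTH,
        json={
            "id": POLICY_ID,
            "name": "Azure VM extension policy",  # hypothetical name
            "namespace": "default",
            "monitoring_enabled": ["logs", "metrics"],
        },
    )
    resp.raise_for_status()
    return resp.json()["item"]["id"]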
I assume you're referring to the title here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/common/openapi/README.md I think this is something we should urgently fix; my take is that only the OpenAPI spec is experimental, not the APIs themselves. One of the core principles when we built Fleet is to always have everything based on APIs. Every feature that goes GA means the API must go GA too. We use the APIs for internal automation quite a bit and I expect users to do the same. If all our APIs are experimental, it would mean Fleet itself is experimental to me.
I think we need to get alignment on how our API is being used today and the intended goals for it. We are seeing quite a bit of internal usage of this API in test suites and other automation, and breakages are causing a large amount of disruption. A path to GA for the API will be shaped by these requirements, but I've put together a new issue with the very basics as a starting point. This should allow us to shore up the API and start making incremental improvements toward a stable API: https://github.com/elastic/kibana/issues/123150
@narph Would you be able to comment on how feasible it is for the VM extension to make the API call above?
Thanks, I wasn't aware that our API was used in this way by internal users. It'd be good to prevent/avoid disruption if we can.
@joshdover and @juliaElastic is there any workaround/shim we can provide API users so that this is not a breaking change for them? For example, could we generate an agent policy on demand if they make a request that seems to expect/assume the old way of generating default policies? We could add a deprecation notice to the logs when they make this kind of request. Would this create any unwanted side effects?
@mostlyjason From what I've seen of current usage, the API call being made to Fleet is to query all agent policies; the logic that tries to find the default policy is internal to the calling app. So I think this API call is too generic to assume someone is looking for a default policy, unless we can identify the caller by origin/user-agent or some other way. Regardless, I don't really like the approach of a GET API call potentially having a side effect.
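For reference, the call in question is roughly (in the same style as the example above):

GET kibana_host/api/fleet/agent_policies

The response is a paged list of policies (items), so identifying a "default" policy is entirely up to the caller.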
I think we need to get alignment on how our API is being used today and the intended goals for it. We are seeing quite a bit of internal usage of this API in test suites and other automation, and breakages are causing a large amount of disruption. A path to GA for the API will be shaped by these requirements, but I've put together a new issue with the very basics as a starting point. This should allow us to shore up the API and start making incremental improvements toward a stable API: #123150
I admit I haven't read the entire thread, but last time I ignored a similar one, it resulted in removing an API we used in elastic-package.
FYI we're using these APIs to manage package policies and agent policies while executing system tests.
AFAIR there is also a similar usage on the APM side.
@mtojek the APIs are not changing as part of this feature, however we are removing the default policies as part of setup (except for Elastic Cloud agent policy in cloud). So, in case your project relies on a default policy being present, it should be changed to create a policy first with the API, and use that. e.g.
POST kibana_host/api/fleet/agent_policies?sys_monitoring=true
kbn-xsrf: kibana
{"name":"Agent policy 1","namespace":"default","monitoring_enabled":["logs","metrics"]}
Due to the dependencies on this change (primarily ECK and the Azure VM extension for MP++, but also our internal testing automation), it's possible that we may need to delay this change until those components have had time to update. Certainly a learning opportunity for us here; we'll need to consider these types of dependencies much earlier in our planning process going forward.
I would like to open up the discussion about retaining the Default Fleet Server policy as it may provide us some benefit in decoupling this change without any (?) downside on the UX goals of this change. Here's my reasoning here:
- The `fleet_server` package does not install assets by default; its only purpose is to provide a way for users to edit and update configuration for Fleet Server.
- We still need the `fleet_server` package in Cloud, as APM & Fleet are deployed by default and require this package.
- On-prem, users (or tools like ECK) could add the `fleet_server` package back to the `xpack.fleet.agentPolicies` settings in `kibana.yml`. This essentially emulates the behavior we had before to create this policy by default (a sketch follows at the end of this comment).
- There is little downside to installing the `fleet_server` package for these users, since it currently contains no assets.

In the future, the `fleet_server` package may contain assets. If and when it does, we would want to make sure not to install this package unless Fleet is being used (again, this only applies to on-prem users). If retaining the Default Fleet Server policy eliminates some of the changes needed by these dependencies and allows us to unblock shipping this in 8.1, I think we can delay solving this for on-prem until we actually need to (when this package contains assets). We can also explore alternative solutions.
This should only be considered if this would actually help eliminate some of the necessary work on ECK or other downstream components, which I'm still unclear on. For example, I think ECK will already need to handle the lack of the regular default policy, as it appears that the Elastic Agent DaemonSet support may require it. We'll need clarification from @elastic/cloud-k8s on how this works today before we can/should consider this. I just wanted to get this discussion started on this option if in the case it would allow us to move forward with this change sooner.
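For concreteness, a sketch of what that kibana.yml opt-in could look like on-prem (the name/id here are placeholders, mirroring the fuller preconfiguration example later in this thread):

xpack.fleet.agentPolicies:
  - name: Fleet Server policy
    id: fleet-server-policy
    namespace: default
    package_policies:
      - name: Fleet Server
        package:
          name: fleet_server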
@mtojek the APIs are not changing as part of this feature, however we are removing the default policies as part of setup (except for Elastic Cloud agent policy in cloud). So, in case your project relies on a default policy being present, it should be changed to create a policy first with the API, and use that. e.g.
That's a major change for elastic-package and we have to plan for it. We depend on the Compose stack, so we prefer to configure Kibana/Fleet and Fleet Server via environment variables. We use two policies: the default one for integrations and the default one for the Fleet Server.
We REALLY wouldn't like to end up intercepting the Compose booting procedure with extra API calls.
BTW we can extract this discussion into a separate issue, dedicated to elastic-package.
@mtojek as discussed on chat, instead of the API you can use preconfiguration in the `kibana.yml` config file: https://www.elastic.co/guide/en/kibana/master/fleet-settings-kb.html
xpack.fleet.packages:
- name: system
version: latest
- name: elastic_agent
version: latest
- name: fleet_server
version: latest
xpack.fleet.agentPolicies:
- name: Agent policy 1
description: Agent policy 1
is_managed: false
namespace: default
monitoring_enabled:
- logs
- metrics
package_policies:
- name: system-1
id: default-system
package:
name: system
- name: Fleet Server policy preconfigured
id: fleet-server-policy
namespace: default
package_policies:
- name: Fleet Server
package:
name: fleet_server
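Note that preconfigured policies like this are applied when Fleet setup runs (in 8.x this happens automatically on Kibana startup, as far as I know), so the Compose boot shouldn't need any extra API calls.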
Thank you, @juliaElastic and @joshdover, for the guidance. I created the issue and linked it here as a blocker.
@mtojek thanks, one correction on your issue, it doesn't impact 8.0.0-SNAPSHOT, only 8.1 if we merge as planned.
My bad, you're right here, thanks for the correction! I'm wondering if we can already apply this change for 8.0.0-SNAPSHOT branch. Do you think it will work too?
@mtojek yes, you can add new preconfigured policies on 8.0.0 or main branch, it just means new policies will be created next to the default ones.
Thanks for the heads up.
Some thoughts:
- Today, the ECK operator doesn't communicate directly with Kibana/ES for Fleet configuration - we rely on the `elastic-agent container` command. If at any point we need to add an API call from the operator, I'd consider this a significant change (building a new channel, obtaining/handling credentials, retry/failure policies, bubbling the results up to the user, etc.).
Good to know. I think we can track the more ideal solution as a separate enhancement that doesn't need to block this.
- Was a setting to opt in (or opt out) of this new behavior considered? If the operator (or user) is responsible for putting the default policies in the config, the same party will be responsible for their maintenance (what if fields/defaults change between Kibana versions or versions of the integration?). It seems it would be easier to maintain those "closer" to the changes.
The goal of this change is to change the default behavior, so we don't want to require an opt-in. Our existing `kibana.yml` configuration is the officially supported way to opt in to the old behavior (and bootstrap Fleet configuration more generally). Configuration keys here are considered part of Kibana's stable API and we are careful not to introduce unintended breaking changes.
- The only issue I see is that this change makes our config not truly declarative, i.e. a working config from a cluster created before this change will keep working, but if users try to duplicate/recreate a cluster using it, they will run into issues.
- I think the best course of action would be to document this for the ECK 2.0 release and modify our config examples accordingly - we now have cloud-on-k8s#5262 (Remove dependency on Fleet Default policy) to track this, so we can discuss it there.
This makes sense to me, since ECK 2.0 is going out with Stack 8.0. Though this default behavior isn't slated to land until 8.1, users should be able to start supplying this config in 8.0 and it would continue working in 8.1+. I think this is a good solution to the duplicate/recreate cluster scenario you mentioned above.
With a viable path forward for ECK, I think the only critical blocker at this point is the Azure VM extension: https://github.com/elastic/azure-vm-extension/issues/11. Still waiting to hear back from the team here on whether they will have capacity to complete this for 8.1 cc @masci @ravikesarwani
It seems the other internal usages, e2e-testing (@juliaElastic could you open an issue for this project?) and elastic-package, should be able to easily adopt the preconfiguration solution without any adverse effects or large effort.
@joshdover created this issue for e2e-testing: https://github.com/elastic/e2e-testing/issues/2039 cc @mdelapenya
created one more request for observability-test-environments @kuisathaverat
Is there a way to get the YAML equivalent of the UI configuration? For example, for the APM integration.
I've started to configure this, and it makes the use of Fleet more complicated than it was. IMHO, having to add a policy when you are going to use the default settings is hard to justify. For example, if you want to add an APM integration, this could be the configuration:
xpack.fleet.agents.enabled: true
xpack.fleet.packages:
- name: system
version: latest
- name: elastic_agent
version: latest
- name: apm
version: latest
- name: fleet_server
version: latest
xpack.fleet.agentPolicies:
- name: Apm Agent policy
id: apm-agent-policy
description: Agent policy with APM and System logs and metrics enabled
is_managed: false
namespace: default
monitoring_enabled:
- logs
- metrics
package_policies:
- name: system-1
id: default-system
package:
name: system
- name: apm-1
id: default-apm
package:
name: apm
      # Here you have to add all the APM configuration; I have no idea what it looks like in YAML, but it looks like a long piece of YAML
- name: Fleet Server policy preconfigured
id: fleet-server-policy
namespace: default
package_policies:
- name: Fleet Server
package:
name: fleet_server
inputs:
- type: fleet-server
keep_enabled: true
vars:
- name: host
value: 0.0.0.0
frozen: true
- name: port
value: 8220
frozen: true
Can we simplify the configuration by allowing users to choose to create the default policy? By adding a couple of parameters (one required, the other optional) we could create those policies, and the configuration would be much easier:
xpack.fleet.agents.enabled: true
xpack.fleet.packages:
- name: system
version: latest
- name: elastic_agent
version: latest
- name: apm
version: latest
create_default:
name: APM default policy
id: apm-default-policy
- name: fleet_server
version: latest
create_default:
name: Fleet Server default policy
      id: fleet-server-default-policy
Another alternative is a configuration generator in the UI that lets you create a policy with all the integrations you need and then generates the YAML, which I would only have to copy and paste into the kibana.yml file. It would be a long piece of YAML, but at least I would not have to create it manually. It would still be hard to read and edit, though, with no way to validate that the syntax is correct other than launching Kibana, which is not a quick operation.
Can we simplify the configuration by allowing users to choose to create the default policy? By adding a couple of parameters (one required, the other optional) we could create those policies, and the configuration would be much easier.
This feels like a pretty reasonable request and would mirror the improvements we did on the API in https://github.com/elastic/kibana/pull/119739. @juliaElastic WDYT, should we open an issue for this?
@kuisathaverat Good point that adding the defaults to preconfig is not that simple.
How is the APM integration added currently? The Fleet change only removes the default policies, APM was not included in defaults before.
I like the idea of `create_default` in preconfig (we need a way to distinguish a normal policy from a fleet server policy); should we raise this for 8.2?
How is the APM integration added currently? The Fleet change only removes the default policies, APM was not included in defaults before.
This is a good question, I don't see how this change impacts setting up any APM policies.
I like the idea of `create_default` in preconfig (we need a way to distinguish a normal policy from a fleet server policy); should we raise this for 8.2?
We'll have to see on timing, but I think an issue to start collecting use cases would be helpful.
Created this: https://github.com/elastic/kibana/issues/124030 We should be careful with the naming, since we want to move away from the concept of default policies.
How is the APM integration added currently? The Fleet change only removes the default policies, APM was not included in defaults before.
This is a good question, I don't see how this change impacts setting up any APM policies.
To start the APM server now, you use an Elastic Agent and the APM integration. It is true that the current default policy does not have the APM integration, but having a way to create a policy for an integration with the default values in a single config step simplifies the configuration. It is not directly related to removing the default policies, but it could reuse the work in https://github.com/elastic/kibana/issues/124030 The use case could be something like:
xpack.fleet.packages:
- name: elastic_agent
version: latest
- name: apm
version: latest
create_default:
name: APM default policy
id: apm-default-policy
This would create a policy `apm-default-policy` ready to use with an Elastic Agent. I am thinking of environments like our test environments, which are configured as code for everything.
I see, so this use case is similar to what we added on UI to create a new agent policy in Add integration flow.
I think we have to consider that users may want to add multiple packages to an agent policy with preconfig (e.g. apm and system). It could be represented as something like this:
xpack.fleet.agentPolicies:
- name: Agent policy 1
id: agent-policy-1
add_packages: system, apm
Instead of introducing a new concept, can we allow the existing preconfiguration to use default inputs, like we now do in the package policy API:
xpack.fleet.agentPolicies:
- name: Apm Agent policy
id: apm-agent-policy
description: Agent policy with APM and System logs and metrics enabled
is_managed: false
namespace: default
monitoring_enabled:
- logs
- metrics
package_policies:
- name: system-1
id: default-system
package:
name: system
- name: apm-1
id: default-apm
package:
name: apm
That's a good point: package_policies inputs are optional and come with default values.
Thanks all for the feedback here. Definitely agree that using the default values for policy inputs makes sense. Let's move this discussion over to the issue that Julia created (#124030) to avoid confusing folks who are following this issue, since the topic is only tangentially related.
@joshdover @jen-huang @mostlyjason We got confirmation today about the blockers, and it seems we are good to go. Are you okay with merging this feature then for 8.1? The remaining dependencies can be finished after FF.
@juliaElastic I'm comfortable moving forward with merging this for 8.1 based on those statuses. Let's be sure to be available to support any teams that need any additional help, but I know you've already been super on top of that 😄
Currently, the system, elastic agent, and fleet server packages are automatically installed. The system integration is also added to the default agent policy in Fleet. This creates problems for the onboarding flow in other solutions and products https://github.com/elastic/kibana/issues/82851. It's also not the best UX because we are adding these integrations without explicit consent from the user.
What integrations should be installed by default, if any? If not, when should they get installed? How should this work in self-managed clusters, and on cloud where the APM & Fleet node is added by default? How should it work for standalone agents?
In particular, the system integration should be a more explicit choice. That also means that we won't install the integration by default, but instead when the user takes action to install it. It'd still be nice to encourage users to add these packages as a useful way to get started or monitor hosts, but it should be their choice. Additionally, it's not obvious that the system integration is the one users should install, since we have separate integrations for Linux and Windows. Some proactive prompting should help users get started.
Potential places to add the system integration include when the user is adding their first agent in Fleet (potentially creating their first agent policy instead of using a default), adding their first integration (should we add it here or not), and creating a new agent policy.
Here it is not an explicit choice:
Here it is an explicit choice:
Also, users are prevented from removing or reinstalling these default packages: https://discuss.elastic.co/t/reinstall-system-integration-assets/283140. If we remove their special status, then users should have full control over them.
Another use case to consider is if the user adds the system integration from the Integrations app. In this case, the user probably does not intend to add a duplicate integration policy. We might not want to recommend adding the system integration to a new agent policy if the user is already adding one.
UI automation tests: #121436
Related:
Blockers:
CC @dborodyansky @mukeshelastic