elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Kubernetes manifest] Use unique identifier for the state file path #5187

Open tetianakravchenko opened 3 months ago

tetianakravchenko commented 3 months ago

Describe the enhancement:

In the manifest, the elastic-agent-state volume uses a predefined hostPath:

        # Mount /var/lib/elastic-agent-managed/kube-system/state to store elastic-agent state
        # Update 'kube-system' with the namespace of your agent installation
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate

As a result, when a customer removes the installation with kubectl delete -f manifest.yaml and installs a new one (with a different FLEET_URL and FLEET_ENROLLMENT_TOKEN), the existing state file is reused, which leads to the following error:

"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://XXXXXX.fleet.region.aws.found.io:443 ...

What is the definition of done?

A few ideas: we could include the Fleet URL in the path, as /var/lib/elastic-agent-managed/<fleet_url>/kube-system/state (for example /var/lib/elastic-agent-managed/f437b90409bb4804b1647665fa19f7a0.fleet.us-central1.gcp.cloud.es.io/kube-system/state, or, for a local setup, /var/lib/elastic-agent-managed/fleet-server/kube-system/state). But what do we do when there is no Fleet Server? Fall back to the default, /var/lib/elastic-agent-managed/kube-system/state?
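To make the idea above concrete, here is a minimal sketch of how a per-Fleet state path could be derived, including the fallback to the current default. The function name `state_path` and the `BASE_DIR` constant are hypothetical and not part of Elastic Agent. Because hostPath is a static field, a value like this would have to be computed when the manifest is rendered, not by the running agent.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

# Base directory taken from the manifest in this issue.
BASE_DIR = PurePosixPath("/var/lib/elastic-agent-managed")


def state_path(fleet_url: Optional[str], namespace: str = "kube-system") -> str:
    """Return a state directory that is unique per Fleet Server host (hypothetical helper)."""
    host = None
    if fleet_url:
        # urlsplit only fills in hostname when a scheme or "//" prefix is present,
        # so bare hosts such as "fleet-server" are prefixed first.
        candidate = fleet_url if "//" in fleet_url else "//" + fleet_url
        host = urlsplit(candidate).hostname
    if not host:
        # No Fleet Server configured: fall back to the current default layout.
        return str(BASE_DIR / namespace / "state")
    return str(BASE_DIR / host / namespace / "state")
```

Using only the host name (without scheme or port) keeps the directory name filesystem-friendly, but it also means that moving the cluster to a new URL would produce a new, empty state directory.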

cmacknz commented 3 months ago

I think we need to treat a change in the FLEET_URL or FLEET_ENROLLMENT_TOKEN environment variables as equivalent to executing the elastic-agent enroll command.

blakerouse commented 3 months ago

@cmacknz I disagree. There are many reasons you might change those values after the Elastic Agent is already running without wanting your Elastic Agents to re-enroll. Say you are updating FLEET_URL because you just moved the cluster, or you just updated FLEET_ENROLLMENT_TOKEN as part of a security policy of rotating tokens periodically.

Would be interesting to see if we could possibly make an anonymous call to Fleet Server and determine if this is the same Fleet Server?

cmacknz commented 3 months ago

> Would be interesting to see if we could possibly make an anonymous call to Fleet Server and determine if this is the same Fleet Server?

Is just checking in, or doing anything that uses the stored API key, enough to check this?

We could make calling the enroll endpoint idempotent in some situations, perhaps by allowing an optional agent.id parameter. This would allow getting the API key of an existing agent instead of a net new one, though, which I don't love from a security perspective (edit: or the response could just not include the existing API key, so that this is only an "is an agent with this ID enrolled" check).

blakerouse commented 3 months ago

> > Would be interesting to see if we could possibly make an anonymous call to Fleet Server and determine if this is the same Fleet Server?
>
> Is just checking in, or doing anything that uses the stored API key, enough to check this?
>
> We could make calling the enroll endpoint idempotent in some situations, perhaps by allowing an optional agent.id parameter. This would allow getting the API key of an existing agent instead of a net new one, though, which I don't love from a security perspective (edit: or the response could just not include the existing API key, so that this is only an "is an agent with this ID enrolled" check).

@cmacknz I like the idempotent idea. We could just change it to return an HTTP conflict, or a specific response saying that the agent already exists, and not return the API key again.
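As a rough illustration of the idempotent enroll idea discussed above, the sketch below models an enroll call that accepts an optional agent ID and answers with a conflict status, without the API key, when that ID is already enrolled. Everything here is hypothetical: `EnrollRegistry`, its `enroll` method, and the response fields are stand-ins and do not reflect the actual Fleet Server API.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class EnrollRegistry:
    """In-memory stand-in for the server-side record of enrolled agents (hypothetical)."""

    api_keys: Dict[str, str] = field(default_factory=dict)

    def enroll(self, agent_id: Optional[str] = None) -> Tuple[int, dict]:
        """Return an (HTTP-like status, response body) pair for an enroll request."""
        if agent_id is not None and agent_id in self.api_keys:
            # Already enrolled: report a conflict and never echo the stored key.
            return 409, {"agent_id": agent_id, "message": "agent already enrolled"}
        # New agent: generate an ID if none was supplied, then issue a fresh key.
        new_id = agent_id or secrets.token_hex(16)
        api_key = secrets.token_urlsafe(32)
        self.api_keys[new_id] = api_key
        return 200, {"agent_id": new_id, "api_key": api_key}
```

With this shape, an agent that finds an existing state file could send its stored ID: a 409 would suggest the same Fleet Server still knows the agent, while a 200 would mean it has been enrolled as new.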

blakerouse commented 3 months ago

I just wanted to add a note here: if you set FLEET_FORCE=true in the container's environment, the agent will re-enroll on every restart. This doesn't actually solve this issue, but it is a workaround when you are trying to migrate from one Fleet to another.