fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com
Other
3.15k stars 431 forks source link

Increase default scripts timeout #15196

Closed dherder closed 10 months ago

dherder commented 1 year ago
## Goal | User story | |:---------------------------------------------------------------------------| | As an IT admin, | I want the script execution timeout to be 5 minutes, | so that I can run scripts that need more time to finish. ## Changes ### Product - [ ] UI changes: [Figma link](https://www.figma.com/file/BludES5pB2bgYimFakgzKK/%2315196-Increase-default-scripts-timeout?type=design&node-id=304-65&mode=design&t=9aSoRKxVC9NuSNNZ-11) - [ ] CLI usage changes: [Figma link](https://www.figma.com/file/BludES5pB2bgYimFakgzKK/%2315196-Increase-default-scripts-timeout?type=design&node-id=357-217&mode=design&t=9aSoRKxVC9NuSNNZ-11) - [ ] Outdated documentation changes: [Docs PR](https://github.com/fleetdm/fleet/pull/15483) ### Engineering - [ ] Database schema migrations: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

QA

Manual testing steps

  1. Verify UI and CLI copy updates against Figma
  2. Validate scripts timeout is extended to 5 minutes when running from both UI and CLI
  3. Validate Scripts output accurately reflects timeout behavior
  4. Verify no regression in scripts behavior

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming succesful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming succesful completion of QA.
noahtalerman commented 11 months ago

@marko-lisica I recorded a Loom w/ feedback here: https://www.loom.com/share/b79dc2a639634b54ab6982af00810aa9?sid=798b198e-e67c-454d-8a06-199cb1f313f5

Let's take a quick look at it during tomorrow's design review.

noahtalerman commented 11 months ago

Hey @marko-lisica a thought just occurred to me, how can we make this a simpler change that benefits everyone w/o adding to our config surface area?

I think we just update the default script timeout to 5 minutes.

If you agree, can you please update the story to this? If you want to discuss further, please bring this to design review. Thanks!

marko-lisica commented 11 months ago

@noahtalerman Do you mean to have 5 minutes by default and cut the possibility to configure different timeout for now?

I think this makes sense. We can always improve this later if we find that it's necessary to increase timeout even more.

noahtalerman commented 11 months ago

Do you mean to have 5 minutes by default and cut the possibility to configure different timeout for now?

@marko-lisica exactly. We can make the configuration improvement later.

noahtalerman commented 11 months ago

@marko-lisica Looks good!

I think we also want to show the UI equivalent for the error if the Fleet server doesn’t hear about from the agent for 5 mins, right?

Screenshot 2023-12-11 at 6 58 01 PM

Here’s the current version of this UI I found in the old scripts story: https://www.figma.com/file/bxkbHnOnFjE7epQj1d3yqt/%239537-Script-execution%3A-Manage-and-run-saved-scripts?type=design&node-id=210%3A1824&mode=design&t=QW1HdjF8UTOHgV61-1

marko-lisica commented 11 months ago

I think we also want to show the UI equivalent for the error if the Fleet server doesn’t hear about from the agent for 5 mins, right?

@noahtalerman That's right. Added this update to Figma.

georgekarrv commented 11 months ago

Hey team! Please add your planning poker estimate with Zenhub @gillespi314 @roperzh @mna

mna commented 11 months ago

Following discussion at standup:

From the script execution point-of-view, fleetd will now allow 5 minutes to run the script (instead of timing-out after 30 seconds). This is not an issue.

But for the /scripts/run/sync endpoint that runs scripts and waits for a response, we are concerned that waiting 5 minutes may cause issues at the server-level, because we can't set the timeout on a per-endpoint basis, every Fleet endpoint would now be granted a 5 minute timeout which is quite long.

If we really want the fleetctl users that run a synchronous script to wait the full 5 minutes, what we suggest instead would be to have the /scripts/run/sync API endpoint time-out after 1 minute (as it does currently) but fleetctl would keep polling for results at that point, up until 5 minutes and give up at that point (the same way it stops today after 1 minute).

In addition, it would then be able to print messages to the user at regular intervals, letting them know that it is still waiting for the script's results, so in that way it is even better than if the API endpoint had a 5 minute timeout and fleetctl was just waiting on its response.

The only noticeable difference this would make is for API users that call the /sync endpoint. It would timeout after 1 minute (as it does today) even though the script might run for 5 minutes. Script users would need to account for that by polling for the results if they wanted to.

marko-lisica commented 11 months ago

@noahtalerman Regarding Martin's message above. This slightly changes the UX for CLI and API users. Should we bring this back to emergency drafting and discuss it during DR tomorrow?

mna commented 11 months ago

@marko-lisica @noahtalerman For completeness' sake, just want to add that we do have a way forward to implement the 5-minute timeout just for that endpoint after all (new with Go 1.20, see my comment here: https://fleetdm.slack.com/archives/C03C41L5YEL/p1703007198937929?thread_ts=1703003087.228719&cid=C03C41L5YEL) but as @gillespi314 mentions in that slack thread, we may still want to add periodic feedback to the user while waiting for a response, and the shorter polling time still has some advantages as she mentions.

noahtalerman commented 10 months ago

@georgekarrv to meet w/ @sharon-fdm to see if MDM team can get some help from Endpoint ops.

roperzh commented 10 months ago

@georgekarrv @sharon-fdm sorry for the misscomunication here, Luke asked me to look into this (slack thread) and this is already completed. I'm moving back to MDM if that's fine

sabrinabuckets commented 10 months ago

Able to consistently verify one of the two 5-minute timeout errors:

Error: Fleet hasn’t heard from the host in over 5 minutes. Fleet doesn’t know if the script ran because the host went offline.

Error: Timeout. Fleet stopped the script after 5 minutes to protect host performance.

noahtalerman commented 10 months ago

@dherder heads up, this story was shipped in 4.43

fleet-release commented 10 months ago

Scripts flowing like a stream, Five minutes brings calm to the team, Secure, serene dream.