Closed dherder closed 10 months ago
@marko-lisica I recorded a Loom w/ feedback here: https://www.loom.com/share/b79dc2a639634b54ab6982af00810aa9?sid=798b198e-e67c-454d-8a06-199cb1f313f5
Let's take a quick look at it during tomorrow's design review.
Hey @marko-lisica a thought just occurred to me, how can we make this a simpler change that benefits everyone w/o adding to our config surface area?
I think we just update the default script timeout to 5 minutes.
If you agree, can you please update the story to this? If you want to discuss further, please bring this to design review. Thanks!
@noahtalerman Do you mean to have 5 minutes by default and cut the possibility to configure different timeout for now?
I think this makes sense. We can always improve this later if we find that it's necessary to increase timeout even more.
@marko-lisica exactly. We can make the configuration improvement later.
@marko-lisica Looks good!
I think we also want to show the UI equivalent for the error if the Fleet server doesn’t hear from the agent for 5 mins, right?
Here’s the current version of this UI I found in the old scripts story: https://www.figma.com/file/bxkbHnOnFjE7epQj1d3yqt/%239537-Script-execution%3A-Manage-and-run-saved-scripts?type=design&node-id=210%3A1824&mode=design&t=QW1HdjF8UTOHgV61-1
@noahtalerman That's right. Added this update to Figma.
Hey team! Please add your planning poker estimate with Zenhub @gillespi314 @roperzh @mna
Following discussion at standup:
From the script execution point-of-view, fleetd will now allow 5 minutes to run the script (instead of timing-out after 30 seconds). This is not an issue.
But for the /scripts/run/sync endpoint, which runs a script and waits for a response, we are concerned that waiting 5 minutes may cause issues at the server level. Because we can't set the timeout on a per-endpoint basis, every Fleet endpoint would now be granted a 5-minute timeout, which is quite long.
If we really want fleetctl users who run a synchronous script to wait the full 5 minutes, what we suggest instead is to have the /scripts/run/sync API endpoint time out after 1 minute (as it does currently), but have fleetctl keep polling for results after that, for up to 5 minutes in total, and only then give up (the same way it stops today after 1 minute).
In addition, it would then be able to print messages to the user at regular intervals, letting them know that it is still waiting for the script's results, so in that way it is even better than if the API endpoint had a 5 minute timeout and fleetctl was just waiting on its response.
The only noticeable difference this would make is for API users that call the /sync endpoint. It would time out after 1 minute (as it does today) even though the script might run for 5 minutes. API users would need to account for that by polling for the results themselves if they wanted to wait longer.
@noahtalerman Regarding Martin's message above. This slightly changes the UX for CLI and API users. Should we bring this back to emergency drafting and discuss it during DR tomorrow?
@marko-lisica @noahtalerman For completeness' sake, just want to add that we do have a way forward to implement the 5-minute timeout just for that endpoint after all (new with Go 1.20, see my comment here: https://fleetdm.slack.com/archives/C03C41L5YEL/p1703007198937929?thread_ts=1703003087.228719&cid=C03C41L5YEL) but as @gillespi314 mentions in that slack thread, we may still want to add periodic feedback to the user while waiting for a response, and the shorter polling time still has some advantages as she mentions.
@georgekarrv to meet w/ @sharon-fdm to see if MDM team can get some help from Endpoint ops.
@georgekarrv @sharon-fdm sorry for the miscommunication here, Luke asked me to look into this (Slack thread) and this is already completed. I'm moving back to MDM if that's fine.
Able to consistently verify one of the two 5-minute timeout errors:
Error: Fleet hasn’t heard from the host in over 5 minutes. Fleet doesn’t know if the script ran because the host went offline.
Error: Timeout. Fleet stopped the script after 5 minutes to protect host performance.
@dherder heads up, this story was shipped in 4.43
Scripts flowing like a stream, Five minutes brings calm to the team, Secure, serene dream.