Increase timeout limit for scripts

nonpunctual commented 5 months ago

Goal

User story
As an IT admin,
I want to increase timeout limit for scripts
so that I can run larger scripts that take more than 5 minutes (e.g. scripts that download files like `apt-update` command on Linux).

Context

Requestor(s): @nonpunctual
Product designer: @marko-lisica

Admins do not wait for scripts to run. There should be no conflation around the idea of Fleet doing something "bad" when or if there is not an instantaneous script result from a host.

See: https://github.com/fleetdm/fleet/issues/9583#issuecomment-1924196025

I think we should look at how the script features are implemented, but, until that work can be done the proposal is to increase limits:

script character limit = not less than 1MB of data
script timeout = not less than 30m

https://docs.google.com/document/d/1Znyp2a9qcM9JdYHrzLudvcPwEdhnCg7RiKi22s8yGWw/edit

Changes

Product

[ ] UI changes: Figma link
[ ] CLI usage changes: Figma link
[ ] REST API changes: Update PATCH /api/v1/fleet/config, POST /api/v1/fleet/spec/teams and POST /api/v1/fleet/teams/:id/agent_options to support script_execution_timeout under agent_options
[ ] Fleet's agent (fleetd) changes: New agent option script_execution_timeout. #19822
[ ] Permissions changes: Admin and GitOps (global and team) can edit agent options
[ ] Outdated documentation changes: Add note that "Run live script" endpoint is deprecated and add a link to agent configuration to the Scripts page in Using Fleet.

Engineering

[ ] Database schema migrations: TODO
[ ] Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

set script_execution_time in agent options (global and team). Note that there are no minimum values, but I would avoid 0 or 1 to be safe. The main purpose is to extend beyond 5 minutes.
when testing note there is a released bug (only confirmed on macOS so far) where the timeout does not take effect until the active command in the script finishes. For example if you have a sleep 1200 in your script, the response to the script will not be sent by the agent until the sleep command completes, however no futher commands after that sleep command will execute. For testing I used a script which looped sleep 1, which had the desired timeout effect.
ensure the fleetctl run-script command does not timeout, even on longer (5+ min) scripts. This could happen on previous fleetctl builds because theres a timeout on the load balancer or by the fleet server. New fleetctl doesn't wait for the result in a single request, but instead polls the /scripts/results/ endpoint for a result to the script.
error codes should follow what is in the Figma

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

harrisonravazzolo commented 5 months ago

Crowdstrikes powershell install script is 520 loc.

noahtalerman commented 5 months ago

@nonpunctual, @dherder, @harrisonravazzolo, and @mikermcneil I filed a separate "Increase character limit for saved scripts" story and pulled it into the current design sprint here: #16668

This way, we can move quickly on the script character limit which is blocking customers workflows.

I updated this story to only cover increasing the timeout limit for scripts. Do we know of any customers that are blocked by the 5 minute timeout? What scripts are they trying to run?

nonpunctual commented 5 months ago

@noahtalerman Thanks. I feel like we are going around in circles a bit on this.

I think we should assume that 5m isn't enough the same way we are going to assume that 10000 characters isn't enough. In my opinion, there should not be arbitrary limits on the scripts we allow customers to run.

Other products don't seem to have these limitations. Ideally, I think we should address the design choices that led us to them. but, for now, adjusting these limits is good for customer needs.

noahtalerman commented 5 months ago

@nonpunctual it makes sense to bump the timeout. Bringing this to feature fest.

design choices that led us to them

For timeouts, we want to save the users from themselves. If a script never ends then this will prevent all other scripts from running.

nonpunctual commented 5 months ago

Then we perhaps need to reconsider the queuing feature.

It's good to be prescriptive & opinionated & drive the user towards certain behaviors.

But on the admin / user side I don't know if the script feature is the place for it. There is a sense in which we have to assume the admin knows what they are doing or what they intend to do & we should let them do it.

I like the idea of the queue but I believe there is a fundamental misunderstanding of how scripts are used in this design. If we make every script contingent on every other script & in my queue are scripts that have nothing to do with each other, we are creating contingency for no reason.

noahtalerman commented 4 months ago

Heads up @nonpunctual , this feature request was brought to feature fest on 2024-02-15 and wasn't prioritized for the current design sprint.

noahtalerman commented 4 months ago

Hey @harrisonravazzolo, curious to get your feedback on this one.

Have y'all run into any scenarios in which a script you were trying to run was cut off after 5 minutes? If yes, what was the script?

harrisonravazzolo commented 4 months ago

Hey @noahtalerman - haven't run into this one yet as the script we want to run is over the 10k char limit. This will be the first time we use the script feature in Fleet

noahtalerman commented 4 months ago

UPDATE: We chose to push this out of the design sprint. Why? We haven't heard of a customer running into issues w/ the 5 min limit yet. Once we get this feedback we can adjust. We want evidence that 5 minutes is too short before we make changes.

Hey @nonpunctual heads up, this story was prioritized during feature fest.

Aiming to ship an improvement in the next 6 weeks.

noahtalerman commented 4 months ago

UPDATE: We chose to push this out of the design sprint. Why? We haven't heard of a customer running into issues w/ the 5 min limit yet. Once we get this feedback we can adjust. We want evidence that 5 minutes is too short before we make changes.

FYI @nonpunctual

nonpunctual commented 4 months ago

Ok. I feel like we are going to be forcing Support to handle this problem. I don't think we have a representative sample of customers using the script feature to make this decision based on customer feedback.

I feel pretty strongly that this design for scripts should be revisited. Thanks.

noahtalerman commented 4 months ago

@nonpunctual, please let me know right away when a customer runs into the 5 minute timeout so we can prioritize changing the product.

We decided to prioritize other feature requests over this one because no customers are feeling the pain (yet).

Once they do, we'll follow up quickly with an improvement.

nonpunctual commented 2 months ago

@noahtalerman comment from customer-flacourtia:

Customer-preston has also reported running into the 5m limit.

rfairburn commented 2 months ago

If we enable longer script execution times as a synchronous scripting option, we'll also have to make sure to extend load-balancer timeouts to slightly exceed this timeout. For example 305 seconds on existing 5-minute timeouts is what we use in cloud.

This would not be needed if an asynchronous/callback method were leveraged for longer-running scripts.

nonpunctual commented 1 month ago

related: Timeout script remains in upcoming activities without displaying and BLOCKS other scripts to be executed https://github.com/fleetdm/fleet/issues/19059

customer-preston has again reported issues with 5m timeout on WIndows.

nonpunctual commented 1 month ago

Want to reiterate emphasis on this issue per meeting with customer-preston 20240522.

nonpunctual commented 1 month ago

customer-preston:

nonpunctual commented 1 month ago

So, @noahtalerman based on your comment from Mar 8, do the issues raised by customers regarding this feature since then clarify this issue? In my opinion, based on the feedback, this could probably be converted to a bug. Thanks.

nonpunctual commented 1 month ago

@noahtalerman @lukeheath added dogfood label to this issue per @spokanemac comments. Thanks.

spokanemac commented 1 month ago

We could capture the PID. pid=$! So we could issue a kill command if needed.

georgekarrv commented 1 month ago

@nonpunctual this is still not classified as a bug even if users 'have issues with the 5m timeout'.

We understand that this is a frustrating experience when you are doing scripts that are longer than 5m but this is still working as intended and is not a bug. Bugs are a failure to execute a specific workflow that is supported. As of today scripts longer than 5m are not supported so this is still a story requesting the timeout be updated.

nonpunctual commented 1 month ago

From customer-preston:

doing script-based app management sometimes download multi-GB apps we use one script to install all customer apps Our app-install scripts timeout We consider it still running, since it's in the queue, so we do not attempt to run it again Any other script (for recovery key, MB agent, ...) is queued and NOT RUN since the queue can't be purged This is obviously a HUGE problem, since App Management is a key part of any MDM value prop, but especially on SMBs + this blocks any other form of script exec We really need solutions from you on this, starting with but not limited to: More permissive rules around script run time

nonpunctual commented 1 month ago

https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/

zayhanlon commented 4 weeks ago

@marko-lisica do you think this will make it into design review to be ready for estimation and get into 4.53? thanks!

marko-lisica commented 4 weeks ago

Hey @zayhanlon, I'm going to bring this story to design review tomorrow. I believe we should be able to get it through so we can estimate it on Wednesday.

marko-lisica commented 1 week ago

Hey @jacobshandling, re: our conversation yesterday, we decided to use seconds instead of minutes in error messages in the UI and CLI. Later we can improve this if necessary.

sharon-fdm commented 5 days ago

@lukeheath @noahtalerman , I approved this story going into the RC since we have customers waiting for it. It is ETAed to end of today. We should probably add a P2 label so not to break the process (will have no real effect since we will wrap it up anyway) TMWYT

fleetdm / fleet