Open nonpunctual opened 5 months ago
CrowdStrike's PowerShell install script is 520 LOC.
@nonpunctual, @dherder, @harrisonravazzolo, and @mikermcneil I filed a separate "Increase character limit for saved scripts" story and pulled it into the current design sprint here: #16668
This way, we can move quickly on the script character limit, which is blocking customers' workflows.
I updated this story to only cover increasing the timeout limit for scripts. Do we know of any customers that are blocked by the 5 minute timeout? What scripts are they trying to run?
@noahtalerman Thanks. I feel like we are going around in circles a bit on this.
I think we should assume that 5m isn't enough the same way we are going to assume that 10000 characters isn't enough. In my opinion, there should not be arbitrary limits on the scripts we allow customers to run.
Other products don't seem to have these limitations. Ideally, I think we should address the design choices that led us to them, but for now, adjusting these limits is good for customer needs.
@nonpunctual it makes sense to bump the timeout. Bringing this to feature fest.
> design choices that led us to them
For timeouts, we want to save the users from themselves. If a script never ends then this will prevent all other scripts from running.
Then we perhaps need to reconsider the queuing feature.
It's good to be prescriptive & opinionated & drive the user towards certain behaviors.
But on the admin / user side I don't know if the script feature is the place for it. There is a sense in which we have to assume the admin knows what they are doing or what they intend to do & we should let them do it.
I like the idea of the queue but I believe there is a fundamental misunderstanding of how scripts are used in this design. If we make every script contingent on every other script & in my queue are scripts that have nothing to do with each other, we are creating contingency for no reason.
Heads up @nonpunctual, this feature request was brought to feature fest on 2024-02-15 and wasn't prioritized for the current design sprint.
Hey @harrisonravazzolo, curious to get your feedback on this one.
Have y'all run into any scenarios in which a script you were trying to run was cut off after 5 minutes? If yes, what was the script?
Hey @noahtalerman - haven't run into this one yet, as the script we want to run is over the 10k char limit. This will be the first time we use the script feature in Fleet.
Hey @nonpunctual heads up, this story was prioritized during feature fest.
Aiming to ship an improvement in the next 6 weeks.
UPDATE: We chose to push this out of the design sprint. Why? We haven't heard of a customer running into issues w/ the 5 min limit yet. Once we get this feedback we can adjust. We want evidence that 5 minutes is too short before we make changes.
FYI @nonpunctual
Ok. I feel like we are going to be forcing Support to handle this problem. I don't think we have a representative sample of customers using the script feature to make this decision based on customer feedback.
I feel pretty strongly that this design for scripts should be revisited. Thanks.
@nonpunctual, please let me know right away when a customer runs into the 5 minute timeout so we can prioritize changing the product.
We decided to prioritize other feature requests over this one because no customers are feeling the pain (yet).
Once they do, we'll follow up quickly with an improvement.
@noahtalerman comment from customer-flacourtia:
Customer-preston has also reported running into the 5m limit.
If we enable longer script execution times as a synchronous scripting option, we'll also have to extend load-balancer timeouts to slightly exceed the script timeout. For example, in cloud we use 305 seconds for the existing 5-minute timeout.
This would not be needed if an asynchronous/callback method were leveraged for longer-running scripts.
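To illustrate the asynchronous idea, here is a minimal polling sketch: the client asks repeatedly whether a result is ready, so no single request has to outlive the load-balancer timeout. `poll_until_done` and `result_ready` are hypothetical names for this example, and `result_ready` just checks for a marker file as a stand-in for a real "is the script result available?" request. This is not how Fleet implements it.

```shell
# poll_until_done MAX_WAIT INTERVAL CHECK_CMD [ARGS...]
# Calls CHECK_CMD every INTERVAL seconds until it succeeds (exit 0) or
# MAX_WAIT seconds have passed. Each check is a short, separate call.
poll_until_done() {
    max_wait=$1
    interval=$2
    shift 2

    waited=0
    while [ "$waited" -lt "$max_wait" ]; do
        if "$@"; then
            return 0          # result is available
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1                  # gave up: no result within MAX_WAIT
}

# Hypothetical stand-in for "ask the server whether the result is ready".
# Here it simply succeeds once a marker file exists.
result_ready() {
    [ -f "$1" ]
}
```

For example, `poll_until_done 1200 5 result_ready /tmp/script.done` would check every 5 seconds for up to 20 minutes.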
related: Timeout script remains in upcoming activities without displaying and BLOCKS other scripts to be executed https://github.com/fleetdm/fleet/issues/19059
customer-preston has again reported issues with the 5m timeout on Windows.
Want to reiterate emphasis on this issue per meeting with customer-preston on 2024-05-22.
customer-preston:
So, @noahtalerman based on your comment from Mar 8, do the issues raised by customers regarding this feature since then clarify this issue? In my opinion, based on the feedback, this could probably be converted to a bug. Thanks.
@noahtalerman @lukeheath added dogfood label to this issue per @spokanemac comments. Thanks.
We could capture the PID.
pid=$!
So we could issue a kill command if needed.
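A rough sketch of that idea in plain shell, assuming the script is launched as a background job: record `$!`, check on the process once per second, and send it a kill if it outlives the limit. `run_with_timeout` is a hypothetical helper for illustration, not something the agent provides, and returning 124 on timeout is only a convention borrowed from GNU `timeout`.

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]
# Runs COMMAND in the background, records its PID via $!, and kills it
# if it is still running after SECONDS. Returns the command's own exit
# status, or 124 if it had to be killed (an assumption of this sketch).
run_with_timeout() {
    limit=$1
    shift

    "$@" &
    pid=$!                      # PID of the background job

    elapsed=0
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            kill "$pid" 2>/dev/null     # default signal is SIGTERM
            wait "$pid" 2>/dev/null     # reap the killed process
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done

    wait "$pid"                 # collect the real exit status
}
```

Note that this only signals the direct child. A script that spawns its own children would need process-group handling, which is left out here.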
@nonpunctual this is still not classified as a bug even if users 'have issues with the 5m timeout'.
We understand that this is a frustrating experience when you are doing scripts that are longer than 5m but this is still working as intended and is not a bug. Bugs are a failure to execute a specific workflow that is supported. As of today scripts longer than 5m are not supported so this is still a story requesting the timeout be updated.
From customer-preston:
- doing script-based app management
- sometimes download multi-GB apps
- we use one script to install all customer apps
- Our app-install scripts timeout
- We consider it still running, since it's in the queue, so we do not attempt to run it again
- Any other script (for recovery key, MB agent, ...) is queued and NOT RUN since the queue can't be purged
- This is obviously a HUGE problem, since App Management is a key part of any MDM value prop, but especially on SMBs + this blocks any other form of script exec
- We really need solutions from you on this, starting with but not limited to: More permissive rules around script run time
@marko-lisica do you think this will make it into design review to be ready for estimation and get into 4.53? thanks!
Hey @zayhanlon, I'm going to bring this story to design review tomorrow. I believe we should be able to get it through so we can estimate it on Wednesday.
Hey @jacobshandling, re: our conversation yesterday, we decided to use seconds instead of minutes in error messages in the UI and CLI. Later we can improve this if necessary.
@lukeheath @noahtalerman, I approved this story going into the RC since we have customers waiting for it. It is ETAed to end of today. We should probably add a P2 label so as not to break the process (it will have no real effect since we will wrap it up anyway). TMWYT
Goal
`apt-update` command on Linux).
Context
Admins do not wait for scripts to run. There should be no conflation around the idea of Fleet doing something "bad" when or if there is not an instantaneous script result from a host.
See: https://github.com/fleetdm/fleet/issues/9583#issuecomment-1924196025
I think we should look at how the script features are implemented, but, until that work can be done the proposal is to increase limits:
https://docs.google.com/document/d/1Znyp2a9qcM9JdYHrzLudvcPwEdhnCg7RiKi22s8yGWw/edit
Changes
Product
`PATCH /api/v1/fleet/config`, `POST /api/v1/fleet/spec/teams`, and `POST /api/v1/fleet/teams/:id/agent_options` to support `script_execution_timeout` under `agent_options`
`script_execution_timeout`. #19822
Engineering
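For illustration only, a request against the global endpoint named under Product might be built like the sketch below. Several things here are assumptions rather than confirmed behavior: that the value is an integer number of seconds, that it sits directly under `agent_options`, that the API accepts a bearer token, and the `FLEET_URL` and `FLEET_API_TOKEN` environment variables, which are hypothetical. It is also worth checking whether sending a partial `agent_options` object replaces existing agent options before trying this anywhere that matters.

```shell
# build_timeout_payload SECONDS
# Prints a JSON body that sets script_execution_timeout under agent_options.
# ASSUMPTION: integer seconds, placed directly under agent_options.
build_timeout_payload() {
    case $1 in
        ''|*[!0-9]*)
            echo "timeout must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf '{"agent_options": {"script_execution_timeout": %s}}\n' "$1"
}

# set_global_timeout SECONDS
# Sends the payload to the global config endpoint named in this story.
# FLEET_URL and FLEET_API_TOKEN are hypothetical environment variables.
set_global_timeout() {
    body=$(build_timeout_payload "$1") || return 1
    curl --silent --show-error --fail \
        -X PATCH "$FLEET_URL/api/v1/fleet/config" \
        -H "Authorization: Bearer $FLEET_API_TOKEN" \
        -H "Content-Type: application/json" \
        --data "$body"
}
```

`build_timeout_payload 600` prints the body on its own, which makes it easy to inspect before `set_global_timeout 600` sends anything.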
QA
Risk assessment
Manual testing steps
`script_execution_timeout` in agent options (global and team). Note that there is no minimum value, but I would avoid 0 or 1 to be safe. The main purpose is to extend beyond 5 minutes.
`sleep 1200` in your script, the response to the script will not be sent by the agent until the sleep command completes; however, no further commands after that sleep command will execute. For testing I used a script which looped `sleep 1`, which had the desired timeout effect.
`fleetctl run-script` command does not time out, even on longer (5+ min) scripts. This could happen on previous fleetctl builds because there's a timeout on the load balancer or on the Fleet server. New fleetctl doesn't wait for the result in a single request, but instead polls the /scripts/results/ endpoint for a result to the script.
Testing notes
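A test script along the lines described in the manual testing steps, looping over `sleep 1` rather than issuing one long `sleep 1200`, could be as simple as the sketch below. `long_running` is a hypothetical name and this is not the exact script used for QA. It targets macOS/Linux hosts; Windows would need a PowerShell equivalent.

```shell
# long_running SECONDS
# Stays busy for roughly SECONDS seconds by looping over `sleep 1`
# instead of one long sleep, printing progress as it goes.
long_running() {
    total=$1
    i=0
    while [ "$i" -lt "$total" ]; do
        sleep 1
        i=$((i + 1))
        echo "tick $i of $total"
    done
    echo "finished"    # only reached if the script was not cut off
}

# Example: run for 20 minutes to exceed a 5-minute timeout.
# long_running 1200
```

If the configured timeout works as intended, the output should stop partway through the ticks and never reach the final `finished` line.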
Confirmation