fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Script library for macOS #9537

Closed noahtalerman closed 9 months ago

noahtalerman commented 1 year ago

Goal

User story
As an IT admin,
I want to save a script in Fleet
so that IT help desk or I can run the script on a macOS host at any time to remediate an issue or collect logs.

Requirements

Changes

UI and CLI

https://www.figma.com/file/bxkbHnOnFjE7epQj1d3yqt/%239537-Script-execution%3A-Manage-and-run-saved-scripts?type=design&node-id=2-130&mode=design

QA

Manual testing steps (WIP):

  1. ✅Verify feature is not accessible in free instance (see designs for copy)
  2. ✅Verify feature is accessible with paid license
  3. ✅UI testing
    • ✅Create/delete a script
    • ✅Verify new Scripts tab is present on Controls page
    • ✅Teams dropdown is present
    • ✅Each team has its own assigned scripts
    • ✅Scripts can be assigned to No team
    • ✅MDM is off: a prompt appears to turn it on (see designs for copy)
    • ✅MDM is on: an upload box is present (see designs for copy)
    • ✅Verify only Admins & Maintainers (including Team) have this capability
    • ✅Click upload
    • ✅Verify a script file ending in .sh can be uploaded
    • ✅Verify unable to select a file that ends in anything other than .sh
    • ✅Verify success/failure messages against the designs
    • ✅Verify uploaded scripts appear in the UI
    • ✅Download button downloads the file
    • ✅Delete button opens a confirmation modal, completing the dialog results in the script being deleted
    • ✅Running a script
    • ✅Scripts tab not present on Windows hosts
    • ✅Navigate to a host on a team with no scripts assigned, verify Scripts shows no scripts
    • ✅Navigate to a host on a team with scripts assigned
    • ✅A Scripts tab is present
    • ✅Clicking on the scripts tab shows all assigned scripts
    • ✅A status column is present
    • ✅Will display --- if never run
    • ✅Will display status (Pending/Error/Ran) if the script has attempted to run
    • ✅Validate status tooltip copy matches designs
    • ✅An Actions dropdown appears next to each script
    • ✅Options are Show details or Run
    • ✅Show details will pop a modal
      • ✅Spinner while API call is in progress
      • ✅Error page if request fails (see designs)
      • ✅On successful run, Script details modal displays the script and output (see designs)
    • ✅Can be run by:
      • ✅Admin/Team Admin
      • ✅Maintainer/Team Maintainer
      • ✅Observers (Observer+?)
    • ✅Activity feed
    • ✅Script added
    • ✅Script edited via fleetctl
    • ✅Script deleted
    • ✅Script ran
  4. ✅API
    • ✅Permissions are same as UI
  5. ✅CLI
    • ✅Available to Admin and Maintainer (including team)
    • ✅Applied via YAML config (see designs)
    • ✅Able to add/edit/delete a script
    • ✅Able to target No team and specific teams
    • ✅Error states:
      • ✅If other than #!/bin/sh
      • ✅If script added with duplicate name
      • ✅If file does not exist
      • ✅If script is > 10k characters
  6. ✅On-device
    • ✅Run on macOS hosts
    • ✅Verify script runs successfully
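
The upload and CLI error states in the checklist above can be sketched as one validation function. This is a minimal sketch in Go, not Fleet's actual code: `validateScript`, the error variables, and the error text are hypothetical placeholders (the real copy lives in the designs). It assumes a script with no shebang line is accepted, and it leaves out the "file does not exist" case, which is a filesystem check done before the contents are read.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
	"unicode/utf8"
)

// maxScriptRunes mirrors the 10,000-character limit from the checklist.
const maxScriptRunes = 10000

// Placeholder error values; the wording is illustrative, not the designed copy.
var (
	errBadExtension = errors.New("only .sh files are supported")
	errDuplicate    = errors.New("a script with this name already exists")
	errTooLong      = errors.New("script is longer than 10,000 characters")
	errBadShebang   = errors.New("script must use #!/bin/sh")
)

// validateScript is a hypothetical helper. existing holds the script names
// already saved for the target team (or for "No team").
func validateScript(name, contents string, existing map[string]bool) error {
	if filepath.Ext(name) != ".sh" {
		return errBadExtension
	}
	if existing[name] {
		return errDuplicate
	}
	if utf8.RuneCountInString(contents) > maxScriptRunes {
		return errTooLong
	}
	// Assumption: only reject when a shebang is present and names another interpreter.
	firstLine, _, _ := strings.Cut(contents, "\n")
	if strings.HasPrefix(firstLine, "#!") && strings.TrimSpace(firstLine) != "#!/bin/sh" {
		return errBadShebang
	}
	return nil
}

func main() {
	saved := map[string]bool{"collect-logs.sh": true}
	fmt.Println(validateScript("restart-agent.sh", "#!/bin/sh\necho ok\n", saved))
	fmt.Println(validateScript("collect-logs.sh", "#!/bin/sh\n", saved))
}
```

Checking the extension and name before the contents keeps the cheap checks first; the order of the checks is a design choice of this sketch, not something the checklist specifies.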
noahtalerman commented 1 year ago

@mike-j-thomas if you have the time, can you please help me with the following design problems? No worries if you can't get to these this week. Please let me know :)

  1. Awkward empty space in the empty state Screenshot 2023-01-27 at 5 08 45 PM

  2. Awkward alignment for Ran, Pending, and Error statuses

  3. Inconsistent aligning for Rerun, Download, and Pending buttons (1st screenshot v. 2nd) Screenshot 2023-01-27 at 5 09 29 PM Screenshot 2023-01-27 at 5 10 25 PM

Here's the link to the Figma page: https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=12488%3A329702

I walk through these problems in a Loom video here: https://www.loom.com/spaces/All-Fleet-67132

mike-j-thomas commented 1 year ago

Hey @noahtalerman, thanks for the video. I don't think the cards are working too well here, with multiple uploads. I've added my updated version below your pages in Figma.

Assuming you guys are happy with what I've come up with, I just need to finish up updating the configuration profile upload to match.

Let me know what you think.

noahtalerman commented 1 year ago

@roperzh do you know what shell/interpreter fleetd might use if the script doesn't specify this?

During today's design review, we decided to allow the user to add any script and not do validation on the shell or interpreter provided in the script. I would like to let the user know what happens if they don't specify a shell/interpreter (see TODO): Screenshot 2023-01-30 at 2 28 45 PM

As an example, in Kandji, if a shell isn't specified, the script runs in /bin/sh: Screenshot 2023-01-30 at 2 29 35 PM

noahtalerman commented 1 year ago

@mike-j-thomas your updated "list" version looks great. I think it makes sense to update the profiles UI to match. Also, we now would like a generic script icon (if the script isn't .bsh, .zsh, or .py). I recorded a Loom video that walks through these asks: https://www.loom.com/spaces/All-Fleet-67132

roperzh commented 1 year ago

@roperzh do you know what shell/interpreter fleetd might use if the script doesn't specify this?

@noahtalerman that's up to us, but I think using /bin/sh is a very good choice.

noahtalerman commented 1 year ago

that's up to us, but I think using /bin/sh is a very good choice.

@roperzh got it. Let's go with /bin/sh
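
A minimal sketch of the fallback agreed on above, assuming the interpreter is taken from the script's first line when it is a shebang and `/bin/sh` is used otherwise. `interpreterFor` and `defaultInterpreter` are hypothetical names for illustration; this is not how fleetd is confirmed to pick the interpreter.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultInterpreter is the fallback chosen in this thread.
const defaultInterpreter = "/bin/sh"

// interpreterFor is a hypothetical helper. It returns the interpreter path
// named on the script's shebang line, or defaultInterpreter if there is none.
func interpreterFor(script string) string {
	firstLine, _, _ := strings.Cut(script, "\n")
	if !strings.HasPrefix(firstLine, "#!") {
		return defaultInterpreter
	}
	// Fields drops surrounding whitespace and any interpreter arguments.
	fields := strings.Fields(strings.TrimPrefix(firstLine, "#!"))
	if len(fields) == 0 {
		return defaultInterpreter
	}
	return fields[0]
}

func main() {
	fmt.Println(interpreterFor("echo hello\n"))             // no shebang: /bin/sh
	fmt.Println(interpreterFor("#!/bin/zsh\necho hello\n")) // explicit shebang: /bin/zsh
}
```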

lukeheath commented 1 year ago

@noahtalerman Specs look good to me! 👍

noahtalerman commented 1 year ago

@lukeheath heads up, I moved this issue and these 2 script execution stories into the designed column:

noahtalerman commented 1 year ago

@mike-j-thomas your updated "list" version looks great. I think it makes sense to update the profiles UI to match. Also, we now would like a generic script icon (if the script isn't .bsh, .zsh, or .py). I recorded a Loom video that walks through these asks: https://www.loom.com/spaces/All-Fleet-67132

@mike-j-thomas please let me know if you don't have the time to get to this today^ (no worries if you're busy)

This way, I can make the changes before MDM estimation tomorrow.

mike-j-thomas commented 1 year ago

Hey @noahtalerman. Figma's updated. Thanks 🤘🏻

noahtalerman commented 1 year ago

@mike-j-thomas thank you!

lukeheath commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @gillespi314 @mna @roperzh

noahtalerman commented 1 year ago

Hey @georgekarrv heads up, during design review today we cut the error message if the user uploads something other than a plain text script (ex. binary).

Today, we don't have any way to check this. If we let a binary through, the user should see an error.

cc @roperzh @marko-lisica

noahtalerman commented 1 year ago

Hey @marko-lisica heads up for when you're back. I added some UI features:

mna commented 1 year ago

@noahtalerman @georgekarrv

From the ticket's description:

User can see the entire script output (not limited to 10,000 characters).

Note that the 10 000 characters limit is not just to protect performance for display, it is also to prevent transferring huge payloads over the network and storing them in the DB, so the 10 000 characters is applied to saving into the database (we don't store more than 10K chars).
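
To illustrate the point about limiting what is saved, here is a minimal sketch that caps output at 10,000 characters before it would be written to the database. `truncateOutput` and `maxOutputRunes` are hypothetical names, and which end of the output is kept is an assumption: this sketch keeps the tail, since the end of a script's output usually holds the final result or error.

```go
package main

import "fmt"

// maxOutputRunes is the 10,000-character storage limit discussed above.
const maxOutputRunes = 10000

// truncateOutput keeps at most limit characters (runes) from the end of output.
// Counting runes rather than bytes avoids cutting a multi-byte character in half.
func truncateOutput(output string, limit int) string {
	runes := []rune(output)
	if len(runes) <= limit {
		return output
	}
	return string(runes[len(runes)-limit:])
}

func main() {
	hostOutput := "line 1\nline 2\nline 3\n"
	// Well under the limit, so the output is stored unchanged.
	fmt.Printf("%q\n", truncateOutput(hostOutput, maxOutputRunes))
	// A tiny limit for demonstration: only the last 7 characters survive.
	fmt.Printf("%q\n", truncateOutput(hostOutput, 7))
}
```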

noahtalerman commented 1 year ago

Note that the 10 000 characters limit is not just to protect performance for display, it is also to prevent transferring huge payloads over the network and storing them in the DB

@mna got it. Thanks for clarifying.

Sounds like we need a new error message when the user tries to upload a script that's over 10,000 characters. I added these to the Figma:

UI: Screenshot 2023-08-30 at 5 34 20 PM

CLI: Screenshot 2023-08-30 at 5 36 55 PM

cc @georgekarrv

georgekarrv commented 1 year ago

This feature will only support macOS hosts.

noahtalerman commented 1 year ago

Mike: What happens if Fleet is stuck in this lock state? How do we clear it? When we send a new script?

Mike: I like the vision of refusing to add a queue for these, until we understand a really strong reason. I like that we're challenging the assumption and also asking... wait: "Why would I ever even want to run two scripts at the same time on a host? Why would I want to have to manage a queue of scripts for a given host? That's really complicated."

Noah: We do it for profiles, but they're declarative. (whether or not it's DDM)

Noah: We do it for MDM commands (queuing).

Mike: Maybe we shouldn't. Wait no I'm wrong: the reason why we should is because you want the remote lock to be able to activate as soon as they boot.

Mike: But that's still a queue of one. Maybe we only need a queue of one. That's what we've designed for scripts, with this little lock

I think it would be fine to change this so that you only run one MDM command at a time. Can only queue one MDM command at a time.

TODO Noah: Find out whether there is a reason to have more than one MDM command running at the same time, or to queue more than one MDM command at the same time.

noahtalerman commented 1 year ago

What happens if Fleet is stuck in this lock state? How do we clear it? When we send a new script?

  • Noah: Maybe when you delete the script (so the workaround training would be "If you have a script that says it's still running after like 10 minutes and it says to wait a minute but it's still in that state, then you can fix it by deleting the script"). ^get that into the docs somewhere
  • Mike: This matches the convention we've established for clearing the query clip lock (CX), i.e. edit SQL to get it to start collecting fresh results again instead of ignoring them due to having hit the clip limit

@mna do you know what happens if Fleet fails to remove an old script from the queue? Is the IT admin unable to run any more scripts against the hosts? More broadly, will this ever happen / is this a valid concern?

If it is, does the proposed "delete the script to clear the lock" solution make sense? Are there other good solutions?

FWIW the MDM team discussed this solution briefly on Friday and we decided not to do this for now because we don't want to overload the delete action. I'm not convinced this is a good reason not to do it. I forgot that CX might be doing something similar.

mna commented 1 year ago

@noahtalerman I'm not sure what lock we're talking about in the new context of saved scripts, I'm going to assume this is the same "lock" that we implemented in the initial scripts execution feature (cannot run another script if a previous one is still running).

The way the lock is implemented in this case is that it expires automatically, currently after 1 minute, so if you send a request to run a script (sync or async) and for some reason we haven't received results from the host after 1 minute, we will stop sending this script to the host for execution and will start allowing another script execution request.

In this scenario, there's not really a need to "delete the script to clear the lock" because it will automatically expire (or it will end up getting results before expiration, both of which resolve the locking).

I'm pretty sure I'm not fully answering your question, sorry, I think we're discussing something a bit different...

noahtalerman commented 1 year ago

I'm going to assume this is the same "lock" that we implemented in the initial scripts execution feature

@mna that's right.

it expires automatically, currently after 1 minute

If I'm understanding correctly, we have some scheduled job/cron that runs every minute and checks, for each host, if there are scripts that we haven't gotten results for. We clear the old scripts and allow new scripts. Right?

So, if for some reason, Fleet is stuck in a state where we aren't clearing the old script for a host, what's the failsafe?

The "delete the script" idea was one failsafe I thought of. But we might already have one in place?

The concern here is that a host, for some reason, gets stuck in that state and the IT admin has no way to send scripts to it anymore...

mna commented 1 year ago

@noahtalerman

If I'm understanding correctly, we have some scheduled job/cron that runs every minute and checks, for each host, if there are scripts that we haven't gotten results for. We clear the old scripts and allow new scripts.

No, in fact there's no "explicit" lock, it's an implicit one that is time-based and automatically expires as time passes - Fleet cannot be stuck with such a lock. We only allow new script executions if there is not already one that is less than a minute old. As soon as the previous execution request is older than a minute, a new script execution is automatically allowed so we don't need a failsafe in that scenario. The only case where it could be problematic would be if the user does not want to wait a minute, but given how short that expiration is, it's probably not a problem in practice.
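
The implicit lock described above can be sketched as a pure function of timestamps, with no stored lock state to get stuck. This is a minimal sketch under stated assumptions: the `execution` type and `canRunScript` are hypothetical names, and the rule that a received result also frees the host comes from the earlier comment in this thread, not from Fleet's code.

```go
package main

import (
	"fmt"
	"time"
)

// lockWindow is the one-minute expiry mentioned in this thread.
const lockWindow = time.Minute

// execution is a hypothetical record of one script execution request.
type execution struct {
	requestedAt time.Time
	hasResult   bool // true once the host has reported results
}

// canRunScript reports whether a new execution request is allowed for a host,
// given that host's most recent request (nil if it has never run a script).
func canRunScript(last *execution, now time.Time) bool {
	if last == nil || last.hasResult {
		return true
	}
	// Nothing to clear: the "lock" is simply the age of the last request.
	return now.Sub(last.requestedAt) >= lockWindow
}

func main() {
	start := time.Date(2023, time.September, 1, 12, 0, 0, 0, time.UTC)
	pending := &execution{requestedAt: start}

	fmt.Println(canRunScript(pending, start.Add(30*time.Second))) // false: still inside the window
	fmt.Println(canRunScript(pending, start.Add(61*time.Second))) // true: the window has passed
}
```

Because the check only compares the current time with the last request time, no cleanup job or manual "unlock" action is needed for a new request to be accepted.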

We don't have a cleanup job of old script execution requests at the moment (I'm not sure if we need/want one, as it would cause issues with the activities - they would fail to get the details when the user would click on "show details" in the activity stream, unless we also cleanup associated activities but I don't think we want that?).

noahtalerman commented 1 year ago

it's an implicit one that is time-based and automatically expires as time passes - Fleet cannot be stuck with such a lock.

@mna ah, ok! Thanks for the explanation. Understood we don't need a failsafe.

We don't have a cleanup job of old script execution requests at the moment

Got it. I agree we don't want one. Good point about how having one would affect the historical activities.

noahtalerman commented 1 year ago

TODO Noah: Find out whether there is a reason to have more than one MDM command running at the same time, or to queue more than one MDM command at the same time.

Noah: We need a queue to enable configuration profile workflows. For example, if I transfer a host from Team A to Team B, and Team B has many profiles, Fleet needs to send one command per profile.

If the host goes offline while half of the profiles are installed, IT admins expect the other half to be installed when the host comes online.

@roperzh any other reasons you can think of for why we need to queue MDM commands for Mac?

I want to question the assumption that we need a queue before we commit to building one for Windows. We didn't build one for scripts (I think because the current UX doesn't require one).

roperzh commented 1 year ago

@noahtalerman it's the way the macOS MDM protocol works: you can only send one command at a time.

The flow is: 1. you send a command, 2. you wait for a response, 3. you optionally send another command or close the connection

From the docs:

When the server receives a response from the device, it can either reply with the next command or end the connection by sending a 200 status (OK) with an empty response body. [...] Don’t consider a command accepted and executed by a device until the server receives the Acknowledged or Error status with the command UUID in the message. Until then, leave the last command on the queue.
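
The flow quoted above can be sketched as a per-device queue whose head command is removed only after the device reports Acknowledged or Error for that command's UUID. This is an illustrative sketch, not Fleet's MDM implementation: the `command` and `queue` types and their methods are hypothetical, and transport details (check-ins, plist encoding) are omitted.

```go
package main

import "fmt"

// command is a hypothetical queued MDM command.
type command struct {
	UUID string
	Name string
}

// queue holds one device's pending commands in send order.
type queue struct {
	pending []command
}

// next returns the command to send without removing it. When ok is false the
// server would end the connection with a 200 status and an empty body.
func (q *queue) next() (cmd command, ok bool) {
	if len(q.pending) == 0 {
		return command{}, false
	}
	return q.pending[0], true
}

// handleResponse removes the head command only when the device reports
// Acknowledged or Error for its UUID; any other status leaves it queued.
func (q *queue) handleResponse(uuid, status string) {
	if len(q.pending) == 0 || q.pending[0].UUID != uuid {
		return
	}
	if status == "Acknowledged" || status == "Error" {
		q.pending = q.pending[1:]
	}
}

func main() {
	q := &queue{pending: []command{
		{UUID: "cmd-1", Name: "InstallProfile"},
		{UUID: "cmd-2", Name: "InstallProfile"},
	}}
	for {
		cmd, ok := q.next()
		if !ok {
			fmt.Println("queue empty: reply 200 with an empty body")
			return
		}
		fmt.Println("send", cmd.UUID, cmd.Name)
		q.handleResponse(cmd.UUID, "Acknowledged") // simulated device response
	}
}
```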

roperzh commented 1 year ago

@noahtalerman a thought: I think for MDM you'll end up building a queue anyways, even if you don't explicitly call it a queue.

Example: after enrollment we want the Fleet server to automatically do stuff on the device (made up examples):

  1. Configure something in the local account
  2. Install an application
  3. We also want to allow users to send commands on their own (like a customer does with Puppet immediately post-enrollment)

How do you manage all those things happening at the same time?

noahtalerman commented 1 year ago

How do you manage all those things happening at the same time?

@roperzh yeah, makes sense that we'll need a queue for Windows MDM commands. Thanks.

I guess for scripts we don't need a queue right now because we don't expect Fleet to run some scripts automatically for the Fleet user while enabling the Fleet user to run scripts of their own. It's just the latter for now.

That said, I think we do run some scripts for the Fleet user right? For example, the script that runs the profiles renew command. I'm guessing this script runs as part of a separate flow. They're not affected by scripts run by the Fleet user.

roperzh commented 1 year ago

That said, I think we do run some scripts for the Fleet user right? For example, the script that runs the profiles renew command. I'm guessing this script runs as part of a separate flow. They're not affected by scripts run by the Fleet user.

This is correct! I think it depends a little bit on how you frame things (for example: fleetd technically runs a script to start osquery and Fleet Desktop, some osquery extension tables technically run scripts to get their output, etc.)

As you mention they are part of a separate flow, some key differences:

  1. The scripts we run are "hardcoded" in the fleetd binary, we run them when:
    1. fleetd needs to know something about the system
    2. The Fleet Server asks fleetd to run the script
  2. The Fleet server never expects a return value, nor does it care about the exit code of those scripts. Coming back to point 1.1 above, it only asks fleetd to run the script.
  3. From all of the above, those scripts run at different times, sometimes simultaneously
  4. We have more freedom to handle scripts, because we don't need to use a specific protocol like we do for MDM

I hope that helps!

noahtalerman commented 10 months ago

C&C:

@noahtalerman to check fleetdm.com/pricing and documentation v. guide.

noahtalerman commented 10 months ago

C&C: @noahtalerman to check fleetdm.com/pricing and see if we can fold into existing scripts doc page. While making more removals than additions.

noahtalerman commented 9 months ago

Doc updates are here: #15416

fleet-release commented 9 months ago

Scripts stored in cloud, Fleet aids with swift issue solve. Quiet as the snow.