fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

GitOps timeout when uploading software #22069

Closed lucasmrod closed 1 month ago

lucasmrod commented 1 month ago

Fleet version: v4.56.0 and before (basically since addition of GitOps for uploading software)


🧑‍💻  Steps to reproduce

Define many pieces of software on a single team (e.g. 5 or more of 200 MB+) using GitOps (teams/b.yml) and then run gitops.sh.

FLEET_URL=... FLEET_B_ENROLL_SECRET=... FLEET_GLOBAL_ENROLL_SECRET=... ./gitops.sh
[...]
[+] applying 5 software packages for team B
Error: applying software installers for team "B": POST /api/latest/fleet/software/batch: do request: Post "https://localhost:8080/api/latest/fleet/software/batch?team_name=B": stream error: stream ID 45; INTERNAL_ERROR; received from peer

The issue is that by default Fleet will terminate requests that take more than 100 seconds. See how we set srv.WriteTimeout here: https://github.com/fleetdm/fleet/blob/bfeeba10cd9e35623c6680f429ff931b90ffa83b/cmd/fleet/serve.go#L1203-L1228
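
For readers unfamiliar with how Go's net/http enforces this, here is a minimal sketch of a server with a 100-second write timeout. The `newServer` helper and the address are hypothetical and are not taken from Fleet's serve.go; only the `http.Server.WriteTimeout` field is the real standard-library knob being discussed.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer is a hypothetical helper (not Fleet's code). WriteTimeout bounds
// how long the server may spend writing a response, which in practice also
// bounds how long a handler can run before the connection is cut.
func newServer(addr string, h http.Handler, writeTimeout time.Duration) *http.Server {
	return &http.Server{
		Addr:         addr,
		Handler:      h,
		WriteTimeout: writeTimeout,
	}
}

func main() {
	// 100s mirrors the default described in this issue; the address is illustrative.
	srv := newServer("localhost:8080", http.NotFoundHandler(), 100*time.Second)
	fmt.Println("write timeout:", srv.WriteTimeout)
}
```

Any handler that needs longer than this value, such as a batch upload that first downloads several large installers, gets cut off regardless of what the handler itself is doing.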

🕯️ More info (optional)

Whatever solution we come up with, we have to check that we are not breaking the synchronous live query API that terminates after 90s or so (see FLEET_LIVE_QUERY_REST_PERIOD in the code linked above).

lucasmrod commented 1 month ago

@noahtalerman @sharon-fdm We should prioritize this released bug because it may prevent customers from using the GitOps feature to upload software to their teams (it works well with 4 or 5 pieces of software and starts to time out with 10+ of 100 MB+ each, depending on the infrastructure's network download speed).

sharon-fdm commented 1 month ago

@lukeheath, @noahtalerman, this is affecting the Software area. Recommend we put a P2 for next sprint. @lucasmrod, can you guesstimate this? If that's a small fix maybe add to 4.57?

lucasmrod commented 1 month ago

3 pts to: remove the existing 100s write timeout + make sure the synchronous live queries API continues to terminate after the 90s timeout + check nothing else breaks.
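
One possible way to keep a per-route limit once the server-wide write timeout is gone is to wrap only that route with `http.TimeoutHandler` from the standard library. This is a sketch under assumptions, not necessarily what Fleet shipped: `slowHandler`, `withLimit`, `statusAfter`, and `demoStatus` are hypothetical names, and the durations are shortened so the example runs quickly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowHandler simulates a long-running request such as a synchronous live query.
func slowHandler(d time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(d):
			fmt.Fprint(w, "done")
		case <-r.Context().Done():
			// TimeoutHandler cancels the request context once the limit is hit.
		}
	})
}

// withLimit caps a single route without relying on a server-wide timeout.
func withLimit(h http.Handler, limit time.Duration) http.Handler {
	return http.TimeoutHandler(h, limit, "request timed out")
}

// statusAfter runs h once in-process and reports the HTTP status code.
func statusAfter(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

// demoStatus takes millisecond values to keep call sites short.
func demoStatus(handlerMs, limitMs int) int {
	h := slowHandler(time.Duration(handlerMs) * time.Millisecond)
	return statusAfter(withLimit(h, time.Duration(limitMs)*time.Millisecond))
}

func main() {
	fmt.Println(demoStatus(200, 20)) // handler outlives its limit
	fmt.Println(demoStatus(1, 200))  // handler finishes in time
}
```

With this shape, the live query route could keep a cap of roughly 90s while the batch upload route is left uncapped or given a longer one.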

noahtalerman commented 1 month ago

Hey @sharon-fdm and @lucasmrod, makes sense to me.

Also heads up that anyone can add P- labels.

For future issues, please feel free to add P2 and ping @lukeheath per the process here: https://fleetdm.com/handbook/company/communications#high-priority-user-stories-and-bugs

I added P2.

sharon-fdm commented 1 month ago

@noahtalerman, looks like @lucasmrod will manage to push this by EOD today, so if all goes well this could hit 4.57.0.

lukeheath commented 1 month ago

P2 approved.

lucasmrod commented 1 month ago

@noahtalerman @lukeheath

This is a released bug where GitOps will time out at 100s while uploading software packages to a team. From discussions with the backend team it seems the synchronous POST /api/latest/fleet/software/batch API was designed+implemented (at the time) to support up to 5 packages per team.

Customers may hit this depending on:

(I hit this issue locally with 5/6 packages of 200 MBs+.)

A solution we discussed is to have a new asynchronous API (estimate to have a PR EoD today, maybe reviewed and merged tomorrow).

Let us know if we want this to block v4.57.0 given it might delay it an extra day or so.

noahtalerman commented 1 month ago

@lucasmrod instead of a new API endpoint, could we make the existing POST /software/batch API endpoint async?

I think up to you. Just throwing this out there.

cc @lukeheath

lucasmrod commented 1 month ago

Proposal

Workflow

  1. GitOps executes POST /api/latest/fleet/software/batch for uploading packages (same as today).
  2. Fleet generates a random UUID and stores the following item on Redis: (key=softwareBatch+team_id, value=UUID+status)
  3. Fleet will start the download+upload process of the URLs in the background.
  4. Fleet returns UUID to GitOps [0] request.
  5. GitOps runs GET /api/latest/fleet/software/batch?uuid=<UUID> on a for loop (with sleep) until status != "pending", status can be "completed" or "failed". If status is "completed" then process the returned software title ids and continue processing. If status is "failed" then fail GitOps.
  6. When Fleet finishes [3] it will set (key=softwareBatch+team_id, value=UUID+"completed" or "failed")
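
The workflow above can be sketched in-process as follows. This is a simplified illustration, not Fleet's implementation: an in-memory map stands in for Redis, entries are keyed by the request UUID alone rather than softwareBatch+team_id, the UUID is a fixed placeholder string, and all type and function names (`batchStore`, `startBatch`, `waitForBatch`, `runDemo`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type batchStatus string

const (
	statusPending   batchStatus = "pending"
	statusCompleted batchStatus = "completed"
	statusFailed    batchStatus = "failed"
)

// batchStore is an in-memory stand-in for the Redis entry in the proposal.
type batchStore struct {
	mu   sync.Mutex
	jobs map[string]batchStatus // request UUID -> status
}

func newBatchStore() *batchStore {
	return &batchStore{jobs: map[string]batchStatus{}}
}

func (s *batchStore) set(id string, st batchStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.jobs[id] = st
}

func (s *batchStore) get(id string) (batchStatus, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.jobs[id]
	return st, ok
}

// startBatch marks the job pending, runs the work in the background, and
// returns immediately; the final status is recorded when the work ends.
func (s *batchStore) startBatch(id string, work func() error) {
	s.set(id, statusPending)
	go func() {
		if err := work(); err != nil {
			s.set(id, statusFailed)
			return
		}
		s.set(id, statusCompleted)
	}()
}

// waitForBatch is the client-side loop: poll with a sleep until the status
// is no longer "pending", giving up after maxPolls attempts.
func waitForBatch(s *batchStore, id string, interval time.Duration, maxPolls int) (batchStatus, error) {
	for i := 0; i < maxPolls; i++ {
		st, ok := s.get(id)
		if !ok {
			return "", fmt.Errorf("unknown batch %q", id)
		}
		if st != statusPending {
			return st, nil
		}
		time.Sleep(interval)
	}
	return statusPending, errors.New("gave up waiting for batch")
}

// runDemo wires both sides together with a short simulated upload.
func runDemo(fail bool) (batchStatus, error) {
	store := newBatchStore()
	store.startBatch("example-uuid", func() error {
		time.Sleep(20 * time.Millisecond) // simulated download+upload
		if fail {
			return errors.New("simulated download error")
		}
		return nil
	})
	return waitForBatch(store, "example-uuid", 5*time.Millisecond, 200)
}

func main() {
	fmt.Println(runDemo(false))
	fmt.Println(runDemo(true))
}
```

Because the POST returns as soon as the job is recorded, no single request has to stay open for the full download+upload, so the 100s limit no longer applies to the slow part.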

lukeheath commented 1 month ago

@lucasmrod Thanks for the detailed proposal. That workflow makes sense to me, and I think it makes sense to hold the release for this. Please review the API changes with @rachaelshaw and make sure API docs are updated accordingly.

sharon-fdm commented 1 month ago

QA DRI - @iansltx

iansltx commented 1 month ago

QA Plan:

iansltx commented 1 month ago

On "too big + 404" we actually get "too big" for pulling MS Office...which is acceptable. They include a content-length header on the response so we can bail early rather than trying to download and getting stuck.
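
A minimal sketch of the "bail early" check described here, assuming a hypothetical size limit and hypothetical helper names (`declaredTooBig`, `checkDeclaredSize`); the actual limit and code in Fleet may differ. `http.Response.ContentLength` is the real standard-library field and is -1 when the server does not declare a length.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errTooBig = errors.New("installer exceeds size limit")

// declaredTooBig reports whether a declared length exceeds the limit.
// A negative length means "unknown" and is not rejected here.
func declaredTooBig(contentLength, maxBytes int64) bool {
	return contentLength > maxBytes
}

// checkDeclaredSize rejects a response before its body is read when the
// Content-Length header already shows that it is over the limit.
func checkDeclaredSize(resp *http.Response, maxBytes int64) error {
	if declaredTooBig(resp.ContentLength, maxBytes) {
		return fmt.Errorf("%w: declared %d bytes, limit %d", errTooBig, resp.ContentLength, maxBytes)
	}
	// When the length is unknown (-1), the caller should still cap the read,
	// for example with io.LimitReader.
	return nil
}

func main() {
	const maxBytes = 500 << 20 // hypothetical limit, for illustration only

	big := &http.Response{ContentLength: 2 << 30}
	fmt.Println(checkDeclaredSize(big, maxBytes))

	unknown := &http.Response{ContentLength: -1}
	fmt.Println(checkDeclaredSize(unknown, maxBytes))
}
```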

fleet-release commented 1 month ago

Upload, without scare,
GitOps streams through the air,
Time-outs, now rare.