Closed lucasmrod closed 1 month ago
@noahtalerman @sharon-fdm We should prioritize this released bug because it may prevent customers from using the GitOps feature to upload software to their teams (it works well with 4 or 5 pieces of software, but starts to time out with 10+ packages of 100 MB+ each, depending on the infrastructure's network download speed).
@lukeheath, @noahtalerman, this is affecting the Software area. Recommend we put a P2 for next sprint. @lucasmrod, can you guesstimate this? If that's a small fix maybe add to 4.57?
3 pts to: remove the existing 100s write timeout + make sure the synchronous live queries API continues to terminate after the 90s timeout + check nothing else breaks.
Hey @sharon-fdm and @lucasmrod, makes sense to me.
Also heads up that anyone can add P- labels.
For future issues, please feel free to add P2 and ping @lukeheath per the process here: https://fleetdm.com/handbook/company/communications#high-priority-user-stories-and-bugs
I added P2.
@noahtalerman, looks like @lucasmrod will manage to push this by EOD today, so if all goes well this could hit 4.57.0.
P2 approved.
@noahtalerman @lukeheath
This is a released bug where GitOps will time out at 100s while uploading software packages to a team.
From discussions with the backend team it seems the synchronous POST /api/latest/fleet/software/batch API was designed+implemented (at the time) to support up to 5 packages per team.
Customers may hit this depending on:
(I hit this issue locally with 5/6 packages of 200 MBs+.)
A solution we discussed is to have a new asynchronous API (estimate to have a PR EoD today, maybe reviewed and merged tomorrow).
Let us know if we want this to block v4.57.0 given it might delay it an extra day or so.
@lucasmrod instead of a new API endpoint, could we make the existing POST /software/batch API endpoint async?
I think up to you. Just throwing this out there.
cc @lukeheath
Make POST /api/latest/fleet/software/batch asynchronous.
GET /api/latest/fleet/software/batch?uuid=<UUID> (or some other path). Returns status and, if status = "completed", then it returns the applied software (GitOps needs the title IDs of the applied software for linking these to policies at a later stage).
POST /api/latest/fleet/software/batch for uploading packages (same as today).
(key=softwareBatch+team_id, value=UUID+status)
GET /api/latest/fleet/software/batch?uuid=<UUID> in a for loop (with sleep) until status != "pending"; status can be "completed" or "failed". If status is "completed", then process the returned software title IDs and continue processing. If status is "failed", then fail GitOps.
(key=softwareBatch+team_id, value=UUID+"completed" or "failed")
The softwareBatch+team_id key/values in Redis have an expiration of 1 hour in case Fleet crashes or is terminated while the background process is running.
@lucasmrod Thanks for the detailed proposal. That workflow makes sense to me, and I think it makes sense to hold the release for this. Please review the API changes with @rachaelshaw and make sure API docs are updated accordingly.
QA DRI - @iansltx
QA Plan:
On "too big + 404" we actually get "too big" for pulling MS Office...which is acceptable. They include a content-length header on the response so we can bail early rather than trying to download and getting stuck.
Upload, without scare, GitOps streams through the air, Time-outs, now rare.
Fleet version: v4.56.0 and before (basically since addition of GitOps for uploading software)
🧑💻 Steps to reproduce
Define many pieces of software on a single team (e.g. 5 or more of 200 MB+) using GitOps (teams/b.yml) and then run gitops.sh.
The issue is that by default Fleet will terminate requests that take more than 100 seconds. See how we set srv.WriteTimeout here: https://github.com/fleetdm/fleet/blob/bfeeba10cd9e35623c6680f429ff931b90ffa83b/cmd/fleet/serve.go#L1203-L1228
🕯️ More info (optional)
Whatever solution we come up with, we have to check that we are not breaking the synchronous live query API that terminates after 90s or so (see FLEET_LIVE_QUERY_REST_PERIOD in the code linked above).