fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Deleting large number of hosts through UI enters broken state #14097

Closed: xpkoala closed this issue 1 year ago

xpkoala commented 1 year ago

Fleet version: fleet-v4.37.0-132-gb6f072775-dirty

Web browser and operating system: Safari 16.2 (OS: n/a)


💥  Actual behavior

When deleting a large number of hosts, the waiting modal eventually breaks, but the host deletions continue in the background.

🧑‍💻  Steps to reproduce

  1. Enroll a large number of hosts (10k or so) in your Fleet instance
  2. Select all hosts and choose to delete
  3. Wait on the page until the modal breaks

🕯️ More info (optional)

The video and screenshot were taken at different times, hence the host count being off between the two.

Seen below is a view of the modal breaking. [Screenshot: 2023-09-22 at 2:08 PM]

After the above break, the user can refresh the page and see that the delete action continues in the background: https://github.com/fleetdm/fleet/assets/11012743/6c3beeec-2dc8-4e8b-91ad-7af0b87077f4

🛠️ To fix

Back-end:

Let's go with @getvictor's suggestion:

For the delete endpoint, we can add a custom timeout that returns a 202 (Accepted) response. This would be in addition to the larger global timeout, which would return a 503.

If the frontend sees a 202 response, it can inform the user that the operation continues to run in the background.
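A minimal sketch of this two-tier timeout on the Go side, assuming a hypothetical handler, illustrative durations, and an illustrative endpoint path (not Fleet's actual implementation):

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// Hypothetical per-endpoint limit after which we return 202 Accepted.
const deleteHostsTimeout = 30 * time.Second

func deleteHostsHandler(w http.ResponseWriter, r *http.Request) {
	done := make(chan error, 1)

	// Detach the bulk delete from the request context so it keeps
	// running even after we respond to the client.
	go func() { done <- deleteHosts(context.Background()) }()

	select {
	case err := <-done:
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	case <-time.After(deleteHostsTimeout):
		// Still running: tell the client it continues in the background.
		w.WriteHeader(http.StatusAccepted) // 202
	}
}

// deleteHosts stands in for the long-running bulk delete.
func deleteHosts(ctx context.Context) error {
	time.Sleep(2 * time.Minute)
	return nil
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/fleet/hosts/delete", deleteHostsHandler)
	// The larger global timeout, which returns a 503 to any request
	// that exceeds it.
	handler := http.TimeoutHandler(mux, 5*time.Minute, "request timed out")
	http.ListenAndServe(":8080", handler)
}
```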

Front-end:

Update "Delete host" modal to show an additional message when a large number of hosts is selected: Figma

lucasmrod commented 1 year ago

IIRC this is an issue that also happens on previous releases. (Please correct me if I'm wrong.)

lucasmrod commented 1 year ago

IIRC the frontend is making two delete-hosts API calls (it should be making only one).

sharon-fdm commented 1 year ago

@RachelElysia: If this is a FE issue, it's 1 pt.

sharon-fdm commented 1 year ago

@lucasmrod: I can provide a script to create 100 hosts (to be deleted for repro).

RachelElysia commented 1 year ago

So it looks like this might be a Chrome-specific issue with network connections.

[Two private Zenhub images]

I looked up how to fix the net::ERR_NETWORK_CHANGED error showing up in the Chrome console tab, since there were no errors surfacing in the Chrome network tab, and came across a video saying it might be Chrome-specific. Next steps: test on Safari and Firefox, and try modifying my network settings to see if that fixes it.

RachelElysia commented 1 year ago

This time on Chrome, it was a different error:

[Private Zenhub image]

A quick Google search suggests: "It turns out that some webservers will cut the connection to a client if the server's data throughput to the client passes below a certain limit. This is to protect against "slow drip" denial of service attacks. However, this limit can also be triggered in cases where an innocent user requests many resources all at once (such as lots of images on a single page), and the server is forced to ration the bandwidth for each request so much that it causes one or more requests to drop below the throughput limit, which causes the server to cut the connection and shows up as net::ERR_HTTP2_PROTOCOL_ERROR in Chrome."

RachelElysia commented 1 year ago

Update: I had a similar issue trying to delete 3,000 hosts on Firefox. It successfully deleted 2,708 hosts before erroring out.

[Private Zenhub image]

lucasmrod commented 1 year ago

Seems it might be a backend issue after all?

RachelElysia commented 1 year ago

Same issue with Safari (see screenshot below)

My conclusion is that it's 1) not a frontend issue and 2) not a browser issue.

My best guess is that it might be a webserver limit issue, but I have no experience on that end.

@sharon-fdm, can I reassign this to you to chat with the infra team and dig into it?

[Private Zenhub image]

lucasmrod commented 1 year ago

I would assign this to backend: AFAICS you are reproducing on localhost, so there's no infra involved here.

getvictor commented 1 year ago

Need product input.

This behavior is due to our timeout settings on the Go server, namely http.Server's ReadTimeout and WriteTimeout. They result in the server killing the connection to the client after ~1.5 minutes. However, the delete operation continues to run on the server.
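For reference, a minimal sketch of the kind of server-wide timeouts described above; the values are illustrative, not Fleet's actual settings:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:        ":8080",
		ReadTimeout: 25 * time.Second, // illustrative
		// Roughly the ~1.5-minute cutoff observed above.
		WriteTimeout: 90 * time.Second,
	}
	// Any handler that takes longer than WriteTimeout has its connection
	// closed by the server, while the handler goroutine keeps running.
	log.Fatal(srv.ListenAndServe())
}
```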

Since we are exposing our API, we should keep the timeouts to guard against buggy or uncooperative clients.

Recommended backend fix: instead of killing the connection, use http.TimeoutHandler to return a 503 (that's the status that library method returns).
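A sketch of that recommendation, with an illustrative path and deadline. http.TimeoutHandler buffers the wrapped handler's output and writes a 503 (Service Unavailable) with the given message once the deadline passes, while the wrapped handler keeps running until it returns:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Minute) // stand-in for a large bulk delete
	})
	mux := http.NewServeMux()
	mux.Handle("/slow", http.TimeoutHandler(slow, 90*time.Second, "operation timed out"))
	http.ListenAndServe(":8080", mux)
}
```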

On the frontend, we cannot wait longer than our timeout for a response. One quick fix is to tell the user that the operation is taking a while and will continue in the background.

My debug notes for reference: https://docs.google.com/document/d/1vsF29Bhk5MiTrApfkTMJ8IRc293_gsIo32UiWomNg18/edit

getvictor commented 1 year ago

@rachaelshaw Additional thoughts on this issue:

This timeout "feature" is present for REST API users, and probably for command-line users as well. We should do a scan of the API to see what other endpoints may be slow when modifying thousands of records.

For the delete endpoint, we can add a custom timeout that returns a 202 (Accepted) response. This would be in addition to the larger global timeout, which would return a 503.

If the frontend sees a 202 response, it can inform the user that the operation continues to run in the background.

rachaelshaw commented 1 year ago

Let's go with @getvictor's suggestion:

For the delete endpoint, we can add a custom timeout that returns a 202 (Accepted) response. This would be in addition to the larger global timeout, which would return a 503.

If the frontend sees a 202 response, it can inform the user that the operation continues to run in the background.

Here is the UI change to go along with it: Figma

(I'll add this to the issue description as well)

rachaelshaw commented 1 year ago

@sharon-fdm: there is now an additional UI element to this; it may need re-estimation.

sharon-fdm commented 1 year ago

Estimation for FE+BE: 2 pt

xpkoala commented 1 year ago

@getvictor I just encountered this error when attempting to delete 503 hosts from a local environment.

```json
{
  "message": "Bad request",
  "errors": [
    {
      "name": "base",
      "reason": "list of ids or filters must be specified"
    }
  ],
  "uuid": "b26e390d-6163-4b37-9450-07b7c6c3934f"
}
```
[Screenshot: 2023-11-03 at 1:54 PM]

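The message suggests the request body arrived without either of the fields the endpoint requires. A sketch of the implied validation, with assumed (not Fleet's actual) type and field names:

```go
package main

import "errors"

// deleteHostsRequest is an assumed shape for the bulk-delete payload:
// the caller must supply either explicit host ids or a filter.
type deleteHostsRequest struct {
	IDs     []uint         `json:"ids"`
	Filters map[string]any `json:"filters"`
}

func (r deleteHostsRequest) validate() error {
	if len(r.IDs) == 0 && len(r.Filters) == 0 {
		// Matches the "Bad request" reason shown above.
		return errors.New("list of ids or filters must be specified")
	}
	return nil
}

func main() {}
```
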
xpkoala commented 1 year ago

Second PR has this looking good!

fleet-release commented 1 year ago

Bulk delete mishap,
Now in the cloud, hosts sleep,
Fleet ensures no lapse.