Closed: xpkoala closed this issue 1 year ago
IIRC this issue also happens on previous releases. (Please correct me if I'm wrong.)
IIRC the frontend is making two delete hosts API calls (should be making only one).
@RachelElysia : If this is a FE issue it's a 1pt
@lucasmrod : I can provide a script to create 100 hosts (to be deleted for repro)
So it looks like this might be a Chrome specific issue with network connections.
(Two private Zenhub images)
I looked up how to fix the net::ERR_NETWORK_CHANGED error showing up in Chrome's console tab, since there were no errors surfacing in Chrome's network tab, and came across a video saying it might be Chrome specific.
Next steps: test on Safari and Firefox, and try modifying my network settings to see if that fixes it.
This time in Chrome, it was a different error:
A quick Google search suggests: "It turns out that some webservers will cut the connection to a client if the server's data throughput to the client passes below a certain limit. This is to protect against "slow drip" denial of service attacks. However, this limit can also be triggered in cases where an innocent user requests many resources all at once (such as lots of images on a single page), and the server is forced to ration the bandwidth for each request so much that it causes one or more requests to drop below the throughput limit, which causes the server to cut the connection and shows up as net::ERR_HTTP2_PROTOCOL_ERROR in Chrome."
Update: Had similar issue trying to delete 3000 hosts on FF. Successfully finished deleting 2708 hosts before erroring out.
Seems it might be a backend issue after all?
Same issue with Safari (see screenshot below)
My conclusion is that it's (1) not a front-end issue and (2) not a browser issue.
My best guess is it might be a webserver limit issue but I have no experience on that end.
@sharon-fdm can I reassign this to you to chat with the infra team to dig into it?
I would assign this to backend, because you are reproducing on localhost AFAICS, so there's no infra involved here.
Need product input.
This behavior is due to our timeout settings on the Go server, like http.ReadTimeout and http.WriteTimeout. This results in the server killing the connection to the client after ~1.5 minutes. However, the delete operation continues to run on the server.
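For context, here is a minimal sketch (not Fleet's actual server setup; the endpoint path, port, and 90-second values are placeholders) of how those http.Server timeouts produce this behavior: once the write deadline passes, the connection to the client is cut, but the handler keeps running.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Stand-in for a bulk host delete that takes longer than the write timeout.
	mux.HandleFunc("/hosts/delete", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Minute) // simulates deleting thousands of hosts
		fmt.Fprintln(w, `{"deleted": true}`)
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// With these settings the client's connection is dropped roughly 90
		// seconds in, but the handler goroutine keeps running to completion.
		ReadTimeout:  90 * time.Second,
		WriteTimeout: 90 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```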
Since we are exposing our API, we should keep the timeouts to guard us against buggy/uncooperative clients.
Recommended backend fix: instead of killing the connection, use http.TimeoutHandler to return a 503 (that's what that library method returns).
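A minimal sketch of that approach, assuming a plain net/http setup (the handler body, 90-second duration, and message are illustrative, not Fleet's actual values):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Minute) // stand-in for a long-running bulk delete
		fmt.Fprintln(w, `{"deleted": true}`)
	})

	// http.TimeoutHandler replies with 503 Service Unavailable and the given
	// body when the wrapped handler doesn't finish within the duration,
	// instead of silently dropping the connection.
	handler := http.TimeoutHandler(slow, 90*time.Second, `{"message": "server timeout"}`)

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```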
On the frontend, we cannot wait longer than our timeout for a response. One quick fix is to tell the user that the operation is taking a while and will continue in the background.
My debug notes for reference: https://docs.google.com/document/d/1vsF29Bhk5MiTrApfkTMJ8IRc293_gsIo32UiWomNg18/edit
@rachaelshaw Additional thoughts on this issue:
This timeout "feature" is present for REST API users and probably for command-line users as well. We should do a scan of the API to see what other methods may be slow when modifying thousands of records.
For delete endpoint, we can add a custom timeout to return a 202 (Accepted) response. This would be in addition to the larger global timeout that would return a 503.
If frontend sees a 202 response, they can inform user that the operation continues to run in the background.
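As a rough illustration of that idea, here is a hypothetical net/http middleware sketch (the acceptAfter name, the 30-second soft timeout, and the handler are made up for this example and are not Fleet's implementation) that buffers the response and answers 202 if the work is still running:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// acceptAfter runs the wrapped handler and, if it has not finished within
// `soft`, replies 202 Accepted so the client knows the work continues in the
// background. A larger global timeout (the 503 guard) would still apply.
func acceptAfter(soft time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		done := make(chan struct{})
		rec := httptest.NewRecorder() // buffer the real response off to the side

		go func() {
			// Detach from the request context so the delete outlives the response.
			next.ServeHTTP(rec, r.WithContext(context.Background()))
			close(done)
		}()

		select {
		case <-done:
			// Finished in time: forward the buffered response as-is.
			for k, v := range rec.Header() {
				w.Header()[k] = v
			}
			w.WriteHeader(rec.Code)
			w.Write(rec.Body.Bytes())
		case <-time.After(soft):
			// Still running: tell the client the operation was accepted.
			w.WriteHeader(http.StatusAccepted)
		}
	})
}

func main() {
	slowDelete := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Minute) // stand-in for deleting thousands of hosts
		fmt.Fprintln(w, `{"deleted": true}`)
	})

	// Reply 202 if the delete is still running after 30 seconds.
	log.Fatal(http.ListenAndServe(":8080", acceptAfter(30*time.Second, slowDelete)))
}
```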
Let's go with @getvictor's suggestion:
For the delete endpoint, we can add a custom timeout that returns a 202 (Accepted) response. This would be in addition to the larger global timeout that would return a 503.
If the frontend sees a 202 response, it can inform the user that the operation continues to run in the background.
Here is the UI change to go along with it: Figma
(I'll add this to the issue description as well)
@sharon-fdm there is now an additional UI element to this, may need re-estimation
Estimation for FE+BE: 2 pt
@getvictor I just encountered this error when attempting to delete 503 hosts from a local environment.
```json
{
  "message": "Bad request",
  "errors": [
    {
      "name": "base",
      "reason": "list of ids or filters must be specified"
    }
  ],
  "uuid": "b26e390d-6163-4b37-9450-07b7c6c3934f"
}
```
Second PR has this looking good!
Bulk delete mishap,
Now in the cloud, hosts sleep,
Fleet ensures no lapse.
Fleet version: fleet-v4.37.0-132-gb6f072775-dirty
Web browser and operating system: Safari 16.2 (OS: n/a)
💥 Actual behavior
When deleting a large number of hosts, the waiting modal will eventually break but the host deletions will continue in the background.
🧑💻 Steps to reproduce
1. Enroll a large number of hosts (several hundred to a few thousand).
2. On the Hosts page, select all hosts and delete them.
3. Wait roughly 1.5 minutes; the waiting modal breaks while the deletions continue in the background.
🕯️ More info (optional)
The video and screenshot were taken at different times, hence the host count being off between the two.
Seen below is a view of the modal breaking.
After the above break, the user can refresh the page and see that the delete action is continuing in the background.
https://github.com/fleetdm/fleet/assets/11012743/6c3beeec-2dc8-4e8b-91ad-7af0b87077f4
🛠️ To fix
Back-end:
Let's go with @getvictor's suggestion: add a custom timeout to the delete endpoint that returns a 202 (Accepted) response, in addition to the larger global timeout that returns a 503.
Front-end:
Update "Delete host" modal to show an additional message when a large number of hosts is selected: Figma