getodk / central

ODK Central is a server that is easy to use, very fast, and stuffed with features that make data collection easier. Contribute and make the world a better place! ✨🗄✨
https://docs.getodk.org/central-intro/
Apache License 2.0

Uploading 100k Entities, frontend reports 504 and stays on preview screen, upload has actually succeeded #691

Open lognaturel opened 4 months ago

lognaturel commented 4 months ago

Problem description

I uploaded https://drive.google.com/file/d/1y2Z9ZwHcX60FRW5F2vxbolooj3F6-bgY/view?usp=drive_link which has 100k Entities. After a minute, I saw a 504 at the top of the modal with the append button. I exited the modal and saw my Entities were successfully created.

URL of the page

https://staging.getodk.cloud/#/projects/93/entity-lists/entities_100k/entities

Expected behavior

~In the case of a 504, I think we should close the modal if possible. Because there's no duplicate detection, a user is at high risk of uploading the same Entities twice and then they're stuck with them.~

Alternatively, could the server send back something to say it's still working on it?

Central version shown in version.txt

versions:
e49518adb84f88d7bc6c3626fc77584dfc935435 (v2024.1.0-6-ge49518a)
+988780e2d439894ef6e5af15692a8916c7c8d8e5 client (v2024.1.0-23-g988780e2)
+fb96aa3333d3442ea43365755766262f21de3969 server (v2024.1.0-23-gfb96aa33)

Browser version

Around when did you see the problem (in UTC)?

Other notes (if any)

matthew-white commented 4 months ago

If the upload succeeded, then I don't think that Backend itself would have returned an error response. If it had, the database transaction should have been rolled back. I think that nginx is returning a 504 after a set amount of time without waiting for Backend to finish.

> After a minute, I saw a 504

Could it have been 2 minutes? That's the timeout nginx is configured with. Was the error message "Something went wrong: error code 504."? If so, that's another sign that it's nginx. An error from Backend would be more specific.
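To illustrate the reasoning above, here is a minimal sketch of how a client might tell the two cases apart. It is not Central's actual frontend code. The `FailedResponse` shape and the `describeFailure` name are hypothetical, and the sketch assumes that Backend errors arrive with a JSON body containing a `message`, while a proxy timeout arrives without one.

```typescript
// Hypothetical shape of a failed HTTP response as seen by the client.
interface FailedResponse {
  status: number;
  // Parsed JSON body if there was one; null for a non-JSON error page.
  body: { message?: string } | null;
}

// Guess where an error came from and build the text to show the user.
// Assumption: only Backend attaches a specific message to its errors.
function describeFailure(
  res: FailedResponse
): { likelySource: 'backend' | 'proxy'; text: string } {
  if (res.body !== null && typeof res.body.message === 'string') {
    return { likelySource: 'backend', text: res.body.message };
  }
  // No specific message: fall back to the generic text quoted in this issue.
  return {
    likelySource: 'proxy',
    text: 'Something went wrong: error code ' + res.status + '.'
  };
}

const proxyTimeout = describeFailure({ status: 504, body: null });
// proxyTimeout.text is 'Something went wrong: error code 504.'
```

A generic message is only a hint. The absence of a matching entry in the service log is the stronger signal that the response never came from Backend.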

If it's nginx, here are a couple of ideas for how to address it:

I'm not sure that closing the modal would necessarily help, because the user could just reopen the modal or even refresh the page in order to try again. Backend would still be working on the original request. getodk/central#785 is another example of how Backend can be working on concurrent requests even after a 504 response.

lognaturel commented 4 months ago

> "Something went wrong: error code 504."

Yes, exactly. I’m quite sure it’s nginx: there was nothing in the service log.

Trickling a response like the backup endpoint would make a lot of sense.

I’m less sure of the implications of modifying the timeout.

matthew-white commented 4 months ago

> I’m less sure of the implications of modifying the timeout.

I feel like we've considered this idea before, though I don't remember why we didn't make this change. How long does it take Postgres to time out? Would it be reasonable to change the nginx timeout to match the Postgres timeout, at least for non-GET requests?

> Trickling a response like the backup endpoint would make a lot of sense.

With the backup endpoint, we trickle random data that winds up in the backup .zip file. I don't think we'd want to return random data or a .zip file from the upload endpoint. But I think the current response from the upload endpoint is {"success":true}, and we could trickle that out, returning a character every minute or so. Maybe that would be the easiest change to make.
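As a rough sketch of the trickling idea, the function below plans which characters of the response body would be written on each heartbeat while the upload is still being processed. It is only an illustration under stated assumptions: `planTrickle` and its parameters are hypothetical names, the heartbeat interval is left to the caller, and nothing here reflects how the backup endpoint is actually implemented.

```typescript
// Plan the chunks written to the response for a long-running request.
// heartbeats: how many heartbeat ticks elapse before the work finishes.
// Returns the chunks in the order they would be sent.
function planTrickle(body: string, heartbeats: number): string[] {
  const chunks: string[] = [];
  // Always hold back at least the final character until the work is done,
  // so the client never sees a complete body for an unfinished request.
  const maxEarly = Math.max(body.length - 1, 0);
  const early = Math.min(Math.max(heartbeats, 0), maxEarly);
  for (let i = 0; i < early; i += 1) {
    chunks.push(body.charAt(i)); // one character per heartbeat
  }
  // Work finished: flush everything that has not been sent yet.
  chunks.push(body.slice(early));
  return chunks;
}

const responseBody = JSON.stringify({ success: true });
const plannedChunks = planTrickle(responseBody, 3);
// plannedChunks is ['{', '"', 's', 'uccess":true}']
```

Two caveats apply to this sketch. The body is only 16 characters long, so at one character per heartbeat it can cover at most 15 heartbeats before there is nothing left to send early. Also, once the first character has been written, the status code has already gone out, so a failure later in the request could no longer be reported as an error status; holding back the tail of the body at least leaves the client with invalid JSON in that case.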


For some reason, I'm a little surprised that an upload request would take as long as you're seeing. Maybe we knew that already and I'm just forgetting. 😅 Once we allow the request to take more than 2 minutes, I'm wondering whether we should do more to signal to the user that the request really is in progress and that they definitely shouldn't refresh the page and try again. For example, after 30 seconds or a minute, we could change "Processing file..." to "Still processing...", or we could show an alert that mentions "don't refresh".
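The escalating status text could be sketched as a pure function of elapsed time. This is an illustration rather than Central's frontend code: `processingStatus`, the two constants, and the choice to show the alert at the one-minute mark are all assumptions, picked from the "30 seconds or a minute" range mentioned above.

```typescript
// Illustrative thresholds (assumptions, not values from Central).
const STILL_PROCESSING_AFTER_MS = 30 * 1000;
const DO_NOT_REFRESH_AFTER_MS = 60 * 1000;

interface ProcessingStatus {
  text: string;
  // Whether to also show an alert telling the user not to refresh.
  showDoNotRefresh: boolean;
}

// Choose what the upload modal shows, given how long the request has run.
function processingStatus(elapsedMs: number): ProcessingStatus {
  return {
    text: elapsedMs < STILL_PROCESSING_AFTER_MS
      ? 'Processing file...'
      : 'Still processing...',
    showDoNotRefresh: elapsedMs >= DO_NOT_REFRESH_AFTER_MS
  };
}

const afterFortyFiveSeconds = processingStatus(45 * 1000);
// afterFortyFiveSeconds.text is 'Still processing...'
// afterFortyFiveSeconds.showDoNotRefresh is false
```

Keeping the decision in a pure function means a timer in the component only has to supply the elapsed time, which makes the thresholds easy to test and adjust.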