Closed tangrammer closed 4 years ago
inviting @finnfiddle @kardan @muloem @dlebrero to this party, since @iperdomo suggested you might be interested in slow networks and chatty-client scenarios like this one.
If this is common, we have a concurrency issue: the client requests are not ordered, so the latest "final" request from the client's point of view can end up being the first request from the backend's point of view.
I do think it's important to realise that one request is one user action, and the PUT payload includes what the client thinks is the state of the world. So from an application-state point of view, different ordering could mean that some odd things happen. I'm not sure whether the UI is updated based on the response, or whether the UI continues on its own path assuming the database is up to speed.
> I do think it's important to realise that one request is one user action & the PUT payload includes what the client thinks is the state of the world.
This needs to be changed. A user fiddling with the options/parameters for a visualization is an in-progress (draft) job; once they are happy with the state of the visualization, we can save it. We know this, we have suffered it in the Survey editor in Flow, and somehow that knowledge has not influenced the implementation decisions in Lumen. cc/ @muloem
@iperdomo I'm not fully sure, but as far as I know, Lumen needs to push data to the server to keep the graphics in sync on the client. So I don't know how useful a draft job would be with no "real" feedback about the state of our options/data. Anyway, I imagine you may be proposing to reduce interactions with the server to a minimum while keeping the same functionality 🤔 ... maybe indexing requests on the client side and throttling ... thinking aloud
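One way to read the "indexing requests on the client side" idea: tag every outgoing request with a monotonically increasing sequence number and drop any response that arrives after a newer one has already been applied. A minimal sketch (all names here are hypothetical, not Lumen's actual client code):

```typescript
// Minimal sketch: tag each outgoing request with a monotonically
// increasing sequence number and ignore responses that arrive out
// of order. Hypothetical helper, not Lumen's actual client code.
class LatestWins {
  private nextSeq = 0;      // sequence assigned to the next request
  private lastApplied = -1; // highest sequence whose response we applied

  // Returns the sequence number to attach to an outgoing request.
  send(): number {
    return this.nextSeq++;
  }

  // Called when the response for request `seq` arrives; returns true
  // if the response should be applied, false if it is stale.
  receive(seq: number): boolean {
    if (seq <= this.lastApplied) return false; // stale, ignore it
    this.lastApplied = seq;
    return true;
  }
}

// Simulate responses arriving out of order on a slow network.
const tracker = new LatestWins();
const first = tracker.send();
const second = tracker.send();
console.log(tracker.receive(second)); // latest response arrives first: true
console.log(tracker.receive(first));  // older response arrives late: false
```

This only fixes what the client renders; the backend would still need its own ordering guarantee (e.g. a version number, discussed below in this thread) to avoid persisting a stale PUT.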
@tangrammer why is that the case? I'm aware of that limitation when using maps, as Windshaft needs to generate the new tiles (PNG images) for the changed options. But why do we need to do the same for the rest of the visualizations?
@iperdomo I'm not sure it only affects maps, since every viz type depends on a backend SQL process. Even the bar type: https://github.com/akvo/akvo-lumen/blob/develop/backend/src/akvo/lumen/lib/aggregation/bar.clj
Bar type example: on the client side, for every option change, it sends a PUT request with the new form changes to /api/visualisations/vis-id, and then asks for the new viz data with a GET request to /api/aggregation/vis-id/bar?query=... in order to render the new viz.
I think the preferred approach from the UX team is to not have a "save" button.
We moved the machinery for visualisations to the backend. Now we aggregate on the backend and only send the data needed for rendering the visualisation to the client. The same is happening in the visualisation editor. So earlier we only did things in the client and persisted once the user was happy; today we do a PUT on every user interaction.
Implementing proper conflict handling (version number or hashing the visualisation spec), would that not solve this issue?
Conflict handling is an option, but make sure that both the client and the server handle it properly. Note that the server must understand which one is the latest version, so hashing will not work. We need some kind of monotonically increasing number, aka version number or timestamp (if we feel brave).
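A sketch of the monotonic-version idea suggested here, as an in-memory stand-in for the real PUT handler (names and shapes are hypothetical; an HTTP handler would return 409 Conflict instead of a string):

```typescript
// In-memory sketch of optimistic concurrency with a version number.
// Hypothetical stand-in for the real PUT handler and database.
interface StoredVis {
  version: number; // version of the last accepted write
  spec: object;
}

const store = new Map<string, StoredVis>();

// Accept a PUT only if it carries a newer version than what is stored;
// otherwise report a conflict (an HTTP handler would return 409).
function putVis(id: string, version: number, spec: object): "ok" | "conflict" {
  const current = store.get(id);
  if (current && version <= current.version) return "conflict";
  store.set(id, { version, spec });
  return "ok";
}

// Two racing PUTs: the newer version wins, and the stale one is
// rejected even though it arrives last.
console.log(putVis("vis-1", 2, { colour: "red" }));  // "ok"
console.log(putVis("vis-1", 1, { colour: "blue" })); // "conflict"
```

This is why hashing the spec is not enough: a hash tells you two specs differ, but not which one is newer, whereas a monotonically increasing version number gives the server a total order over writes.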
Also, just because the backend is doing the aggregations now, it doesn't mean that it needs to store the visualisation work in progress (WIP) in the same place as the "final" version, nor does it mean that it needs to store it at all.
Depending on requirements I see the following possible options:
Context
Full stack trace of the problem can be found here https://sentry.io/akvo-foundation/production/issues/753796246/
Research
Applying this filter in gcloud (pay attention to the time filter too), we get the following logs (JSON, CSV), stored in Google Drive.
We can see that there are 10+ identical requests within the same second.
Here are the PUT requests with 500 status:
Conclusions
Looking at the related SQL code (line 16, starting at the history.log_change function definition) and examining the logs attached above, we can infer that we hit a race condition on the database side that generates these errors, but with no further consequences: in other words, the end user didn't notice the backend error, nor was the database damaged. If the problem was a race condition, then we shouldn't call it an error but expected behaviour, and therefore we shouldn't report it to Sentry as an error. However, that improvement doesn't seem easy to accomplish, and the kind of corner case we hit this time (15 repeated PUT calls with the same timestamp) doesn't seem to be worth the effort of further improving the Sentry integration.
Anyway, this kind of chatty behaviour between client and backend on slow networks is an important topic that we should take care of.