akvo / akvo-lumen

Make sense of your data
https://akvo.org/akvo-lumen
GNU Affero General Public License v3.0

Tardis, history.log_change PSQLException: range lower bound must be less than or equal to range upper bound #1754

Closed: tangrammer closed this issue 4 years ago

tangrammer commented 5 years ago

Context

The full stack trace of the problem can be found here: https://sentry.io/akvo-foundation/production/issues/753796246/

org.postgresql.util.PSQLException: ERROR: range lower bound must be less than or equal to range upper bound
  Where: PL/pgSQL function history.log_change() line 16 at IF
  File "visualisation_impl.clj", line 40, in akvo.lumen.lib.visualisation-impl/upsert
    (let [v (upsert-visualisation tenant-conn
  File "visualisation_impl.clj", line 39, in akvo.lumen.lib.visualisation-impl/upsert
    (defn upsert [tenant-conn body jwt-claims]
  File "visualisation.clj", line 23, in akvo.lumen.lib.visualisation/upsert
    (impl/upsert tenant-conn body claims))
  File "visualisation.clj", line 20, in akvo.lumen.lib.visualisation/upsert
    (defn upsert
  File "visualisation.clj", line 33, in akvo.lumen.endpoint.visualisation/[fn]
    (visualisation/upsert tenant-conn (assoc body "id" id) jwt-claims))
...
(98 additional frame(s) were not displayed)

ERROR: range lower bound must be less than or equal to range upper bound
  Where: PL/pgSQL function history.log_change() line 16 at IF

Research

Applying this filter in gcloud (note the time filter as well):

[screenshot: gcloud log filter, 2018-11-06 13:05:25]

we get the following logs (JSON, CSV), stored in Google Drive.

We can see that there are 10+ identical requests within the same second:

[screenshot: identical requests, 2018-11-06 13:39:50]

These are the PUT requests that returned a 500 status.

Conclusions

Looking at the related SQL code (line 16 of the history.log_change function definition) and examining the logs attached above, we can conclude that there is a race condition on the database side that generates these errors, but without further consequences: the end user never noticed the backend error, and the database was not damaged.

Now, if the problem is a race condition, strictly speaking we shouldn't call it an error but expected behaviour, and therefore we shouldn't report it to Sentry as an error. However, that improvement does not look easy to accomplish, and the kind of corner-case scenario we hit this time (15 repeated PUT calls with the same timestamp) does not seem to justify the effort of improving the Sentry reporting side.
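Although we don't paste the function here, a minimal sketch of how a temporal trigger of this shape typically maintains a tstzrange validity column may help to see the race; the function, table and column names below are assumptions for illustration, not the actual history.log_change source:

```sql
-- Hypothetical sketch, NOT the actual history.log_change source.
-- Temporal triggers of this shape close the current row's validity
-- range at now(), and now() in PostgreSQL is the transaction START time.
CREATE FUNCTION log_change_sketch() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'UPDATE' THEN
    -- The race: transaction B starts at t1, transaction A starts at t2 > t1.
    -- A wins the row lock, writes _validity = tstzrange(t2, NULL), commits.
    -- B then executes this branch with now() = t1 and tries to build
    -- tstzrange(t2, t1), i.e. lower > upper, which raises exactly
    -- "range lower bound must be less than or equal to range upper bound".
    INSERT INTO history_table (id, body, _validity)
    VALUES (OLD.id, row_to_json(OLD), tstzrange(lower(OLD._validity), now()));
    NEW._validity := tstzrange(now(), NULL);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```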

In any case, this kind of chatty behaviour between client and backend on slow networks is an important topic to take care of.

tangrammer commented 5 years ago

Inviting @finnfiddle, @kardan, @muloem and @dlebrero to this party, since @iperdomo suggested you would be interested in slow-network and chatty-client scenarios like this one.

dlebrero commented 5 years ago

If this is common, we have a concurrency issue: client requests are not ordered, so the latest "final" request from the client's point of view can end up being the first request from the backend's point of view.

kardan commented 5 years ago

I do think it's important to realise that one request is one user action, and the PUT payload includes what the client thinks is the state of the world. So from an application-state point of view, different ordering could mean that some odd things happen. I'm not sure if the UI is updated based on the response, or if the UI continues on its own path assuming the database is up to speed.

iperdomo commented 5 years ago

I do think it's important to realise that one request is one user action & the PUT payload includes what the client thinks is the state of the world.

This needs to be changed. A user fiddling with the options/parameters of a visualization is an in-progress (draft) job; once she/he is happy with the state of the visualization, we can save it. We know this, we have suffered it in the Survey editor in Flow, and somehow that knowledge has not influenced the implementation decisions in Lumen. cc/ @muloem

tangrammer commented 5 years ago

@iperdomo I'm not fully sure, but as far as I know Lumen needs to push data to the server to keep the graphics in the client in sync, so I don't know how useful a draft job would be with no "real" feedback about the state of our options/data. Anyway, I imagine you may be proposing some way of reducing the interactions with the server to a minimum while keeping the same functionality 🤔 ... maybe indexing requests on the client side and throttling them ... thinking aloud

iperdomo commented 5 years ago

@tangrammer why is that the case? I'm aware of that limitation when using maps, as Windshaft needs to generate the new tiles (PNG images) for the changed options. But why do we need to do the same for the rest of the visualizations?

tangrammer commented 5 years ago

@iperdomo I'm not sure it only affects maps, since any viz depends on a backend SQL process. Even bar types: https://github.com/akvo/akvo-lumen/blob/develop/backend/src/akvo/lumen/lib/aggregation/bar.clj

Bar type example: on the client side, for every option change, it sends a PUT request with the new form changes to /api/visualisations/vis-id and then asks for the new visualisation data with a GET request to /api/aggregation/vis-id/bar?query=... in order to render the new viz.
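(The actual query is generated in that bar.clj namespace from the visualisation spec; the SQL below is only a hypothetical illustration of the shape of work each such GET triggers on the backend.)

```sql
-- Hypothetical illustration only; the real query is generated by
-- akvo.lumen.lib.aggregation.bar from the visualisation spec.
-- Every option change in the editor re-runs an aggregation like this.
SELECT category_column    AS bucket,
       sum(metric_column) AS value
  FROM dataset_table
 GROUP BY category_column
 ORDER BY value DESC;
```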

dlebrero commented 5 years ago

I think the preferred approach from the UX team is to not have a "save" button.

kardan commented 5 years ago

We moved the machinery for visualisations to the backend. Now we aggregate on the backend and only send the data needed for rendering the visualisation to the client. The same happens in the visualisation editor. So earlier we only did things in the client and then persisted when the user was happy; today we do a PUT on every user interaction.

kardan commented 5 years ago

Would implementing proper conflict handling (a version number, or hashing the visualisation spec) not solve this issue?

dlebrero commented 5 years ago

Conflict handling is an option, but make sure that both the client and the server handle it properly. Note that the server must understand which one is the latest version, so hashing will not work. We need some kind of monotonically increasing number, aka version number or timestamp (if we feel brave).
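As a sketch of the version-number idea (table and column names below are assumptions, not Lumen's actual schema): the client echoes back the version it last read, and the UPDATE only succeeds if that version is still current; zero updated rows means a conflict the client has to resolve.

```sql
-- Hypothetical sketch of optimistic locking with a version column;
-- table and column names are assumptions, not Lumen's schema.
ALTER TABLE visualisation ADD COLUMN version bigint NOT NULL DEFAULT 1;

-- The client sends back the version it last read (:expected-version).
-- Zero rows updated => someone else saved in between => reply 409 Conflict.
UPDATE visualisation
   SET spec    = :new-spec,
       version = version + 1
 WHERE id      = :id
   AND version = :expected-version
RETURNING version;
```

A conditional update like this also covers the reordering problem described above: a stale PUT arriving late fails the version check instead of silently overwriting newer data.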

Also, just because the backend is doing the aggregations now, it doesn't mean that it needs to store the visualisation work in progress (WIP) in the same place as the "final" version, nor does it mean that it needs to store it at all.

Depending on requirements I see the following possible options:

  1. No WIP concept, no save button: each client edit overwrites the "final" visualisation in the backend. Both client and server must handle conflicts.
  2. WIP concept: the user edits the visualisation until he is happy, then "saves" it to the "final" version.
     2.1. WIP stored in client: the client stores the WIP in the browser's local storage. The client will still send those PUT requests to the backend, but the backend will just reply without storing to the DB. Note that the client must still handle conflicts.
     2.2. WIP stored in backend: the backend stores the WIP in some new table (see the sketch after this list). Both client and server must handle conflicts.
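For option 2.2, a minimal sketch of what such a WIP table could look like, with hypothetical names (visualisation_wip, spec); the draft lives outside the versioned "final" table, so editing never touches history.log_change until the explicit save:

```sql
-- Hypothetical sketch for option 2.2; all names are assumptions.
-- One draft row per (visualisation, author), overwritten on every edit
-- and promoted to the "final" table only on explicit save.
CREATE TABLE visualisation_wip (
    visualisation_id text        NOT NULL,
    author           text        NOT NULL,
    spec             jsonb       NOT NULL,
    updated_at       timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (visualisation_id, author)
);

-- Each PUT during editing becomes a cheap upsert, no history trigger involved:
INSERT INTO visualisation_wip (visualisation_id, author, spec)
VALUES (:id, :author, :spec)
ON CONFLICT (visualisation_id, author)
DO UPDATE SET spec = EXCLUDED.spec, updated_at = now();
```
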
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.