fulcrologic / fulcro

A library for development of single-page full-stack web applications in clj/cljs
http://fulcro.fulcrologic.com
MIT License

FEATURE: Improve error support for dynamic code in dynamic routing #28

Closed · awkay closed this 7 years ago

awkay commented 7 years ago

Add some kind of hook for cases where we know things are possibly not recoverable: e.g., a network error caused dynamic routing code loading to fail. Basically, allow the program to transact, or js/alert, or something that lets the user know things are hosed and what can be done.

mitchelkuijpers commented 7 years ago

I found out that cljs/load returns a goog.async.Deferred (https://google.github.io/closure-library/api/goog.async.Deferred.html), which would make it easy to handle errors because you can add an error handler and a success handler. I am not sure, though, where we would add the hook for when a module load fails. Maybe a mutation that you can give to the route-to mutation.
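A minimal sketch of that idea, assuming (as noted above) that the module-load call hands back a goog.async.Deferred; `load-module` and `notify-failure!` are hypothetical placeholders, not an existing Fulcro or cljs.loader API:

```clojure
;; Sketch only: attach success/error handlers to the Deferred assumed to be
;; returned by a module load.
(defn load-with-error-hook!
  [load-module module-kw notify-failure!]
  (let [deferred (load-module module-kw)] ; assumed to be a goog.async.Deferred
    (.addCallbacks deferred
                   (fn [_result]
                     (js/console.log "module loaded:" (str module-kw)))
                   (fn [error]
                     ;; this is where a fallback mutation passed to route-to
                     ;; could be transacted to tell the user the load failed
                     (notify-failure! module-kw error)))))
```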

awkay commented 7 years ago

Yeah, I was noticing that the load manager has a way to register for events related to load failures as well, but the deferred might be a better mechanism since it could probably localize the problem to the specific triggering event.

Then again, this is more of a global problem...your app is dead if loading code failed. So, it almost seems like registering mutations under a specific well-known name might be the most general way to handle all kinds of global failures. Thoughts?

mitchelkuijpers commented 7 years ago

I just tried out what happens when a load fails: if a module fails to load because of network issues and you trigger it again once you have internet, it will just work. So maybe we need to add something to the route-to mutation, something along the lines of a fallback mutation, because you can recover from failing to load a module.

awkay commented 7 years ago

So I was just adding comments around that basic idea to the reference guide's section about networking in general. Because of the optimistic update model, an idempotent set of mutations means that if you write networking to just retry with exponential backoff, things will eventually clear up and continue properly. Unfortunately, in the general case, this cannot be automated...you have to make sure your mutations are idempotent.
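To illustrate the idempotency caveat, here is a minimal sketch using Fulcro's defmutation (from fulcro.client.mutations); the mutation names and state paths are hypothetical:

```clojure
;; Setting an absolute value is safe to retry; incrementing is not, because a
;; retried delivery would apply the change twice.
(defmutation set-like-count [{:keys [article-id n]}]
  (action [{:keys [state]}]
    ;; idempotent: replaying this any number of times yields the same state
    (swap! state assoc-in [:article/by-id article-id :article/likes] n))
  (remote [env] true))

(defmutation add-like [{:keys [article-id]}]
  (action [{:keys [state]}]
    ;; NOT idempotent: a blind network retry could count the same like twice
    (swap! state update-in [:article/by-id article-id :article/likes] inc))
  (remote [env] true))
```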

BUT, for code loading: you're right. There is nothing that gets "stuck" or screwed up by retrying a code load.

I was thinking of trying to do a better job of error handling in general. The global network error handler seems heavy-handed. I know D.N. had some plans around error handling that never got coded into Om Next, but I don't know what they were.

The issue really has to do with UI. Sure, we can retry. But if the network is down, showing something that tells the user what is going on is what I really care about.

My ideas at the moment are akin to the global error handler, or a well-defined mutation you can defmutation into existence to handle classes of errors.

I mean, certainly we can add a network retry to code load. In fact, isn't there already one? I seem to remember seeing Google's loader retrying on something. Either way, retry isn't the issue for me as much as letting the developer control the UI in response to the problem. From there, letting them trigger the desired recovery might be desirable as well.

awkay commented 7 years ago

So I talked to David Nolen about his (unimplemented) ideas around Om Next error handling. Here are his exact words:

" errors are just in the tree period. we get back the props/error tree from the server on the client we split apart the props tree and the error tree - preserving all structure (edited) props tree will be persisted error tree won’t, but we use it for error look ups when a component life-cycle methods fire, we’ll check the error tree to see if we need to pass error value this gives us the “one shot” we want and the error tree will disappear after render cycle is complete"

So, his ideas were around errors from query sources. In general, though, since data merge is all that queries use, and we can leverage merge for novelty at any time, it is possible that this mechanism could be used for more general error handling; however, he was explicit that he didn't consider "outage/network" kinds of errors to be the target of this error handling.

So, the basic implementation would be:

  1. A query goes out. Server(s) respond. A tree of data (possibly containing in-place errors) comes back. E.g., in principle something like

[:a :b] -> {:a 1 :server.error "b unavailable"}

  2. The merge functionality creates two trees: {:a 1} and {:server.error ...}. The data tree is merged (as normal) and the error tree is simply "remembered" in Om Next's internals.

  3. Render cycle: The parser generates a data tree to feed the UI. Om Next post-processes the data tree (your parser generates it, so Om Next can't step in before that point) and adds the error info back in. This prevents the errors from being something you have to support, persist, and clear in your app database. If you need to remember the error for some time period, you can use things like component local state.

  4. The errors are cleared (future renders will not see them, but your lifecycle methods could have cached them for multi-step processing).
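A rough sketch of step 2, splitting one response tree into a data tree and an error tree of the same shape; the :server.error marker is just the one from the example above, and none of this is an existing Om Next/Fulcro API:

```clojure
(defn split-errors
  "Returns [data-tree error-tree] for a server response, preserving structure."
  [tree]
  (cond
    (map? tree)
    (reduce-kv (fn [[data errors] k v]
                 (if (= k :server.error)
                   [data (assoc errors k v)]
                   (let [[d e] (split-errors v)]
                     [(assoc data k d)
                      (if (seq e) (assoc errors k e) errors)])))
               [{} {}]
               tree)

    (vector? tree)
    (let [pairs (mapv split-errors tree)]
      [(mapv first pairs) (mapv second pairs)])

    :else [tree nil]))

;; (split-errors {:a 1 :server.error "b unavailable"})
;; => [{:a 1} {:server.error "b unavailable"}]
```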

So, the "pending" error support in Om Next is, as one might expect, centered around the idea of the tree query out, tree response in. This doesn't explicitly give a path for dealing with the kinds of other errors (outage, network, etc); however, it does provide a consistent possible render model that is database agnostic and does not pollute your app state with errors.

It is possible that such support could be included in app state history (so time travel includes it).

From my perspective this is of limited utility for query problems that are not related to outages. If your UI sends a query that it isn't supposed to (auth or structure), then that is a bug. The vast majority of requests won't have errors in them...we're more concerned that they just won't succeed at all. The one caveat is micro-services: if n servers are involved in responding to a query, then it is possible that some parts of the tree response could be "missing", and it would be nice in those cases to be able to see those errors in situ.

So, now I'd like to explore how this might help in the more general problem set:

  1. Errors from mutations
  2. Errors due to outages (network, database, web servers) (e.g. typically detected by network layer)

Mutations come back with a response in the form {'mutation-symbol return-value}. Unless you've augmented the merge story, these values are discarded. There are two ways to indicate an error: throw an exception on the server (or otherwise cause a bad status code), or return a value that indicates an error. The former can be detected by Fulcro's global network error handler, but that centralizes all error handling for all mutations in one spot (and the idea here is that this kind of error typically means something went wrong enough that the user should probably just reload). The latter is gentler, but requires you to write merge code for each mutation that might return a value (typically as a multimethod). To date, this has not been the source of many reported issues, and may be sufficient.
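As a sketch of the "return a value" option: a multimethod keyed on the mutation symbol that folds selected return values into app state. Fulcro does expose a mutation-merge hook (mentioned again below), but the exact signature used here (state map, mutation symbol, return value) and the save-article example are assumptions for illustration:

```clojure
(defmulti merge-mutation-return
  "Fold a mutation's server return value into the app state."
  (fn [_state sym _return-value] sym))

(defmethod merge-mutation-return :default
  [state _sym _return-value]
  ;; default behaviour: discard return values, as described above
  state)

(defmethod merge-mutation-return `save-article
  [state _sym {:keys [error]}]
  (if error
    ;; record the failure somewhere the UI can render it from
    (assoc-in state [:ui/errors :save-article] error)
    state))
```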

The addition of "tree-based" error support would make it possible to have an alternative that somehow triggers a merge on mutations where the mutation can provide a query/tree combo that goes with some part of the UI. This would allow the mutation to target error messages that would be transient and would not require a presence in the database. From Fulcro's perspective (we have a known database format), the fact that errors are typically placed in app state doesn't seem to be particularly problematic.

Outage-based errors: In this case send detects the error, and it could technically "make up" a response tree in order to place transient errors that are targeted at components. In practice this seems like a lot of extra work for little gain. These kinds of errors are global errors. Your app isn't going to work until the outage is over. I don't see how the above mechanism really could be leveraged to add much.

Code loading problems (e.g. the dynamic routing issue that prompted this discussion) are really outage-based errors. Nothing is going to work right if the app can't talk to a server (unless you've coded it to work "offline", but that is an even more complex topic).

As was noted earlier in this issue: simply retrying the load will work once the network has re-appeared.

So, at the moment I'm not seeing any general improvements that could significantly help our overall error handling story. Addressing the present problem of route load failure is probably the best we can do here.

awkay commented 7 years ago

@mitchelkuijpers So, about your idea of something akin to a fallback: we haven't moved the route yet (it is in the pending route), so really, we could just clear the pending route. That would make it so that the user "clicking again" could retry.

claudiuapetrei commented 7 years ago

@awkay From my perspective:

The optimistic update focus is really giving me a hard time :)

For example, a pretty standard webapp flow:

Even if it's 2 failures out of 1000 requests, this approach seems strange:

Optimistic updates sound really awesome for some actions, e.g. liking an article or adding to a collection. But for others, the complexity added to do proper disaster recovery makes it seem to me like it's not really an option. And the fact that most of the time I have to implement a design and flow that was not built with optimistic updates in mind makes it even more problematic.

For some reason I mapped query = load data, mutation = update data. Only after talking on Slack did I realise that it should be perfectly reasonable on the server to have a mutations.clj that has defquery login or defquery create-article.

Personally I think defmutation & defquery on the server add a lot of confusion. The way I think about them is more like:

awkay commented 7 years ago

So, I don't want to get too far off track on other APIs, but I would comment on optimistic update a bit. The intention is to give the user a fast responsive experience. The default networking does not auto-retry mutations because it doesn't know if your mutations are safe to repeat (e.g. are idempotent). As I said earlier, an idempotent mutation set with auto retries can make a pretty robust system; however, you are correct that if you give the user the impression that you're done and they close the browser, then some data loss could occur in an unexpected way.

Have you used Google Docs? They do exactly this model: optimistic update with background save. You wouldn't want the UI blocking while you're typing something. The trick there is to let the user know when saves are actually complete. That way they can keep working, but can also see that there is network activity. This is a relatively painless UI change. Just have some clear indicator that work is still in progress, but that you're allowing them to continue. You can also block the browser tab/window from closing with an onWindowClose handler, to warn them that networking is in progress.
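For the "warn them before closing" part, a minimal sketch using the standard beforeunload browser event; `saves-in-flight?` is a hypothetical predicate over your own app state:

```clojure
(defn install-unsaved-work-guard!
  "Ask the browser to confirm tab/window close while saves are still pending."
  [saves-in-flight?]
  (.addEventListener js/window "beforeunload"
    (fn [event]
      (when (saves-in-flight?)
        (.preventDefault event)
        ;; some browsers require returnValue to be set to show the prompt
        (set! (.-returnValue event) "")))))
```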

If you must "block" the UI to ensure the save completes before continuing, then that is simple as well: just don't include the optimistic (local) side of the mutation! Send it to the server and follow it with a follow-on read (as you suggest, you could also do this with a parameterized query and post mutation). This will do the traditional "lock up the UI while networking happens".
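A sketch of that non-optimistic variant. `save-document` and :document are hypothetical names; `defmutation` (fulcro.client.mutations) and the transact!/follow-on read syntax are the normal calls, though exactly how you refresh afterward (follow-on read vs. an explicit load) is up to the app:

```clojure
(defmutation save-document
  "Remote-only: no `action` section, so nothing changes locally until the
   server result comes back and is merged/re-read."
  [params]
  (remote [env] true))

;; In the component, send the mutation and follow it with a read of the data
;; it affects, so the UI can show a loading state until the server answers:
;; (om.next/transact! this `[(save-document {:id 42}) :document])
```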

Finally, there is the actual error handling itself. There are a number of interesting things you could do to deal with this. One is time-travel :) You have the UI history. You could "transactional timestamp" the network requests with the history UUID, and if something fails, just back them up to the point in their application use! How cool is that?

I do agree with you that telling the user that something is complete when it isn't is probably going to lead to problems. However, rather than making it act like something old, I'd prefer to give them a better experience and inform them that it is something new (and more responsive).

claudiuapetrei commented 7 years ago

@awkay The transaction time-stamp idea is really really cool, thank you :)

I know it's not ideal... but for some projects not a lot of time is allocated, and this also implies that the people doing the design understand the concept of optimistic updates; unfortunately, that is not the case for most projects.

petterik commented 7 years ago

Reading this thread, I see that we've ended up thinking or doing similar things with om.next. Here's what we've done and maybe it can help in some way.

We've implemented the "transactional timestamp" for our plain om.next app and it's been working really well. I was inspired by this talk from Lee Byron: https://vimeo.com/166790294 (skip to 21:45).

Here are some thoughts:

For error messages, we have mutations returning messages that follow this protocol:

```clojure
(defprotocol IMutationMessage
  (message [this] "Returns the message")
  (final? [this] "true if the mutation message was returned from the server")
  (pending? [this] "true if we're still waiting for a server response.")
  (success? [this] "true if the server mutation was successful. False if an error occurred."))
```

A message is either :pending, :success, or :error. These messages are merged into the app state and can be retrieved either by mutation-key (to get a list of all messages for the mutation) or by a [mutation-key history-uuid] pair to get the message for an exact mutation.

For us it's always the component that fires the mutation that is interested in the result, so we've made a namespace to make this easier. In this namespace there's a function with signature (om-transact! [component tx]) for transacting a mutation and adding the history-uuid to the component's state, so we can look up messages later with (all-messages [component mutation-key]), which finds all messages for a specific mutation key that the component has transacted. Since this is usually just one, we've also added (last-message [component mutation-key]).
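A rough sketch of the bookkeeping that description implies; the state shape and function names here are illustrative only, not their actual code:

```clojure
;; messages keyed by [mutation-key history-uuid], as described above
(defonce mutation-messages (atom {}))

(defn record-message! [mutation-key history-uuid message]
  (swap! mutation-messages assoc [mutation-key history-uuid] message))

(defn message-for [mutation-key history-uuid]
  (get @mutation-messages [mutation-key history-uuid]))

(defn all-messages
  "All messages for a mutation key, given the history uuids a component has
   stored from its own transactions."
  [transacted-uuids mutation-key]
  (keep #(message-for mutation-key %) transacted-uuids))

(defn last-message [transacted-uuids mutation-key]
  (last (all-messages transacted-uuids mutation-key)))
```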

We don't really handle network issues, code loading problems, or server unavailable problems yet, so I've got nothing to add there. If a user loses their internet connection, we loop forever in our send! until we've got an internet connection back. With optimistic updates and reads, that makes the app seemingly usable for as long as the user doesn't do a non-optimistic update. More can be done here for sure.

I hope this helps in validating or thinking about these ideas.

awkay commented 7 years ago

@petterik Great talk...that's pretty much the arch we've had for 2 years ;) Just still finding out what all you can do with it!

awkay commented 7 years ago

@petterik So, do you really find that the rebase behavior is necessary or even optimal? My perspective is that the UI should already know (in advance) if a mutation should be "legal"...e.g. failure of a mutation should either be a bug (you should not have let the user issue it) or an outage-related event (network fail, server outage, etc.).

I also wonder about the limits of optimistic update on lag. The user is eventually going to do something that needs a server read. If that gets backed-up, then letting them continue to queue up more and more seems kind of dangerous...they might end up waiting through 10 minutes of backed up stuff they don't care about (nav to A, delayed...so nav to B, delayed, etc...network comes back, now all those loads and such start processing).

I can't think of many cases where the optimistic update is going to be "off". In the talk, he didn't realize that Om Next invented a way to handle the fact that IDs were not available on creation (he mentions waiting to get the ID back). Fulcro augmented Om Next by ensuring that tempids that appear in the network queue get rewritten on tempid remappings...so you can actually queue a create and update in sequence and have them work in complete optimistic mode.
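A sketch of that scenario: queueing a create and an update against the same temporary ID. `create-item` and `rename-item` are hypothetical mutations; om.next/tempid and transact! are the standard calls, and the rewriting of queued parameters on tempid remap is the Fulcro behaviour described above:

```clojure
(ns example.tempids
  (:require [om.next :as om]))

(defn create-then-rename! [this]
  (let [id (om/tempid)]
    ;; both transactions are applied optimistically and queued for the server;
    ;; when the server returns a tempid remapping for `id`, the queued
    ;; rename-item parameters are rewritten to use the real ID
    (om/transact! this `[(create-item {:id ~id :name "Draft"})])
    (om/transact! this `[(rename-item {:id ~id :name "Final"})])))
```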

Thus, in theory, it seems to me that a Fulcro/Om Next app with these features (tempid remapping extended to the network queue) should issue a sequence of legal operations, which it has correctly applied locally. If there is a need to wait for a read, then you block the UI with a loading marker. Not much you can improve there.

In the talk, the assumption is that your mutations might return something different from what you'd done optimistically. I think this is wrong (as in I would consider that a bad implementation). If you're unsure of the server state after a mutation, your app should be implemented with a follow-on remote read (in queue) to ensure that the app state becomes up-to-date.

I agree that his model is technically correct, but it's harder to reason about because your app history is no longer clean in the sense of linear progression. The rebasing causes all sorts of churn in history (just as it does in git). I guess you could bypass recording these rebases, but then you have to consider that the app state changes in weird ways that are not directly traceable from the mutation history (e.g. at some snapshot in time you see A change, due to an async rebase, but only mutations that were changing B ran in that time span). Now you're back in async problem space.

With a simple follow-on read model the network queue and mutation history maintain a nice linear progression. Technically, a merge handler for mutations could accept data as return values (which fulcro gives you a hook for), and that doesn't hurt anything either (the post-mutation merge is a linear progression step).

If we take my earlier assertion: mutations should not fail unless there is an outage-level event, then we can do things in a much simpler fashion:

  1. Make send auto-loop with exponential backoff iff it is a network-related (recoverable) error (a sketch follows this list)
  2. Any other kind of error should clear the remaining operations in the network queue, and "rewind" application history to the point of "last known good".
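A minimal sketch of point 1, assuming a `send-once!` function that calls back with either a result or a network-level error; none of this is Fulcro's actual networking API:

```clojure
(defn send-with-backoff!
  "Retry a request with exponential backoff while the failure is a recoverable
   network error; any other outcome is handed to `on-done` untouched."
  [send-once! request on-done]
  (letfn [(attempt [delay-ms]
            (send-once! request
              (fn [{:keys [network-error?] :as response}]
                (if network-error?
                  ;; recoverable: wait, then retry with a doubled (capped) delay
                  (js/setTimeout #(attempt (min 30000 (* 2 delay-ms))) delay-ms)
                  (on-done response)))))]
    (attempt 500)))
```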

Now, what does this look like in (practical) reality?

Yes, the user potentially loses work...but there's not much you can really do about it if it wasn't a network-related recoverable operation. If you've cached the mutation history (for replay), then technically the dialog can allow the user to at least retry the sequence. The assumption here is that a server error (not network related...maybe db was down) could produce a problem that you could (eventually) resolve, but you would not want to completely automate (in case the user either decided to give up or it was obvious things were hosed).

You have the UI history, so you can "back up". You can record the transaction history, so you have a replay buffer (both of which you're essentially forking the second you start your "retry" dialogs).

Now you don't need any special mutation "protocol" at all: network errors are network errors. Security, outage, and bugs are exceptions (server 500). The former are auto-retry, the latter are possibly auto-retry a few times, but if that's not working you can give the user an option of how to proceed.

If it is a true security problem, then who cares: it's probably a hacker. 500 doesn't hurt your real users. If it's a true outage, then you're giving them the best possible handling of it: retry until you're satisfied things are too screwed.

Thoughts?

awkay commented 7 years ago

OK, so I changed the implementation to retry a route's loads until such time as they succeed, or the user changes the route. The ModuleManager automatically retries for about 30s. I added a little exponential back-off and an attempt count, but I didn't enforce a count limit, and the back-off is more of a safety measure (in case goog ModuleLoader changes how it handles a failure).
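Roughly, the retry loop looks like the sketch below; `try-load!` and `pending-route-module` are hypothetical helpers standing in for the actual routing internals, and (matching the description above) the attempt count is tracked but no limit is enforced:

```clojure
(defn retry-route-load!
  "Keep retrying a failed module load with exponential backoff, giving up only
   when the user has routed somewhere else in the meantime."
  [app module-kw try-load! pending-route-module]
  (letfn [(attempt [n delay-ms]
            (try-load! module-kw
              (fn [loaded?]
                (when (and (not loaded?)
                           (= module-kw (pending-route-module app)))
                  (js/setTimeout
                    #(attempt (inc n) (min 30000 (* 2 delay-ms)))
                    delay-ms)))))]
    (attempt 1 1000)))
```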

I'm going to move the other conversation about general error handling improvements to a new issue.

awkay commented 7 years ago

Please continue the general error-handling discussion in #34