RFC: adding built-in error handling support to zones

domenic commented 8 years ago

This is a request for comments, especially from the Node community (/cc @bmeck @trevnorris; please bring in others). Our original plan of leaving error handling out of the base zone primitive, and letting it be handled entirely by host environments or future spec extensions, is starting to show some cracks. This is mainly because it impacts the recommendations around how to "wrap" functions for zones. We would like the base zones proposal to have a strong recommendation: "to properly propagate zones while doing async work, do X."

The problem

Currently our recommendation for how to propagate zones is "use Zone.current.wrap", but @mhevery has shown me how that doesn't quite work in certain scenarios once error handling is introduced. In most cases of user-mode queuing, you want wrapped functions to send any errors to the relevant zone. However, in some cases, mainly when building control flow abstractions like promise libraries, you want to handle errors yourself.

So, if we later introduced error handling, some fraction of the uses of Zone.current.wrap would be wrong.

Similar problems apply to zone.run, since one of the important uses of run for library authors is to manually do wrapping (i.e., store the callback and current zone, then do storedZone.run(storedCallback), instead of storing the wrapped callback).

The solution

The way to fix this is to make the two use cases explicit. Instead of wrap and run, we have wrap, wrapGuarded, run, and runGuarded, where the non-guarded variants are used when the library explicitly wants to handle thrown exceptions: like for implementing promises or observables or similar, where you transform thrown exceptions into a different form.

However, it's pretty pointless to introduce these two functions if the zones proposal doesn't actually have error handling built-in. So this takes us down the path of introducing error handling into the base zone primitive, instead of saving it for a future extension.

In other words, keeping things in a future extension is fine, unless that requires you to change your code now. If you require people to change their code now, you might as well give them the benefits (zone error handling) that you're asking them to pay for.

Details of error handling proposal

The zone fork options get a handleError option, which takes a thrown error object:

const z = Zone.current.fork({
  name: 'http request zone',
  handleError(e) {
    sendAnalytics(e);
    return true; // error handled. Or, should we just make you rethrow to indicate not-handled?
  }
});

This is pretty simple. The trick is then figuring out when and how we should route errors to the error handler. To discuss that, we need to talk about the “guarded” functions introduced above.

Details of {wrap,run}{Guarded}

Currently, we have (essentially)

Zone.prototype.run = function (f) { // current
  const oldZone = Zone.current;
  setCurrentZone(this); // privileged API
  try {
    return f();
  } finally {
    setCurrentZone(oldZone);
  }
};

Zone.prototype.wrap = function (f) {
  const thisZone = this;
  return function () {
    return thisZone.run(() => f.apply(this, arguments));
  };
};

We would then introduce:

Zone.prototype.runGuarded = function (f) { // new
  const oldZone = Zone.current;
  setCurrentZone(this); // privileged API
  try {
    return f();
  } catch (e) {
    // actually stored in an internal slot, not a _-prefixed property
    if (!this._handleError || !this._handleError(e)) {
      throw e;
    }
  } finally {
    setCurrentZone(oldZone);
  }
};

Zone.prototype.wrapGuarded = function (f) {
  const thisZone = this;
  return function () {
    return thisZone.runGuarded(() => f.apply(this, arguments));
  };
};

How to think about these

The TL;DR is: most libraries use wrapGuarded. Most apps use run.

In a bit more detail: the user code and the library collaborate to figure out how errors are handled, with the following inputs:

Does the library know how to handle errors? If so, it will use wrap (or run), in combination with a try/catch at the call site. Otherwise (most cases, like setTimeout or event listeners), it will use wrapGuarded (or runGuarded), to say: “I don’t know how to handle the error, and if I let it propagate up the call site it would just immediately reach top level, so instead let’s route errors to the right zone.”
Does the user code want the zone to handle errors? If so, it will use runGuarded to kick things off. Otherwise, it will use run, and let the error propagate up the call stack as usual. Using runGuarded in this way is fairly unusual; it’s a kind of “async try/catch” and programs are generally better served by letting the error bubble.
Example in action

This example shows how, if you follow the TL;DR above, everything “just works”:

/////// LIBRARY CODE

sql.addListener = function (f) {
  this.storedFunction = Zone.current.wrapGuarded(f);
};

sql.doStuff = function () {
  this.storedFunction();
};

/////// APP CODE

const rootZone = Zone.current;

const zone1 = rootZone.fork({
  handleError: handleError1
});
const zone2 = rootZone.fork({
  handleError: handleError2
});

zone1.run(function a() {
  sql.addListener(function b() {
    rootZone.run(function c() {
      throw new Error("boo");
    });
  });
});

zone2.run(function d() {
  setTimeout(function e() {
    sql.doStuff();
  }, 0);
});

At the time the error is thrown, the call stack is:

c (top)
rootZone.run
b
this.storedFunction (wrapper around b to run it in zone1)
sql.doStuff
`e
wrapper around e to run it in zone2 (generated by setTimeout)
setTimeout task (bottom)

Error propagation and handling then occurs like so:

The error is not caught by rootZone.run (run does not handle errors at all).
The error is next caught by this.storedFunction, which is a wrapper around b to run it guarded in zone1. That sends it to errorHandler1.
If errorHandler1 doesn't return true, the error is next caught by the wrapper around e to run it in zone2. So it's sent to errorHandler2.
If even errorHandler2 doesn't try it, we call the error unhandled, and it goes to window.onerror or "uncaughtException" as usual.
Issues to discuss

Most importantly: does this sound like something that is acceptable to potential zone-using communities? We’d like to have everyone on board, and we spent a lot of time trying to get the details right here (drawing on things like the domain module postmortem from Node.js), so hopefully it’s not that bad.

Less important issues:

How should unhandled/handled be done? I did returning true for handled, false for unhandled, but I think another plausible design is that it’s handled by default, and you rethrow the error if you want it to be unhandled. (Assume for the purposes of discussion that we properly specify Error.prototype.stack as being captured at error creation time, not at error throwing time, so rethrowing does not hurt stack information.)
Should we consider only adding run + wrapGuarded, since those are the “happy path”? I think it’s weird to have an asymmetric API like that, and there are definitely use cases for the other two. But on the other hand, you can derive runGuarded from wrapGuarded, and derive wrap from run, so it’s not necessary.

indexzero commented 8 years ago

This is a lot to digest on a Friday afternoon (when I saw this). I'm back in NYC now, perhaps discuss next week IRL? As you mentioned this sounds a lot like domains which basically amounts to:

morpheus-dream

... hoping we can avoid those pitfalls here.

trevnorris commented 8 years ago

Landed the "official" domain postmortem today (https://github.com/nodejs/node/commit/4a74fc9). It has additional information and examples. I'm in the middle of writing a post for this issue that has Zone specific usage. There are some key differences that avoid some of the pitfalls of domains, and I'm still investigating how other different features would be effected.

trevnorris commented 8 years ago

Been writing up examples to make sure I understand the implications of the API, and have some questions before I can give a full assessment.

How are implementers supposed to know whether to run the callback in run() or runGuarded()? Currently propagating the correct Zone and then calling the associated callback within the zone is the responsibility of the implementer. But there's no indication I can see that will tell the user how the callback is supposed to be run. For example:

function DoStuff() {
  this._queue = [];
}

DoStuff.prototype.addone = function addone(fn) {
  this._queue.push({ zone: Zone.current, callback: fn });
};

DoStuff.prototype.runall = function runall() {
  while (this._queue.length > 0) {
    const item = this._queue.shift();
    const zone = item.zone;
    const callback = item.callback;
    zone.run(callback);  // Or should this be runGuarded?
  }
};

const ds = new DoStuff();

Zone.current.fork({ name: 'z1', handleError: (e) => console.error(e.message) })
    .run(() => ds.addone(() => console.log('d1')));

Zone.current.fork({ name: 'z2' })
    .runGuarded(() => ds.addone(() => console.log('d1')));

Here z1 has a handleError, but only calls using run(). So how is the implementer to know which to use? I can't see this defined anywhere.

trevnorris commented 8 years ago

It should also be mentioned that error routing is one reason why domains failed. Users are allowed to create a private object that prevents bubbling of other's exceptions.

///// LIBRARY /////

const libZone = new Zone({
    name: 'libZone',
    handleError(e) {
      // Log error and continue happily.
      return true;
    },
});

function ConnectDB() {
  // Make connection to database.
}

ConnectDB.prototype.query = function query(options, callback) {
  const obj = { options, callback, zone: Zone.current };
  libZone.fork().run(() => makeRequest({ obj, completed }));
};

function completed(obj, data) {
  const result = parseData(data);  // Uh oh. Bad code here that throws.
  obj.zone.run(() => obj.callback(result));
}

///// USER CODE /////

const zone = Zone.current.fork({ /* handleError 'n stuff */ });
const db = new ConnectDB();

zone.run(() => db.query({ /* options */ }, callback));

The difference between this and adding try/catch is that calling the user's callback from the handleError() is probably impossible (or even wanted?). Meaning the callback will never be called to let the user know their query completed. This leads to frustrating debugging scenarios.

You may say that the implementer is Doing It Wrong, and I'd agree with you. But the fact remains that it will be used like this, as were domains, and may possibly leave a bad taste in the developer's mouth preventing them from using it in the future, as it happened with domains.

In short, there are a lot of ways error routing can be abused by the user, and it's impossible to create an API that prevents it. Theoretically error routing is a great idea, but domains already went through that gauntlet and failed.

bmeck commented 8 years ago

@domenic just to be clear guarded functions are not applied to CallInZone? So promise reactions won't add a try/catch?

domenic commented 8 years ago

@trevnorris @bmeck sorry for the massive delay in response. I will be a lot more responsive going forward; many apologies.

How are implementers supposed to know whether to run the callback in run() or runGuarded()?

Basically: are you going to handle errors thrown by your callback with a try/catch, like a promise or observable library? Use run(). Otherwise, if your errors are just going to go to uncaughtException anyway, you should use runGuarded(), to let the appropriate zone get a crack at them.

But there's no indication I can see that will tell the user how the callback is supposed to be run. For example:

Whether you should use zone.run() or zone.runGuarded() here depends on how you expect ds.runall() to be used, but I would expect it to be runGuarded().

The situation where run() is appropriate is if you think callers of runall expect:

Any errors in any of the queued callbacks should stop all future callbacks in that runall from executing
And such errors will be propagated, so you can do try { ds.runall(); } catch (e) { ... } to catch that first-thrown-error.

Given that "DoStuff" is not a very real-world class, I can't be sure that your intention isn't the above, but I'd doubt it. More likely, runGuarded() is appropriate, as callers would not be expected to try/catch the runall() call, and callers probably prefer if execution continued without stopping on first throwing callback.

Here z1 has a handleError, but only calls using run(). So how is the implementer to know which to use? I can't see this defined anywhere.

Sorry if this wasn't clear. I tried to make it so, but it probably got lost in the wall of text.

You may say that the implementer is Doing It Wrong, and I'd agree with you. But the fact remains that it will be used like this, as were domains, and may possibly leave a bad taste in the developer's mouth preventing them from using it in the future, as it happened with domains.

In short, there are a lot of ways error routing can be abused by the user, and it's impossible to create an API that prevents it. Theoretically error routing is a great idea, but domains already went through that gauntlet and failed.

Yes, I'm aware of this concern, and that's the primary reason I wanted to ask Node people to review it. My position is that doing error routing correctly is indeed tricky and, in subtle cases, can definitely be done wrong. But I hope that simple advice will help, and coupled with fixing other issues domains have, it would still be worthwhile.

The alternative is to abandon error routing in zones entirely, which seems like a missed opportunity. If we want to add zones for all the other reasons (zone-local storage, async stack trace marking, async task tracking) then ideally we should be able to also solve the "something better than uncaughtException" problem at the same time, given all the machinery already in place. The problem is the extra cost of 4 methods (run/runGuarded + wrap/wrapGuarded) instead of 2, which I think is manageable, but wanted to see what others thought.

@bmeck

domenic just to be clear guarded functions are not applied to CallInZone? So promise reactions won't add a try/catch?

I don't really understand this question. What does "applied to CallInZone" mean? What does it promise reactions adding a try/catch mean?

bmeck commented 8 years ago

@domenic in your example function e is guarded by zone2 even though it is invoked via setTimeout, however I don't see how:

Zone.prototype.runGuarded = function (f) { // new
  const oldZone = Zone.current;
  setCurrentZone(this); // privileged API
  try {
    return f();
  } catch (e) {
    // actually stored in an internal slot, not a _-prefixed property
    if (!this._handleError || !this._handleError(e)) {
      throw e;
    }
  } finally {
    setCurrentZone(oldZone);
  }
};

Zone.prototype.wrapGuarded = function (f) {
  const thisZone = this;
  return function () {
    return thisZone.runGuarded(() => f.apply(this, arguments));
  };
};

Is enforcing the guard. Does this mean, all async queued tasks would fire CallInZone with a guarded vs unguarded flag like I mocked in https://gist.github.com/bmeck/05957f8721e9f41039fbb0f321fe943a ?

Essentially the question is, does the following code get a guard?

const zone1 = rootZone.fork({
  handleError: handleError1
});
zone1.run(function a() {
  Promise.resolve(0).then(function b() {
    // am I guarded by rootZone?
    throw Error();
  })
});

And does the following continue to get a guard?

const zone1 = rootZone.fork({
  handleError: handleError1
});
zone1.runGuarded(function a() {
  Promise.resolve(0).then(function b() {
    // am I guarded by zone1?
    throw Error();
  });
});

domenic commented 8 years ago

in your example function e is guarded by zone2 even though it is invoked via setTimeout, however I don't see how:

setTimeout uses the equivalent of wrapGuarded (or, stores the current zone at the time it's called, then uses runGuarded in that zone).

Essentially the question is, does the following code get a guard?

setTimeout and Promise.prototype.then are different, so this is a different question than your above one.

Promise.prototype.then knows that when the passed function is called, the promise machinery will handle the errors. So (essentially) it uses wrap, not wrapGuarded.

This the key difference I'm trying to get across in the post. If your library wants to catch errors and process them specially, like a promise implementation does, then use wrap/run. If you don't, then use wrapGuarded/runGuarded.

mhevery commented 8 years ago

The question of run/wrap{Guarded} is simply: Can the code do anything useful with the exception?

In the case of the library, only Promise and Observable have special semantics for exceptions and so they are the only libraries I can think of which should use wrapUnGuarded all other use cases should use wrapGuarded

In the case of application code entering zone, the application code should handle/fail fast on exception, and so there is almost never a reason to use runGuarded. All cases I can think of should be runUnGuarded.

So the rule of thumb is wrapGuarded / runGuarded for libraries, and run for application code entering zone. (Only exception to it is Promise and Observable)

bmeck commented 8 years ago

setTimeout uses the equivalent of wrapGuarded (or, stores the current zone at the time it's called, then uses runGuarded in that zone).

Should be using wrap, it is lib code to my understanding.

This the key difference I'm trying to get across in the post. If your library wants to catch errors and process them specially, like a promise implementation does, then use wrap/run. If you don't, then use wrapGuarded/runGuarded.

This is very confusing, use wrapGuarded if you want errors handled, but not specially?

The fact that setTimeout magically gets guarded can lead to more swallowed errors that are unsafe since the code being swallowed may have turned into corrupted internal state. This corrupted state generally causes a throw but could then be silenced accidentally. I think a flag is needed somehow to delineate what a behavior vs system error is for this not to have the flaws of domains.

I see no reason this cannot be implemented later? TC39 has this marked as ready to move to stage 1, this feature of error handling is not ready to be stage 1 as it basically is domains including the flaws of them (infectious behavior is a big one).

domenic commented 8 years ago

Should be using wrap, it is lib code to my understanding.

No; I'm not sure where you got this "library code should be using wrap" idea from. wrapGuarded is often more appropriate.

This is very confusing, use wrapGuarded if you want errors handled, but not specially?

No. Use wrapGuarded if you don't want to handle errors with try/catch (the guard will handle them).

The fact that setTimeout magically gets guarded can lead to more swallowed errors that are unsafe since the code being swallowed may have turned into corrupted internal state.

This is not true. The behavior of setTimeout is exactly the same here as it is in the existing specs (where setTimeout calls to report the exception). That is all that wrapGuarded does: it says, invoke the thing, but if it throws an error, hand it to the current zone for exception handling.

The root zone's exception handling is just HTML's "report the exception" (or Node's uncaughtException), so unless you introduce intermediate zones with their own error handling, that will always be triggered. There is no swallowing.

I see no reason this cannot be implemented later?

Hmm, I thought I tried to make that extremely clear in the OP. It needs to be implemented now since it impacts how people will use zones:

So, if we later introduced error handling, some fraction of the uses of Zone.current.wrap would be wrong.

TC39 has this marked as ready to move to stage 1, this feature of error handling is not ready to be stage 1 as it basically is domains including the flaws of them (infectious behavior is a big one).

Well, this RFC is trying to get some buy-in on the error handling feature being ready. I think you must still be misunderstanding it if you think it is basically domains, and that it has infectious behavior. Please continue asking questions until I can make it clearer.

trevnorris commented 8 years ago

I think you must still be misunderstanding it if you think it is basically domains, and that it has infectious behavior.

TBH this has worse implications in some cases than domains, if I'm understanding the spec correctly. With domains if I require within a domain it's still possible to escape. e.g.

var my_module;
domain.create().run(() => my_module = require('./my_module'));

// my_module.js
if (process.domain)  // wtf?
  process.domain = null;

Whereas I can't see a way to get around this with Zones. Nor would I be able to see that I'm not using the root Zone (meaning there'd be no .parent). e.g.

// Don't want modules knowing this isn't the root zone.
const z = new Zone({
  name: '(root zone)',
  handleError(e) {
    // Nothing should die
    return true;
  }
});

var my_module;
z.runGuarded(() => my_module = require('./my_module'));

Now, the real problem here is that a module is only ever evaluated once. So if my_module was required somewhere else after this there'd be unexpected side effects.

domenic commented 8 years ago

Hmm, maybe I am not understanding. You can just check if Zone.current.parent === null or not.

It's true that if your caller opts you in to running in a zone, you can't escape. But this is similar to any of the other ways in which your caller can mess up your environment (messing with globals or require.cache; try/catching around the code and doing weird things with the exceptions; etc.). I don't think zones (or domains) were designed to be used in a combative environment.

bmeck commented 8 years ago

In the example given we have:

    rootZone.run(function c() {
      throw new Error("boo");
    });

rootZone has no parent
the rethrow is still caught by zone1

The more I look at this the more I think the problem is the opt-out requirement for unsafe errors. By default errors tend towards unsafe. The example here shows a potentially unsafe error being swallowed. In order to opt-out (saying your code is not 100% safe), you should use zone.run, but here we see an example that your code is treated as safe still.

I personally would think errors default to unsafe to catch, and you would need to be very specific about what errors are able to be handled. Though lets set that aside for a bit.

An error is being caught, implicitly that was not marked as being safe to catch; this is mentioned in the domains post mortem. That is an infection. JS has a problem with try{}catch(e){} being a catch all, where you must opt-in to rethrow an error if it is not handled. This problem was highlighted in an async environment like the ones Zone/Domains aim to work with since shared mutable state is commonplace in JS. Having the implicit catch be a catch all rather than the throw location denoting safety to catch makes this worse.

domenic commented 8 years ago

the rethrow is still caught by zone1

This is just not true...

bmeck commented 8 years ago

Error propagation and handling then occurs like so:

The error is caught by rootZone.run. rootZone has no error handler, so rethrow the error. The error is next caught by this.storedFunction, which is a wrapper around b to run it guarded in zone1. > That sends it to errorHandler1.

domenic commented 8 years ago

Yeah, looks like that is a typo. It should be "The error is not caught by rootZone.run". I will correct the OP.

domenic commented 8 years ago

Looking at the rest of your post, there still seems to be a misconception. You think the try/catch is infectious and catches things unsafely. As I tried to explain, all it does is send things to the zone's error handler---which in the default case is just uncaughtException.

Zones allow overriding that default behavior inside a zone, so that instead of one global mutable state (the uncaughtException handler chain) you can scope it to smaller parts of your program which know how to deal with the errors in a more natural way. Of course, someone could write a pathological handler that ignores all errors---but they could do that for process.on("uncaughtException" too. The difference is with zones the damage they do would be limited to code that they themselves run, instead of global to the whole program.

bmeck commented 8 years ago

I am not so concerned with pathological ignoring of errors, but accidental errors that cause invalid state to be treated as safe.

bmeck commented 8 years ago

This kind of stuff: http://blog.izs.me/post/65712662830/restart-nodejs-servers-on-domain-errors-sensible

domenic commented 8 years ago

Sure. As I said, all zones do is give people a solution that is less footgun prone since it's scoped to their own code, instead of globally shared by the entire program. This seems like a good thing?

bmeck commented 8 years ago

I don't see how it is confined to their own code. Right above we see zone1 start to handle function c, so how is that isolated?

Anything on the stack can catch it still. No concurrency/memory isolation. In your example, programmers need to carefully disseminate in handleError1 and handleError2 if the error is expected; they do not know if function c is in a good state / what it has done. They also do not know if that state is going to affect other concurrent things that are accessing variables that function c may have left in a bad state.

In the example, function c is isolated as best it can be with rootZone.run but zone1 may have no care if it is in a right state. That knowledge is left up to the programmer who may not even own the code for function c knowing that function c is still in a good state, even though it just threw an Error.

In synchronous situations we have the same problem using try{}catch(e){}. Domains and Zones try to introduce an implicit behavior that is not present in those cases by following the async stacks. Once you are in a zone, you are basically there for life. Just like with Promises, safely bailing from unexpected errors means: wrap a try{}catch(e){} around things and abort if the error is a completely unexpected. Rethrow if expected.

@trevnorris :

In order to escape to let unexpected errors escape, you actually need to do:

const z = new Zone();
// this will be like z.runGuarded(c)
z.run(()=>setImmediate(c));

This only works because Zones are not doing stack propagation like domains. I don't really see a way to get to "unguarded" guaranteed except this. If this is done however, you will need to wrap c in a try catch if there are handle-able errors that you wish to not abort.

trevnorris commented 8 years ago

@domenic Could you clarify one thing for me:

const zone1 = Zone.current.fork();
zone1.run(() => net.createServer((c) => {
  const zone2 = Zone.current.fork();
  zone2.run(() => {
    c.on('data', () => {
      // Zone.current == zone1 or zone2?
    });
  });
}).listen(8080));

domenic commented 8 years ago

@bmeck will try to respond to the rest of your comment, but again you seem to misunderstand the proposal.

In order to escape to let unexpected errors escape, you actually need to do:

This is just false. z.run(c) suffices. Errors escape. Look at the implementation of z.run. It does not have any catch blocks.

@trevnorris

Could you clarify one thing for me that's unclear:

In that example Zone.current is zone2. (Unlike domains, there is no special idea of event emitters being bound to a zone. .on stores the passed callback alongside the zone that is current when .on is called, which is zone2.)

bmeck commented 8 years ago

@domenic I can see that rootZone.run(function c() ... does not introduce a try catch. I, however, do see that zone1 catches function c's throw. That is the problem, that is the implicit behavior mentioned in the post mortem. Please reread the post mortem. Focus on Implicit Behavior and Resource Cleanup on Exception.

An alternative is not to re-use the throw statement. Me and @trevnorris were talking about an opt-in "this is safe to handle" mechanism. I am not entirely sure what this would look like, Zone.current.throw however does look sane to avoid the implicit catching of unexpected errors if it were to only propagate through that zone's .parent chain.

bmeck commented 8 years ago

@domenic this may be clearer https://github.com/bmeck/zomains/blob/master/example/accidental-guard.js , backing implementation should match my understanding of Zone's behavior.

mhevery commented 8 years ago

@bmeck let me see if I can shine some light.

'use strict';
require('../');
const UnexpectedError = require('./COMMON').UnexpectedError;
let tocall = 1;
process.on('exit', ()=>{
  if (tocall !== 0) {
    process.reallyExit(1);
  }
});
const child = Zone.current.fork({
  handleError() {
    tocall--;
  }
});
child.run(()=>{
  setTimeout(()=>{
    throw Error('Now handled?!');
  })
});

I think the confusion is about who handles the Error. So if setTimeout throws an error synchronously then it would be not handled by child because child got entered by run. setTimeout could throw synchronously if for example the arguments to it would be incorrect.

(SIDENOTE, There is some other zone above the child. In this example we can assume that it is the root zone, so the synchronous error would propagate to root zone which would forward to uncaughtException. Why? because each VM turn is wrapped in runGuarded)

But in the above example the callback of setTimeout is invoked asynchronously. This means that internally when setTimeout stored the callback it used wrapGuarded, which means that when the callback is invoked it will be invoked as child.runGuarded(() => throw Error(...)). This means that the error will be presented to the child handleError which will tocall-- (and swallow the error. because it did not rethrow)

I believe that is the correct behavior?

bmeck commented 8 years ago

@mhevery it actually does not swallow since it doesn't return true. The code does run in node v4+, and you can see the propagation mechanism in https://github.com/bmeck/zomains/blob/master/example/stop-propagation.js

domenic commented 8 years ago

(and swallow the error. because it did not rethrow)

This is still under debate; the OP's semantics are that you have to return true (or maybe a truthy value?) to mark the error "handled." I had a question under "Issues to discuss" to ask whether we should use true, or rethrow.

Otherwise, @mhevery's explanation is on point. I'm not sure what the linked example was meant to be illustrating, though. I'm also not sure what function c @bmeck's previous post was referring to.

I certainly have read the postmortem, multiple times. I am not sure what your contention is. Is it that neither zones nor domains are perfect for resource cleanup? Yes, this is pretty clear; you can mess up with both (and with promises). A scenario like in the postmortem is complex, and cleaning up after it is complex. The "implicit behavior" section is also fairly inapplicable. There is no global mutable domain.enter() that overrides for all future code; you have to explicitly run code in a zone with zone.run or zone.runGuarded, and as explained upthread, the impact is limited.

bmeck commented 8 years ago

@domenic the fact that Zones are re-enabling guards whenever async thresholds are crossed, even when the original stack was not placing guards is an issue. Another issue is that, when these guards are in place there is not a concept of an unexpected error vs expected error. If the guards could be limited in some way to handle these I would not really have complaints.

I'm going to change from examples to tables perhaps as another means of discussion:

	Zones	Domains
Notification	Sync	Sync
Guarding	Automatic	Automatic
Escape	Partial [1]	Full [2]
Default w/o explicit Guard	Throw	Throw
Default w/ explicit Guard	Throw	Swallow
Propagation	Interruptible	All Handlers

[1] - cannot escape parent Scope's guard synchronously [2] - dispose, manual EE removal, cannot manually remove callbacks

It is the automatic guarding without a full escape mechanism that seriously hoses my concept of this being a good idea. There is no means to skip the zones and propagate directly to the environment synchronously. There is a severe distrust of things that take this approach after the advice and documentation of Domains having little to no effect on people accidentally catching errors. If there was a more robust error handling mechanism / routing I would be less worried. Right now it is better than domains, but still suffers from one of the biggest troubles caused by them.

On a personal note, I think guarding should not be automatic and people should need to guard at task they queue due to concurrency and shared access concerns.

domenic commented 8 years ago

I still don't understand your objection. The way to not use zones to add guards is... Don't use zones to add guards. This feature is opt in. As I said repeatedly, the impact of zones is entirely limited to code run in zones. It's purely additive over today's uncaughtException. There's nothing "automatic" about them.

Guards can of course be limited. Use the JavaScript if statement.

I also disagree strongly with your characterization of promises and would appreciate if you left them out of this because it's clear we're not going to have a productive conversation if we have to retread those discussions.

bmeck commented 8 years ago

@domenic removed, however I don't see how to escape zones if guards are re-applied at every async task w/o some hacky behavior like I showed.

domenic commented 8 years ago

You don't escape zones. Just like anything else a caller can do to control your execution (modify globals, try/catch, etc.), the way to avoid such things if your code doesn't want to run under them is... Don't run your code under them. The control is with the caller, who already controls so many aspects of the environment; the current zone is just one more thing the caller controls.

Stated another way, to escape the current zone you need to escape the current realm. Both are not typically under the control of the caller (although you can always spawn a new one with the vm module or with iframes).

bmeck commented 8 years ago

This implicitly gives guards whenever async tasks are queued. This [and having the caller zone] means escaping zones is less than easy, and cannot be done synchronously. If you assume all errors always are safe to handle and all cleanup and state will still be performed, sure whatever.

const unhandledZone = new Zone({name:"unhandled"});
unhandledZone.run(_ => {
  setTimeout(_ => {
    unhandledZone.run(unsafe_fn);
  });
});

Is the only [sane] way to escape if your function has unsafe to handle behavior that can leave things in a permanent bad state / resource cleanup issues that I see. I just have no assumption that all errors are safe to handle. If our expectations about what is safe to do differs, I doubt this will go anywhere and I will just step out disagreeing.

domenic commented 8 years ago

I am not sure why you keep bringing up resource cleanup or assuming all errors are safe to handle. Zones neither assume that error handler are used to cleanup resources (most people I've talked to want to use them for more granular part-of-app-specific error reporting), nor do they assume that all error handlers will mark all errors as handled (indeed by default per the OP's design they will mark no errors as handled). I still don't understand why this isn't coming across. What can I do to help clarify this further?

Note that there is already an implicit guard whenever async tasks are queued: the guard that sends them to uncaughtException. There will always be an "implicit guard" in any native-code-driven event loop since there's a need to do something in reaction to thrown errors in queued tasks.

All zones do is let you customize this existing implicit guard in a scoped way, instead of using the globally mutable uncaughtException to customize it across your entire program.

bmeck commented 8 years ago

@domenic if this use case was just reporting, adding a guard that did something synchronously w/o ability to swallow, that is a completely different thing in my mind. I keep bringing up resource cleanup and error safety because those are very hard to do when you don't know if your async queued tasks will bubble up far enough to start shutting things down. I still use domains to this day, but every time I turn on error handling I almost immediately turn it off again. Is [the ability to prevent] the error from propagating to the host env a strict requirement here?

domenic commented 8 years ago

The goal of this proposal is not to dictate use cases. If you don't want to do resource cleanup with zones, then I'd suggest not doing resource cleanup with zones?

Is [the ability to prevent] the error from propagating to the host env a strict requirement here?

In general, we want to make zones usable by all programmers for a variety of use cases, and not restrict them based on paternalistic arguments. So yes, we want to allow maximum flexibility. As I said, this is no different from today's uncaughtException; it simply allows tighter scoping so that you don't catch and prevent propagation of all exceptions.

domenic commented 8 years ago

It may help to recall that this is a language feature and needs to be able to serve use cases in many environments---both Node and the browser. Being maximally inclusive is the best path for foundational language features in this way.

bmeck commented 8 years ago

I think error handling needs a closer look than merely enabling all use cases. However, I don't think discussion of domains or other existing attempts and the experience of problems they introduce is going to be relevant here if they are seen as paternalistic arguments.

If the goal is to change how you can contain errors, and encourage behavior similar to uncaughtException in all possible situations, I agree that this would enable such behavior. However, my experiences would lead me to avoid this attempt when defaulting the behavior as described and not having an escape hatch.

If such experience is irrelevant, I don't really have anything to say / am not sure what I was being asked to come and look at?

mhevery commented 8 years ago

@bmeck I think your experience with Domains is valuable, and we do want your input. It feels to me that we don't understand each other proposals and shortcomings. Is there any way we could have a face to face / Video meeting?

trevnorris commented 8 years ago

As I said, this is no different from today's uncaughtException; it simply allows tighter scoping so that you don't catch and prevent propagation of all exceptions.

It should be clarified for posterity that Zones and uncaughtException are different in important ways. First is our ability to be notified when an uncaughtException handler has been set:

process.on('newListener', (e) => print(e, (new Error()).stack.substr(5)));

process.on('uncaughtException', () => {});

While with Zones there is no mechanism that allows observation into Zone creation. The second important distinction is that all uncaughtException handlers can be escaped at any time with a simple

process.removeAllListeners('uncaughtException');

The third difference is that uncaughtException is a global toggle. Unlike Zones which propagate via API calls and asynchronous call stacks. So as far as debugging an application in development, the observability (via public API) of what's happening with uncaughtException make it easier to reason with.

Just like anything else a caller can do to control your execution (modify globals, try/catch, etc.), the way to avoid such things if your code doesn't want to run under them is...

To quickly address the later part of this statement I'd like to point out this demonstrates avoidance of the reality of node's ecosystem. If I don't want to use a Promise-based API, that's easy enough, but Zones propagate and attach themselves to everything automatically. First I'll take the former remark about try/catch as an example:

// mine.js
const Theirs = require('./theirs');
const t = new Theirs();
t.on('complete', () => {});
t.getData();

// theirs.js
const EE = require('events');
const req_caller = require('req_caller');
class Theirs extends EE {
  constructor() { super() }
  getData() {
    try {
      req_caller(info, (data) => this.emit('complete', data));
    } catch (e) {
      // oh no!!! something went wrong in req_caller()
    }
  }
};
module.exports = Theirs;

Here we can see that the try/catch wrapping req_caller() in case a bad argument was passed, or some such. Allowing the execution of the complete callback to happen outside its confines. Unfortunately a Zone error handler would propagate through the req_caller() call and potentially swallow any errors thrown by the application. In the case of domains and uncaughtException it's possible to synchronously escape any error handling. I don't see how that'd be possible with Zones.

I understand what you're saying about each Zone error handler needing to explicitly return true, thus disregarding the potentially problematic behavior of indefinitely propagating Zones through async calls. I'd like to convey that history has shown developers aren't rigorous enough to properly clean up and needing to place a return true; at the bottom of the function isn't enough to make them start.

Combine that with Zone's design to scope functionality (generally a good thing) removes the ability to observe how my own errors are being handled. Going back to the comment about just not using code that uses Zones, I assume you are proposing this in order for as much code as possible to use them, and in the not so uncommon case of having 50+ dependencies/sub-dependencies it will complicate debugging when one of those thinks it's clever to do:

// a.js
const z = new Zone();
let modb;
// First module to require('b'), now a owns it.
z.run(() => modb = require('b'));

When the application uses b it probably expects there's nothing in between it and the module it's using. Unfortunately that may not be the case, leading to confusion:

// mycode.js
const modb = require('b');
const z = Zone.current.fork();
z.run(() => {
  getResources(info, (data) => {
    // I'd expect if modb has an exception that's unhandled by
    // its own handler then it'd immediately propagate to me.
    modb.processData(data, oncomplete);
  });
});

// Here a.js was able to intercept my exception b/c it is in modb's
// Zone propagation chain.
function oncomplete(result) {
  if (result === -1)  // Something bad happened
    throw new Error('ah crap!');
}

The above may only be a hypothetical now, but please believe me that this will happen in the wild, and as module dependencies get deeper and as applications get more complex it may be a while before this would even be noticed, and even harder to find.

An aside, the error handling mechanics proposed, down to returning true to indicate the exception was handled, was implemented in my original AsyncListener API during node v0.11 over two years ago (note: was removed in Dec '14, but added in Sep '13). When I implemented this it seemed like a great idea, but in the end the entire thing was removed because it was found to add too much complexity and made reasoning about error handling difficult (ref: nodejs/node@0d60ab3e). I have been down this road, 3 months of work to get an API very similar to this into node, so believe me when I tell you that a large part of my opposition is from having seen this fail in practice.

mhevery commented 8 years ago

@trevnorris i don't think we are on the same page. I left some comments in the code marked with MISKO

// mycode.js
const modb = require('b');
const z = Zone.current.fork();
z.run(() => {
  getResources(info, (data) => {
    // I'd expect if modb has an exception that's unhandled by
    // its own handler then it'd immediately propagate to me.
    // MISKO: Why would you think that zones would prevent that?
    modb.processData(data, oncomplete);
  });
});

// Here a.js was able to intercept my exception b/c it is in modb's
// Zone propagation chain.
function oncomplete(result) {
  if (result === -1)  // Something bad happened
    throw new Error('ah crap!');
}

Zone's proposal deals with what happens with errors which are unhandled (which get passed to uncaughtException). Zone's do not influence errors which are handled in the library.

This is why I think we are not on the same page and we should step back and better understand what each side is proposing.

Just an FYI, we have been using Zone's in Angular in the browser, and in Dart both in browser and server, where we have had very good experience with them. I understand that dart and browser is not node, but given that we both have had such different experiences, tells me that there is something we are overlooking about the other. I would like to get to that nugget which resulted in such different experiences.

mhevery commented 8 years ago

And as as side note.

in your example:

// a.js
const z = new Zone();
let modb;
// First module to require('b'), now a owns it.
z.run(() => modb = require('b'));

Because your zone does not specify an error handler, then such a zone would be indistinguishable from no zone. My point is that you really have to go out of your way to mess up error handling. It's not something which happens because one forgets a wrong return value.

trevnorris commented 8 years ago

Because your zone does not specify an error handler

Sorry, was being lazy. I'm aware. Will be more explicit in the future.

// MISKO: Why would you think that zones would prevent that?

Brain puttered out near the end of last post. Let me attempt this again. Each step is numbered via (N):

// main.js
// (1) Require 'modb'.
const modb = require('modb');
// (5) Also require sutil for my own needs.
const sutil = require('sutil');
// (6) Fork off a new Zone with error handler.
const z1 = Zone.current.fork({ handleError(e) { } });

// (7) Run command in scope of z2.
z1.run(() => {
  // (8) getResources is defined somewhere in main.js. Is an async
  //     call to retrieve something. If something goes wrong I want it
  //     to be caught by "z1".
  getResources(info, (data) => {
    // (9) Make async call to processData().
    sutil.processData(data, oncomplete);
  });
});

function oncomplete(result) {
  // (11) This callback has been called from "z4" in "sutil.js" below.
  //      Because "z2", created in "modb.js", is in the .parent chain,
  //      the exception will propagate to, and be handled, there before
  //      it hits "z1".
  if (result === -1) throw new Error('ah crap!');
}

// modb.js
// (2) Create, don't fork, a new Zone with error handler.
const z2 = new Zone({ errorHandler() { return true; }, name: '(root zone)' });
let sutil;
// (3) Require sutil in scope of z2, so if sutil uses Zone.current.fork() the
//     .parent chain propagates exceptions to modb.js.
z2.run(() => sutil = require('sutil'));

// sutil.js
// (4) Fork from current zone for error cleanup/logging.
const z3 = Zone.current.fork({ errorHandler(e) { }});

module.exports.processData = function processData(data, cb) {
  // (10) Fork from z3 to funnel error logging, and use this for
  //      data propagation.
  const z4 = z3.fork();
  z4.data = data;
  z4.run(() => yetAnotherAsyncCall((r) => cb(r)));
};

Hopefully this is explicit enough to demonstrate the scenario I described in the previous post. In this case a seemingly unrelated module was able to inject itself info the .parent path of exception bubbling.

My point is that you really have to go out of your way to mess up error handling.

I understand this argument, and to some point it's valid. But in an ecosystem where applications have 50+ dependencies it only takes one to really screw everyone up. Unlike uncaughtException and domains, Zone's scoping will make it difficult to see who's swallowing the error. I can of course inspect Zone.current in all of my callbacks, but I can't force what is Zone.current (can do with domains) nor is there an API call to see where that one is being invoked (can do with uncaughtException).

Also I think you give module authors too much credit in how vigorous they'll be cleaning up resources. If they're using domains today then they'll be using errorHandler() tomorrow, and adding return true at the bottom isn't going to phase them. Hence one reason why we have Warning: Don't Ignore Errors! in big bold letters at the top of the domains doc, and XXX WARNING! BAD IDEA! at the top of the first example.

I would like to get to that nugget which resulted in such different experiences.

Thanks. One failure of this type of error handling is the user's ability to properly clean up after themselves. I don't say this hypothetically. This comes from companies complaining about processes with run-away resource usage because they weren't able to handle error properly.

Here's a simple resource tree scenario for incoming requests:

New http request triggers (1) opening a new file; (2) begin a series of async requests to backend; (3) transform stream to put together the response message.
(2) the async requests are not all done in parallel, since subsequent requests require the response from the previous.
(3) some of the data returned from responses is placed in the transform stream to generate the client response.
(1) as data is written out it's also forked to disk for caching.

Now, to handle this properly any failure must propagate to all other resources to cleanly shutdown. e.g. if writing to the file fails then we need to tell requests to stop, the transform stream needs to be cleaned up and the connection then needs to be sent the error message and closed.

Doing this sort of cleanup reliably, so the application doesn't hold on to gigabytes of Buffers or breaks from EMFILE, etc., is difficult to do. One big difference between node's existing async error handling methods and Zones is that in node every error needs to be explicitly handled. Whereas Zones allow a blanket to cover all of this. Hypothetically this is great, but I've seen too many side effects that leave the application in a bad state. One of the most dramatic I've heard is from @bmeck where the wrong data was sent to the wrong connection. This wouldn't even be that hard to replicate.

Anyway, hope we can talk more about this. Thanks for responding.

mhevery commented 8 years ago

// (1) Require 'modb'.

You bring up an excellent point here. We should make sure that require always executes code in root zone. This way it does not matter who, or which order you got loaded.

// (2) Create, don't fork, a new Zone with error handler.

I see you are trying to make a second root zone. Not sure that this should be allowed since there could only be one root zone. But I don't think being a child of root would change your example.

// (3) Require sutil in scope of z2, so if sutil uses Zone.current.fork() the // .parent chain propagates exceptions to modb.js.

See point 1. require should load in root zone, so the parent zone would be irrelevant.

// (8) getResources is defined somewhere in main.js. Is an async // call to retrieve something. If something goes wrong I want it // to be caught by "z1".

I think this is where we diverge. In order for this to work few things have to be true 1) Zone forking should be done as a local variable in the stack, not as a top level file variable. Only current stack frame has the sufficient contextual information where forking makes sense. Fork should be once per request, not once per file. Yes if you fork once at root, weird things can happen. 2) In my mind setting up a zone with errorHandler which swallows is no different from try{}catch(){} which swallows. One just happens to by async the other sync.

I have a hunch that you are looking at Zones as the thing you set up for a file and all exceptions will be funneled to that Zone, where as I think of it in the context of stack frames and who called who. Where a particular function/method is declared should not have any relevance to what zone it will be in, just like a method can't assume what stack frames will be above it.

Method/function does not live in a zone, it is executed in a zone

Resource release and proper cleanup, can't be done with Zones in the current form. In order to do that, we need to introduce a concept of Tasks, (which we have in Zone.js implementation) but which we don't want to bring to TC39, as it would greatly complicate the discussion.

The way to think about tasks is that each stack frame is associated with a zone. In addition the top most frame in a stack is associated with a Task, which is similar to Zones, but it deals with VM turns rather than Zones. Zones can enter and leave at any point, but Tasks can't. Once you have a concept of a task, then a Zone can know how many outstanding Tasks there are which could have access to a particular resource. When the number of outstanding tasks goes to zero, it should be safe to clean up. The above is not fool proof, but it is better then the ad-hoc solution that we have above. But I digress.

trevnorris commented 8 years ago

We should make sure that require always executes code in root zone. This way it does not matter who, or which order you got loaded.

Cool. So take const global_zone = Zone.current; at bootstrap then do global_zone.run() in require()?

I see you are trying to make a second root zone. Not sure that this should be allowed since there could only be one root zone.

I'm not sure how to determine if a Zone is the root other than by name. Does this mean creating a root with the name '(root zone)' would throw, or there some other mechanism to determine this?

In my mind setting up a zone with errorHandler which swallows is no different from try{}catch(){} which swallows. One just happens to by async the other sync.

Async, and automatically propagates. After N async calls an exception could propagate back through that many stacks, and a number of modules, to be handled. This can cause a debugging nightmare. Here's a simpler example of how a library can intercept an application's exception:

// main.js
const mlib = require('mlib');
const net = require('net');
net.createServer((c) => {
  mlib.doStuff(c, (data) => {
    if (data === null)  // ah crap
      throw new Error('goodbye world');
  });
}).listen(8080);

// mlib.js
module.exports.doStuff = function doStuff(c, cb) {
  const z1 = Zone.current.fork({ errorHandler() { return true } });
  z1.req = { zone: Zone.current, cb };
  z1.runGuarded(() => doSomeAsyncStuff(c, doneWithAsync));
};
function doneWithAsync(data) {
  const req = Zone.current.req;
  req.zone.run(() => req.cb(data));
}

As the application author I'd expect the throw in the doStuff() callback to crash the application. Since I haven't explicitly setup up an exception handler. Instead the module I used will intercept that exception. There's never a good reason for a library to handle an application's exception, where the application hasn't explicitly told the library to do so. The only way I can see to get around this at the moment is:

// mlib.js
module.exports.doStuff = function doStuff(c, cb) {
  const z1 = Zone.current.fork({ errorHandler() { return true } });
  z1.req = { zone: Zone.current, cb };
  z1.runGuarded(() => doSomeAsyncStuff(c, doneWithAsync));
};
function doneWithAsync(data) {
  const req = Zone.current.req;
  (new Zone()).run(() => setImmediate(() => {
    req.zone.run(() => req.cb(data));
  }));
}

Domains had a hard enough time getting developers to do this properly, and all libraries had to do was call .exit() before running the callback.

mhevery commented 8 years ago

Cool. So take const global_zone = Zone.current; at bootstrap then do global_zone.run() in require()?

Zone.current.root gives you root.

I'm not sure how to determine if a Zone is the root other than by name. Does this mean creating a root with the name '(root zone)' would throw, or there some other mechanism to determine this?

Each Zone has a parent. The root zone is the one which has no parent, and behaves indistinguishable from the current platform behavior. Only the platform is allowed to create a root zone, everyone else must fork.

I am a bit confused about your first example. You set up z1 which swallows exceptions, and then you are concerned that exceptions get swallowed? How is this different from setting uncaughtException to ignore errors?

Why do you want mlib.js to create zones in first place? Is your goal to do error logging? (I guess my assumption is that forking zones should be pretty rare occurrence, which I don't think you share, and I am trying to understand this.)

In my mind the only time a library should fork or capture a zone is if it does user queue operations. (Such as implementing work queue)

Could we move this discussion from hypothetical here-is-how-i-can-break-zones to concrete use cases where zones fails to perform as expected. Sorry, it is hard for me to follow, and I think the discussion would be more productive.

bmeck commented 8 years ago

@mhevery Zone.current.root does not currently give you access to the Realm's "root" zone. Spec might need to be amended?

bmeck commented 8 years ago

@mhevery code examples were given that show what were problems for domains; there is a distinct difference in:

behavioral expectations
use case goals
necessary first steps

We should setup a call sometime next week. I setup a doodle to schedule a time:

http://doodle.com/poll/e2m2aezi3egzqvrt#table

domenic commented 8 years ago

You can tell if a zone is the root zone via z.parent === null. You don't need to check the name.

Currently the spec does not prevent creating new root zones via the Zone constructor. Maybe that should be prevented, although in general giving the UA magic powers (i.e. the ability to create a zone without using the constructor) is always a bit weird to me. But we do it all over the web platform so it's probably fine.

You can find the root zone by climbing the Zone.current.parent.parent.parent... chain. A convenience accessor (Zone.root) seems like a reasonable addition, although it's not clear that it's something we should make convenient. (The only use case presented so far is require, which is specialized enough it can do its own tricks.)

Next week is TC39 so I will not be available most days. I'll fill out the doodle but @mhevery can probably substitute.

domenic / zones