Sometimes it is impossible to control async work with zones, because some values are tied to root zone

dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.

https://dart.dev

BSD 3-Clause "New" or "Revised" License

10.08k stars 1.56k forks source link

Sometimes it is impossible to control async work with zones, because some values are tied to root zone #40131

Open dmitryelagin opened 4 years ago

dmitryelagin commented 4 years ago

In some cases, the control over the Dart zone can be limited when we use Future or Stream because some inner constants are tied to the root zone.

For example (inspired by FakeAsync from Quiver):

import 'dart:async';

import 'package:test/test.dart';

void noop([Object _]) {}

void main() {
  group('Counter should', () {
    List<void Function()> actions;

    int testCounter;
    int callbacksCount;
    int periodicCallbacksCount;
    int microtasksCount;

    // Create zone specification that just saves all callbacks
    // for timers and microtasks and then does nothing
    final zoneSpecification = ZoneSpecification(
      createTimer:
        (source, parent, zone, duration, f) {
          callbacksCount += 1;
          actions.add(f);
          return parent.createTimer(zone, duration, noop);
        },

      createPeriodicTimer:
        (source, parent, zone, period, f) {
          periodicCallbacksCount += 1;
          actions.add(() => f(null));
          return parent.createPeriodicTimer(zone, period, noop);
        },

      scheduleMicrotask:
        (source, parent, zone, f) {
          microtasksCount += 1;
          actions.add(f);
          parent.scheduleMicrotask(zone, noop);
        },
    );

    setUp(() {
      actions = [];
      testCounter = 0;
      callbacksCount = 0;
      periodicCallbacksCount = 0;
      microtasksCount = 0;
    });

    test('be incremented two times', () {
      runZoned(
        () {
          // Do some async work to register some callbacks
          Stream<Object>
            .periodic(const Duration(hours: 1))
            .take(1)
            .listen(
              (_) {
                print('onData');
                testCounter += 1;
              },
              onDone: () {
                print('onDone');
                testCounter += 1;
              },
            );

          // Call all registered callbacks syncroniously until there
          // will be no new scheduled callbacks
          while (actions.isNotEmpty) {
            final fns = List.of(actions);
            actions.clear();
            fns.forEach((fn) => fn());
          }

          // Check registrations count
          print('createTimer: $callbacksCount');
          print('createPeriodicTimer: $periodicCallbacksCount');
          print('scheduleMicrotask: $microtasksCount');

          // Test counter
          expect(testCounter, 2);
        },
        zoneSpecification: zoneSpecification,
      );
    });
  });
}

It is expected that the test should pass, but it fails:

onData
createTimer: 0
createPeriodicTimer: 1
scheduleMicrotask: 0
Expected: <2>
  Actual: <1>

package:test_api           expect
test/dart_test.dart 81:11  main.<fn>.<fn>.<fn>
dart:async                 runZoned
test/dart_test.dart 50:7   main.<fn>.<fn>

onDone
✖ Counter should be incremented two times
Exited (1)

We found this on Dart SDK 2.4.0, it also can be reproduced with Dart SDK 2.7.0.

Before it fails, .take(1) closes stream which cancels the subscription. In the current example, cancellation returns Future which is resolved with null in the root zone. It is constant. https://github.com/dart-lang/sdk/blob/master/sdk/lib/async/future.dart#L151 The onDone callback will be called after cancellation Future resolving, but it is already resolved. So, it will be called with scheduleMicrotask from the zone of the resolved Future, which is the root zone. And here we lost control over onDone and why the test failed.

Unfortunately, this is a very popular fail case in tests, where Quiver, RxDart and other Stream-heavy libraries are used and when we try to test something synchronously.

mit-mit commented 4 years ago

cc @lrhn

lrhn commented 4 years ago

This is a thorny area of the async behavior.

Futures completing doesn't necessarily use the microtask at all. One future completing can easily cascade into a number of other futures completing without going back to the microtask queue. As such, intercepting the microtask queue isn't guranteed to prevent futures from completing. If you don't look, the code works as if it was completing the futures in separate microtasks (they are non-interleaved and happening only when another microtask could happen), but if you inspect the microtask queue, then you can see that it's not being used for all of the completion events. There is a "future propagation loop" running, usually inside a single microtask.

It's also a hard to figure out which zone a microtask is scheduled in, and which it should be scheduled in.

If a future completed in zone A, and it has a listener which expects to run in zone B, should the future schedule itself in zone A or zone B (or the root zone, for good measure)?

Picking the listener is tricky, because if the future has two listeners from different zones, B and C, which one should it then schedule itself in? It only needs one schedule because it can, and does, complete both listeners in the same go. SO the choice between listeners is guaranteed to be arbitrary. That actually suggests to use the future's own (A) zone. However, there is no reason for a future to retain its zone after it has completed with a value, and I'd like to avoid that, so maybe using the A zone isn't the best choice anyway. I'm willing to use the root zone here, which definitely means that you won't be able to prevent the microtask.

Currently I believe we always use the A zone.

I am willing to look at this (read: I have been looking at it), but I haven't found a clearly superior approach yet.

tl;dr: Don't expect intercepting the microtask queue to allow you to control futures. It may or it might, and which one it is will depend on the accidental ordering of events.

dmitryelagin commented 4 years ago

@lrhn, thank you for your response! Still, I have some additional thoughts on this topic.

Although I understood the reasons for scheduling once and in the Future's own zone, it is still a very confusing behavior for users. The simplest found way to fix an issue is to wrap resolved Future with Future.value(resolvedFuture);:

import 'dart:async';

import 'package:test/test.dart';

void noop([Object _]) {}

void main() {
  group('Future should', () {
    int schedulesCount;

    Future<int> resolvedExternalFuture;

    final externalZone = Zone.current;

    final handledZone = Zone.current.fork(
      specification: ZoneSpecification(
        scheduleMicrotask: (source, parent, zone, f) {
          schedulesCount += 1;
          parent.scheduleMicrotask(zone, noop);
          f();
        },
      ),
    );

    setUp(() {
      schedulesCount = 0;

      externalZone.run(() {
        resolvedExternalFuture = Future.sync(() => null);
      });
    });

    test('schedule microtasks in handled zone', () {
      handledZone.run(() {
        // This will be registered in handled zone but will be scheduled
        // in external zone which is confusing and will not increment counter
        resolvedExternalFuture.whenComplete(() {
          print('external: scheduled');
        });

        // This is a workaround to force the callback to be scheduled
        // within handled zone
        Future.value(resolvedExternalFuture).whenComplete(() {
          print('workaround: scheduled');
        });
      });

      expect(schedulesCount, 1);
    });
  });
}

The response will be:

workaround: scheduled
external: scheduled
✓ Future should schedule microtasks in handled zone
Exited

However, this can't be used, when the problem begins and happens in external code and before I can somehow influence it. This leads me to think that the root of the problem is the Future static constants. In the case of the first test of thread - Future._nullFuture.

The nullFuture is made before any code has a chance to make Future or Stream, and that's why it is bound to root zone out of the box. But it means that I can get Future with root zone even when all of my Futures or Streams were made and processed in my single custom zone. This fact is pretty confusing and leads to behavior from the first test.

IMO, something like _Future<Null>.value(null); instead of nullFuture (in places like this) can completely resolve the whole special case of this thread.

lrhn commented 4 years ago

Not using a reusable nullFuture will definitely change some things. Maybe not to what you expect (you never know what zone the plumbing code that propagates future results run in). It would make optimizations like https://github.com/dart-lang/sdk/blob/master/sdk/lib/async/stream_impl.dart#L211 impossible (or harder), though.

yjbanov commented 4 years ago

@lrhn

It's also a hard to figure out which zone a microtask is scheduled in, and which it should be scheduled in.

It is true that the Zone API is very flexible and you can write code where it is indeed hard to figure out what runs in which zone. However, without any discipline zones become quite useless. Currently, most major frameworks - Flutter, AngularDart, Rx, Quiver - rely on the following two properties:

A callback executed asynchronously must execute in the zone it was registered in.
The receiver of an async callback schedules microtasks and timers in the current zone.

The first property seems to hold in Stream.listen(onDone). However, it violates the second property.

This is very surprising given that Future.then, Future.delayed, Time.run, scheduleMicrotask, Stream.listen, Window.onDrawFrame (or any Window.on* callback in Flutter), and many more all uphold these properties. This allows them to be combined and chained in arbitrary ways without leaving the zone.

We use this to implement a number of useful features:

WidgetTester creates a fake timeline (using Quiver's FakeAsync) allowing you to run tests that span many seconds in logical time in milliseconds of real time. You can even write a widget test that animates for 10 years and run that test in less than a second.
Developers write "activity tracking frameworks" that allow them to collect analytics and performance metrics from their apps by creating zones for user actions and measuring their performance.
Developers collect crashes due to uncaught errors by running apps inside a zone with an uncaught error handler.
AngularDart uses zones to detect user actions in order to kick a dirty check cycle to update the UI. Kicking code out of the NgZone will result in UI to stop updating.

It is therefore very surprising to see Stream.listen(onDone) to kick you out into the root zone. I think this is a serious bug.

yjbanov commented 4 years ago

Here are some bug reports associated with this issue:

https://github.com/google/quiver-dart/issues/583 https://github.com/flutter/flutter/issues/40074 https://github.com/flutter/flutter/issues/17738 https://github.com/ReactiveX/rxdart/issues/365

lrhn commented 4 years ago

I don't think we've ever promised that "The receiver of an async callback schedules microtasks and timers in the current zone." In fact, I don't even know what it means. The zone that is current when? I would read it as saying that Future.then(something) should schedule microtasks related to something in the zone that is current when then(something) is called. It currently doesn't, and again, if there is more than one listener, then it won't use both. And if there is no listener, then the option does not exist, but we'll still schedule something to complete the future asynchronously (at least if it is an error).

For callbacks of Future and StreamSubscription we ensure that the callbacks remember the zone they were added in, and run in that zone. It's not given which zone the underlying controller/future runs scheduleMicrotask in.

For Future it will likely be the zone where the future was "created" (which is not a specified concept, but when you call Future.then, you create the result future in the current zone of that call). The future actually only holds on to that zone in order to use its scheduleMicrotask, and I'd like to avoid that memory tax. And again, not all future completions use scheduleMicrotask, some just piggyback on other futures being completed, so we can complete multiple futures in a single microtask.

A stream will "run" in the zone where the listen was called. We just call the onListen in the zone where listen was called, and let the controller take it from there. That usually ensures that the events will come from that zone, but the stream controller actually mostly ignores zones, it's the stream subscription which switches zones when it calls its callbacks. So, schedule microtask will happen in the zone that the controller is invoked in. (For pauses and resumes, it's probably the zone that the calls to the stream subscription happens in).

The issue here seems to be that the onCancel call returns a future, and somehow that future's zone is used to schedule a microtask (because it's a completed future). There is nothing we can really do to prevent this without changing the behavior of completed futures (to, say, use the listener's zone).

We can try that, but it is very likely that a lot of other unstable code will break then.

yjbanov commented 4 years ago

@lrhn

I don't think we've ever promised that "The receiver of an async callback schedules microtasks and timers in the current zone."

It doesn't have to be a promise, but the default. If nothing in the documentation explicitly states that an API is shifting zones then the expectation is that it won't move my code to another zone. Any other behavior leads to surprises. After all, the Zone API is all about being able to intercept timers, microtasks, and other operations. In order to succeed zone behavior must be predictable. If some API doesn't openly scream "I'm messing with zones" (such as FakeAsync does in both the class name and in the docs) but then goes ahead and messes with zones, it makes the zone API too unpredictable and confusing to be useful in practice.

So far the vast majority of core libraries has been well-behaved. This instance is one of the rare exceptions, which makes it so surprising.

In fact, I don't even know what it means. The zone that is current when?

All code runs in some zone, if anything, the root zone. Consider a block of Dart code (a series of statements). Whatever zone the first statement ends up running in (the "current zone"), then, by default, all other statements should run in the same zone, unless there's something that very clearly shifts the zone. In the example above the statements inside onDone unexpectedly end up in a different zone:

          print(Zone.current.hashCode);
          Stream<Object>  // zone 1
            .periodic(const Duration(hours: 1))  // zone 1
            .take(1)  // zone 1
            .listen(  // zone 1
              (_) {
                print(Zone.current.hashCode);
                print('onData');  // zone 1
                testCounter += 1;  // zone 1
              },
              onDone: () {
                print(Zone.current.hashCode);
                print('onDone');  // zone 2!!!
                testCounter += 1;  // zone 2!!!
              },
            );
          // zone 1

This is unexpected because there's nothing about producing a periodic stream of events that implies zone changes. It's a pure in-memory computation that relies on microtasks and timers, all of which are interceptable by zones.

Edit: added Zone.current printouts that shows zone changes.

lrhn commented 4 years ago

We definitely do promise that the callbacks to Future's then and catchError methods, and to Stream's listen method will run in the zone where that call was made. If they don't, that's a bug.

The onListen etc. callbacks on a stream controller happens in the zone where the user called listen, pause, resume or cancel, not where the callbacks were set. (That too is a promise, and it's what guarantees that a stream listen operation actually happens in the zone where the listen method was called).

I honestly don't understand the connection between a Future and a Completer and whatever zones are involved. The Future has a zone before it's completed. That never made sense to me.

Everything else is less specified. We do not guarantee where the implementation runs, which does affect any microtasks it might schedule.

yjbanov commented 4 years ago

We definitely do promise that the callbacks to Future's then and catchError methods, and to Stream's listen method will run in the zone where that call was made. If they don't, that's a bug.

The onListen etc. callbacks on a stream controller happens in the zone where the user called listen, pause, resume or cancel, not where the callbacks were set. (That too is a promise, and it's what guarantees that a stream listen operation actually happens in the zone where the listen method was called).

This is both surprising and undocumented. Surprising, because what would the author of the onListen callback, which is conceptually "on the other end of listen", know about the behavior of the zone where listen runs. Imagine two libraries each running in its own zone, ending up at two ends of a stream, one at the listener side, and another at the stream controller side. These two libraries should be able to have control over zones their code runs in. The stream should establish a safe connection between the two zones without entangling them.

I honestly don't understand the connection between a Future and a Completer and whatever zones are involved. The Future has a zone before it's completed. That never made sense to me.

I don't think Future needs to have a zone in a way that leaks to the call site (i.e. code that creates and waits for the future).

Everything else is less specified. We do not guarantee where the implementation runs, which does affect any microtasks it might schedule.

It is actually fine for the implementation to run in a different zone, as long as it has a good reason to do so and has good docs. Stream's onDone doesn't have either.

Microtasks and timers are special, because ZoneSpecification has explicit handling for them. Hence the expectation that if an implementation schedules microtasks/timers, the zone should be able to intercept them.

lrhn commented 4 years ago

I admit that the zone interaction with streams and futures is almost entirely undocumented.

The one thing that you can depend on is that only callbacks which are registered in a zone will also be called in a specific zone. That's the then/catchError/whenComplete callbacks of Future and the onData/onError/onDone callbacks of StreamSubscription. These are the callbacks which correspond to the asynchronous computation.

All other callbacks are not zone aware, and will be called in whatever zone is current at the time they are triggered. These are control callbacks for streams (onListen / onPause / onResume / onCancel) and all callbacks called from stream builder functions like Stream.where and Stream.map. The former is called in the zone where the user called stream.listen/subscription.pause, the latter only gets attached when the modified stream is listened to, so it's in the same zone as the listen call.

Making it possible to intercept microtasks/timers/etc. in a somewhat predictable way is actually why the stream controller's onListen runs in the zone of the stream's listen call. Everything below the listen call is considered implementation, and this way the implementation that creates stream events on a stream subscription, and schedules them, will run in the zone where the stream was listened to. Stream objects themselves are passive, they have no computation behind them, and therefore they have no zone. The computation belongs to the stream subscription which is created when you listen to the stream. That's why the even creation code runs in that zone, so you do get to control how microtasks are used.

Zones are still a hot mess in many other ways, but I think that particular choice is reasonable.

It does mean that if you move a stream subscription into a different zone and add listeners there, the code providing events and the code consuming events will no longer be in the same zone, because setting a new listener with subscription.onData(handler) will remember the current zone and run the handler callback in that zone.

One big confuser here is that a Future contains the zone of its "computation" even when all you did was create a completer. The future belongs to that zone after the completer was created, and completing it from a different zone can get weird (well, maybe, I can't say what happens without checking the code).

The other issue is that sometimes the future implementation needs to schedule a microtask. It has so far consistently used the zone that the future was created in. I'd be fine changing that to the zone of the (well, a) listener when there is a listener. If there isn't, it's an error future with no current listener, and we don't generally share those, so the current zone is likely fine.

It is a change, and since people keep depending on the behavior we have, whether promised or not, it could be a breaking change.

(I've wanted to change Future for a while, so that the zone of the future depends on the place the completer is completed, not where it's created, which would also avoid any need to store a zone in a non-error future. I also want to remove the concept of error zones where errors cannot propagate across boundaries, because the concept is so ill-defined that I can't tell you what it means, and all it ends up doing is having some futures surprisingly never complete.)

fzyzcjy commented 2 years ago

Hi is there any updates? This indeed makes fake_async quite useless...

lrhn commented 5 months ago

No new plans. The fake_async package has some usability issues that make it very easy to have deadlocks or just no progress, but that's not entirely because of this issue. It's inherently tricky to use correctly.

The constant null future is a problem, and there is a fix for it, but that fix actually breaks some tests that rely on the current behavior of running some code in the root zone. I gave up on landing that change once. I'm willing to look at it again, but don't expect a different outcome.

rbellens commented 1 week ago

StreamSubscription.cancel currently returns the Future._nullFuture. I would expect that when calling cancel in a particular zone, this returns a future that is created in that zone and therefore uses the zone's scheduleMicrotask. Because this is not the case, the fake async zone cannot control this future and calling flushMicrotasks will not execute any registered callbacks. This behavior often causes tests with fake async not to complete, which is very difficult to debug.

I don't know how important the use of the Future._nullFuture is for the overall performance, but I would expect that public api's don't return this unless called from the root zone.

lrhn commented 6 days ago

The _nullFuture doesn't exist for performance reasons, but for timing. It exists because the cancel method could return null prior to null safety, which was interpreted to not introduce an async delay. The recognizable _nullFuture is recognized in the same places to not introduce a delay, keeping the same timing as before.

It does that in the root zone, which was actually not always true. It used to be created lazily in whichever zone first accessed it. Using any one zone is a problem, because it is returned from code running on other zones than the root zone.

Changing that to, fx, a null future per zone now breaks bad code that relies on that future being in the root zone. If we can find and fix that bad code, somehow, maybe we can fix it.

rbellens commented 6 days ago

Can you show me an example of such bad code? I'm interested to learn more.

natebosch commented 5 days ago

Can you show me an example of such bad code?

We have not tracked it down to a minimal reproduction case.

lrhn commented 5 days ago

I think I remember one example using fake_async which depended on one particular future being in the root zone, and therefore running even when the fake-async zone was not being driven forward. Code triggered by that future would then call elapse on the fake-async context. If the future changed to be inside the fake-async zone, then nothing happened.