flutter / flutter

[Impeller] Optimize surfaces / offscreen buffers to be 40 bits/pixel. #127223

Open gaaclarke opened 1 year ago

gaaclarke commented 1 year ago

When investigating the regression in performance caused by wide gamut support we found two things:

1. The regression didn't show up on A15, but did on A13.
2. The majority of additional time was in the blur fragment shader.

We knew 64 bits/pixel is documented to be slower. Apple's recommendation is to use 40 bits/pixel instead (a BGR10_XR color buffer with a u8 alpha buffer). That should make those operations faster.
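To put rough numbers on that difference, here is a minimal Dart sketch of the memory footprint of one surface under each scheme. The 1170x2532 surface size and the helper name `surfaceBytes` are assumptions made up for illustration. The per-pixel sizes follow from the formats themselves: RGBA16Float is four 16-bit channels, BGR10_XR is a 32-bit format, and the separate alpha plane adds 8 bits.

```dart
// Bits per pixel for the two schemes discussed above.
const int kF16BitsPerPixel = 4 * 16; // RGBA16Float: four 16-bit channels.
const int kSplitBitsPerPixel = 32 + 8; // BGR10_XR color plane + u8 alpha plane.

/// Bytes needed for one width x height surface at [bitsPerPixel].
int surfaceBytes(int width, int height, int bitsPerPixel) =>
    width * height * bitsPerPixel ~/ 8;

void main() {
  // Hypothetical surface size, chosen only for illustration.
  const width = 1170;
  const height = 2532;

  final f16 = surfaceBytes(width, height, kF16BitsPerPixel);
  final split = surfaceBytes(width, height, kSplitBitsPerPixel);

  print('64 bits/pixel: $f16 bytes');
  print('40 bits/pixel: $split bytes');
  print('ratio: ${(split / f16).toStringAsFixed(3)}');
}
```

The ratio is 40/64 = 0.625 regardless of resolution. This only models bytes touched per full-surface read or write; it says nothing about how a given GPU actually handles either format, or about the cost of the second sampler.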

The difficulty with that change is that using that scheme would require:

1. a second set of shaders since it would be reading from 2 samplers to get full color
2. all the logic that uploads and downloads textures would have to be updated to manage two textures
3. the blit operations to the surface would need to be updated to do 2 blits
4. the surface would have to be changed to match the 40 bits/pixel scheme

Maybe in a future where we don't have to support non-wide-gamut devices it will be more palatable to throw away the old code and just embrace the 40 bits/pixel path. However, it seems that in that same future the cost difference between 64 bits/pixel and 40 bits/pixel will be negligible. 64 bits/pixel may even be faster on newer hardware since it's just using one sampler.

cc @jonahwilliams

knopp commented 1 year ago

FWIW, there is a pretty significant regression reproducible on A15 (iPhone 13 Pro) as well:

import 'dart:ui';
import 'package:flutter/material.dart';

void main() {
  runApp(const MainApp());
}

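// Kept small on purpose so the measurement is dominated by render pass
// overhead rather than by the blur itself.
const kSigma = 1.0;
// Number of stacked BackdropFilter strips.
const kNumberOfBlurs = 6;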

class _Blur extends StatelessWidget {
  const _Blur();

  @override
  Widget build(BuildContext context) {
    return ClipRRect(
      child: BackdropFilter(
        blendMode: BlendMode.srcIn,
        filter: ImageFilter.blur(
          sigmaX: kSigma,
          sigmaY: kSigma,
        ),
        child: Container(
          color: Colors.red.withAlpha(30),
          child: const Text('Blur'),
        ),
      ),
    );
  }
}

class MainApp extends StatelessWidget {
  const MainApp({super.key});

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: Scaffold(
        body: Stack(
          children: [
            ListView.builder(
                itemBuilder: (context, index) {
                  return Container(
                    padding: const EdgeInsets.all(10),
                    child: Text('Item $index'),
                  );
                },
                itemCount: 1000),
            for (var i = 0; i < kNumberOfBlurs; ++i) ...[
              Positioned(
                left: 0,
                right: 0,
                top: i * 150.0,
                height: 60,
                child: const _Blur(),
              ),
            ]
          ],
        ),
      ),
    );
  }
}

Wide gamut disabled:

Screenshot 2023-07-29 at 16 40 56

Wide gamut enabled:

Screenshot 2023-07-29 at 16 42 13

This is intentionally with sigma=1 to measure render pass overhead. Things do get worse with increased sigma, though (e.g. at sigma=30: 8ms vs 22ms).

In our production app we have two blurs (toolbar + tab bar) and are unable to hit 60fps (let alone 120fps) with wide gamut enabled.

knopp commented 1 year ago

Also, I tried a quick and dirty hack to reuse offscreen textures. It seems to save about 1-1.5ms per frame in the wide gamut case. Doesn't solve the issue but it's certainly something to consider.

knopp commented 1 year ago

So for each of these 6 relatively small backdrops, there is an MSAA backdrop texture fill that blits the contents of the entire frame. In total, these blits take about 80% of rendering time. @bdero, any ideas here?

knopp commented 1 year ago

Here's the performance on iPhone 13 Pro, wide gamut, no MSAA backdrop (LoadAction::kLoad + StoreAction::kStoreAndMultisampleResolve + just resolve on last pass):

Screenshot 2023-07-29 at 22 42 22

MSAA backdrop from above for comparison:

Screenshot 2023-07-29 at 16 42 13

In this particular case (wide gamut), MSAA load/store seems to perform better than blitting from the previous resolve. But it's still pretty slow (~1ms per render pass). I'm wondering if we could render backdrops that don't sample from each other in one pass. This would help with the common case of a blurred header + tab bar. @jonahwilliams

I think something like "if this is not the first pass and nothing in this pass has rendered where the backdrop is sampling from, don't end the pass" would already be a significant improvement.
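A minimal Dart sketch of that heuristic, to make the condition concrete. This is not Impeller code: `Box`, `OpenPass`, `drawnBounds`, and `mustEndFor` are hypothetical names, the 390-pixel width is made up, and the sketch ignores that a blur samples slightly outside its own bounds.

```dart
/// Minimal axis-aligned rectangle; a stand-in for dart:ui's Rect so the
/// sketch runs without Flutter.
class Box {
  const Box(this.left, this.top, this.right, this.bottom);
  final double left, top, right, bottom;

  bool overlaps(Box other) =>
      left < other.right &&
      other.left < right &&
      top < other.bottom &&
      other.top < bottom;
}

/// Hypothetical bookkeeping for the render pass that is currently open.
class OpenPass {
  OpenPass({required this.isFirstPass});
  final bool isFirstPass;

  /// Bounds of everything drawn into this pass so far.
  final List<Box> drawnBounds = [];

  /// True if the pass has to end before a backdrop filter samples
  /// [backdropBounds]: always for the first pass, otherwise only when
  /// something in this pass was drawn into the sampled area.
  bool mustEndFor(Box backdropBounds) {
    if (isFirstPass) return true;
    return drawnBounds.any((drawn) => drawn.overlaps(backdropBounds));
  }
}

void main() {
  const width = 390.0; // Hypothetical screen width.
  var pass = OpenPass(isFirstPass: true);
  var passCount = 1;

  // Same layout as the repro: six 60px strips, 150px apart.
  for (var i = 0; i < 6; ++i) {
    final backdrop = Box(0, i * 150.0, width, i * 150.0 + 60);
    if (pass.mustEndFor(backdrop)) {
      pass = OpenPass(isFirstPass: false);
      passCount++;
    }
    // The filtered backdrop is drawn into whichever pass is now open.
    pass.drawnBounds.add(backdrop);
  }
  print('render passes used: $passCount');
}
```

In this toy model the six non-overlapping strips need 2 passes, where ending the pass at every backdrop would need 7.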

knopp commented 1 year ago

For reference, with a quick hack to render all backdrops in a single pass (same visual result because they don't sample from each other):

Screenshot 2023-07-29 at 23 55 20

(Note that this is with sigma=1. The blur performance is still an issue, but unrelated to render pass overhead.)

knopp commented 1 year ago

Didn't mean to hijack this issue - moved to https://github.com/flutter/flutter/issues/131567 and https://github.com/flutter/flutter/issues/131568.

jonahwilliams commented 7 months ago

Shower thought @gaaclarke, doesn't switching to BGR10_XR solve the plus clamping problem?

gaaclarke commented 7 months ago

> Shower thought @gaaclarke, doesn't switching to BGR10_XR solve the plus clamping problem?

Yep, there are a lot of other benefits to using BGR10_XR. I'd like it if we were using that. Originally when the feature shipped it was using it, but we had to drop back to f16 when we ran into a bug. I can't remember what it was off the top of my head now though.

Edit: we had it for opaque Flutter views; for transparent ones we always used f16.

jonahwilliams commented 7 months ago

I see the following comment:

    // MTLPixelFormatRGBA16Float was chosen since it is compatible with
    // impeller's offscreen buffers which need to have transparency.  Also,
    // F16 was chosen over BGRA10_XR since Skia does not support decoding
    // BGRA10_XR.

The documentation for https://developer.apple.com/documentation/metal/mtlpixelformat/mtlpixelformatbgra10_xr?language=objc says:

The alpha component is always clamped to a [0.0, 1.0] range in sampling, rendering, and writing operations, despite supporting values outside this range.