Excessive inlining may optimize away the function to benchmark

wismill commented 1 year ago

I realized that it is really easy to be fooled by excessive inlining.

The following file illustrates this using GHC 9.4.5:

If run with -O1, then every function is benched normally.
If run with -O2, then some functions are optimized away (see comments). This is visible in Core output.

The functions are trivial, but I got this issue with less trivial functions in unicode-data.

Bench.hs

```haskell #!/usr/bin/env cabal -- Make this executable and run it with ./Bench.hs {- cabal: build-depends: base >= 4.16 && < 4.19, deepseq >= 1.4 && < 1.5, tasty-bench >= 0.3.4 && < 0.4, ghc-options: -O2 -Wall -fdicts-strict -rtsopts -with-rtsopts=-A32m -fproc-alignment=64 -ddump-simpl -ddump-stg-final -ddump-cmm -ddump-to-file -dsuppress-idinfo -dsuppress-coercions -dsuppress-type-applications -} module Main where import Test.Tasty.Bench (defaultMain, bench, nf, bgroup) import Control.DeepSeq (NFData (..), deepseq) import Data.Ix (Ix(..)) newtype MyInt = MyInt Int instance NFData MyInt where {-# NOINLINE rnf #-} rnf (MyInt a) = rnf a {-# INLINE f #-} f :: Int -> Int f cp = if cp < 790 then negate cp else cp main :: IO () main = defaultMain [ bgroup "single value" [ bench "negate Int" (nf negate (789 :: Int)) -- not ok: rnf & negate too small , bench "negate MyInt" (nf (MyInt . negate) (789 :: Int)) -- ok: rnf not inlined , bench "negate maybe" (nf (Just . negate) (789 :: Int)) -- not ok: rnf & negate too small , bench "negate list" (nf ((:[]) . negate) (789 :: Int)) -- ok: rnf big enough , bench "gcd" (nf (gcd 123) (789 :: Int)) -- ok: gcd big enough , bench "f" (nf f (789 :: Int)) -- ok: f big enough ] , bgroup "range (rnf on function)" [ bench "negate MyInt" (nf (foldr (\n -> deepseq (MyInt (negate n))) () . range) (789 :: Int, 799)) -- not ok , bench "gcd" (nf (foldr (\n -> deepseq (gcd n 123)) () . range) (789 :: Int, 799)) -- ok , bench "f" (nf (foldr (\c -> deepseq (f c)) () . range) (789 :: Int, 799)) -- not ok ] , bgroup "range (rnf on accumulator)" [ bench "negate MyInt" (nf (foldr (\n -> (`deepseq` (MyInt (negate n)))) (MyInt minBound) . range) (789 :: Int, 799)) -- ok , bench "gcd" (nf (foldr (\n -> (`deepseq` (gcd n 123))) minBound . range) (789 :: Int, 799)) -- ok , bench "f" (nf (foldr (\c -> (`deepseq` (f c))) minBound . range) (789 :: Int, 799)) -- ok ] ] ```

Conclusion:

This is not a bug per se, but I think it is worth to add a warning about excessive inlining in the documentation.
Is there a systematic way to avoid this excessive inlining? Or do we have to check every time the Core output?

@harendra-kumar

Bodigrim commented 1 year ago

Interesting, thanks. Did you try to find out which part of -O2 triggers this behavior? Does it have something to do with -fdicts-strict?

A documentation improvent would be welcome.

I imagine noinline is a systematic way to fix, depending on what exactly you want to benchmark. Often you do want to inline f itself, you just do not want it to be pre-evaluated.

wismill commented 1 year ago

So, I removed the default -fdicts-strict (implied by -O2 anyway) and applied separately the following flags to cancel some optimisations from -O2:

-fno-dicts-strict: no significant change
-fno-asm-shortcutting: no significant change
-fno-stg-lift-lams: no significant change
-fno-liberate-case: no significant change
-fno-fast-pap-calls: no significant change
-fno-spec-constr : significant change: act as if -O1 was used.

So it is -fspec-constr that trigger the issue. But using -fno-spec-constr is a hammer, as some optimisations do work with respect to benchmark.

wismill commented 1 year ago

@Bodigrim I am stuck on this one. Any idea how to investigate further?

Bodigrim commented 1 year ago

Tactically you can try tasty-bench < 0.3.4.

Could you possibly share your unminimized motivational example? I'd like to understand how small a function should be to get completely eliminated.

wismill commented 1 year ago

@Bodigrim The code is located in unicode-data bench directory. It was first added in this PR. Since it has been reviewed by experienced Haskellers (including you 😉), I think the issue is not trivial.

I think the issue comes down to the following function:

{-# INLINE foldString #-}
foldString :: forall a. (NFData a) => (Char -> a) -> String -> ()
foldString f = foldr (deepseq . f) ()

-- foldString is used to benchmark a function over a list.
-- The (simplified) benchmark code is:
bench "xxx" (nf (foldString f cs))
-- where cs is a precomputated String

Instead it should probably be more like the following, in order to ensure f is always evaluated on each list item:

{-# INLINE foldString #-}
foldString :: forall a. (NFData a) => (Char -> a) -> a -> String -> a
foldString f = foldr (\c -> (`deepseq` f c))

The drawback is that one must provide a default value.

I think we should document tasty-bench on how to bench a function over a Foldable, avoiding pitfall such as depending on function size.

Bodigrim commented 1 year ago

That's all pretty annoying and AFAICT caused by my fix of https://github.com/Bodigrim/tasty-bench/issues/44.

I am open to improving documentation, but I'm not sure what to say beyond plain nf (map f . range) (lo, hi). What's the reason you invoke deepseq inside of nf?

A more fundamental fix is that tasty-bench can measure how long an empty loop takes. If a benchmark measurement is close to an empty loop, we can throw a warning that something's wrong.

wismill commented 1 year ago

I'm not sure what to say beyond plain nf (map f . range) (lo, hi). What's the reason you invoke deepseq inside of nf?

The issue with nf (map f . range) (lo, hi) in general is that the list is built.

In the original bench code, using foldr with range allowed to avoid this issue. I used deepseq to force each range item. This item may be a list or an enum member. But in our current code a list is actually built using env, because we filter out some Unicode code points and this should not be part of the benchmark. Still, nf (map f) cs allocates a new list. I think using foldr in some way should allow us to focus on the function being benchmarked.

A more fundamental fix is that tasty-bench can measure how long an empty loop takes. If a benchmark measurement is close to an empty loop, we can throw a warning that something's wrong.

You mean to check this systematically? That would be excellent to avoid this trap! Would it be complex to implement?

Bodigrim commented 1 year ago

Still, nf (map f) cs allocates a new list.

This sounds like a manifestation of https://github.com/Bodigrim/tasty-bench/issues/39. I wonder whether we should change the definition of nf as suggested in there.

You mean to check this systematically? That would be excellent to avoid this trap! Would it be complex to implement?

It should be doable. Each time when we run measure on an actual Benchmarkable, measure the same number of iterations for a dummy Benchmarkable which does nothing. If results are close (e. g., actual measurement is less than 3x of a dummy one), it's wonky and unreliable. This heuristic however might need some tuning, one has to experiment a bit.

Bodigrim commented 1 year ago

It should be doable.

Not really unfortunately. Loops in bgroup "range (rnf on function)" are not completely empty: they do iterate over range, just do nothing else.

Bodigrim commented 1 year ago

https://github.com/Bodigrim/tasty-bench/blob/8b9f5850d1396d4be3eff0dd882661ec6b2f85ca/src/Test/Tasty/Bench.hs#L1313-L1318

AFAIU what happens is that under certain aggressive optimization flags GHC is able to see that

frc (f x) cannot throw an exception,
neither frc (f x) nor f x are referenced elsewhere, so no point in forcing the thunk.

If both conditions hold GHC just throws away _ <- evaluate (frc (f x)) and spins an empty loop.

Bodigrim commented 1 year ago

The issue with nf (map f . range) (lo, hi) in general is that the list is built.

Ideally this should just work. One problem is that unsaturated application of range will not inline. It's better to write nf (\x -> map f [lo..x]) hi. Another problem is that deepseq for lists does not fuse, this is tracked in https://github.com/haskell/deepseq/pull/99 now.

Bodigrim commented 6 months ago

I extended the documentation a bit in eb9aa0e, but struggle to give a better advice. Feel free to reopen if there are more suggestions / ideas.

wismill commented 3 months ago

@Bodigrim FYI: based on your insights here and in #56, I chose to use pinned ByteArray# and noinline:

Share ByteArray# in benchmarks groups instead of creating lists, to reduce pressure on GC.
Use noinline to avoid excessive inlining. I went for using it as a big Hammer for all functions, instead of checking which function requires it. So it disadvantage some functions, but all in all this simplify maintenance, else we would have to check the requirement every time.

Ideally we would have a detector for excessive inlining. AFAIK there are 3 directions:

From your comment above, could we simulate that the function can raise an exception to avoid the empty loop?
Provide to tasty-bench a reference function and a factor k in addition to the real function to bench. That real function should run at least k times slower than the ref function, similar to --fail-if-faster. Wait, I just realize we have bcompareWithin! But this seems like a hack that would require to maintain both loops in sync and add dummy bench to the results.
Do it the hard way: check the generated Core automatically. I have not investigated this yet; I know some projects check rules firing (text to check fusion), so there may be a compiler plugin or something that may help here with inlining.

Well, there is another possibility: control inlining with pragmas and phases control.

That’s a lot to investigate for me, but it seems worth in order to understand and control GHC optimizations framework. Any pointer would be welcome!

Bodigrim commented 3 months ago

Core plugins are able to inspect Core, but it's not clear to me what pattern to look for. Maybe just print Core for manual inspection?.. In such case inspection-testing would do.

If bcompareWithin can do the job, I'd rather not introduce additional mechanisms.

From your comment above, could we simulate that the function can raise an exception to avoid the empty loop?

It would be better to talk to someone more knowledgeable than me, I'm far from expert in GHC optimizations. But even if my analysis there was correct, introducing an inaccessible exception could affect measurements for benchmarks, unaffected by the problem.

Bodigrim / tasty-bench

Excessive inlining may optimize away the function to benchmark #48