Bodigrim / tasty-bench

Featherlight benchmark framework, drop-in replacement for criterion and gauge.
https://hackage.haskell.org/package/tasty-bench
MIT License
80 stars 11 forks source link

Excessive inlining may optimize away the function to benchmark #48

Closed wismill closed 6 months ago

wismill commented 1 year ago

I realized that it is really easy to be fooled by excessive inlining.

The following file illustrates this using GHC 9.4.5:

The functions are trivial, but I got this issue with less trivial functions in unicode-data.

Bench.hs ```haskell #!/usr/bin/env cabal -- Make this executable and run it with ./Bench.hs {- cabal: build-depends: base >= 4.16 && < 4.19, deepseq >= 1.4 && < 1.5, tasty-bench >= 0.3.4 && < 0.4, ghc-options: -O2 -Wall -fdicts-strict -rtsopts -with-rtsopts=-A32m -fproc-alignment=64 -ddump-simpl -ddump-stg-final -ddump-cmm -ddump-to-file -dsuppress-idinfo -dsuppress-coercions -dsuppress-type-applications -} module Main where import Test.Tasty.Bench (defaultMain, bench, nf, bgroup) import Control.DeepSeq (NFData (..), deepseq) import Data.Ix (Ix(..)) newtype MyInt = MyInt Int instance NFData MyInt where {-# NOINLINE rnf #-} rnf (MyInt a) = rnf a {-# INLINE f #-} f :: Int -> Int f cp = if cp < 790 then negate cp else cp main :: IO () main = defaultMain [ bgroup "single value" [ bench "negate Int" (nf negate (789 :: Int)) -- not ok: rnf & negate too small , bench "negate MyInt" (nf (MyInt . negate) (789 :: Int)) -- ok: rnf not inlined , bench "negate maybe" (nf (Just . negate) (789 :: Int)) -- not ok: rnf & negate too small , bench "negate list" (nf ((:[]) . negate) (789 :: Int)) -- ok: rnf big enough , bench "gcd" (nf (gcd 123) (789 :: Int)) -- ok: gcd big enough , bench "f" (nf f (789 :: Int)) -- ok: f big enough ] , bgroup "range (rnf on function)" [ bench "negate MyInt" (nf (foldr (\n -> deepseq (MyInt (negate n))) () . range) (789 :: Int, 799)) -- not ok , bench "gcd" (nf (foldr (\n -> deepseq (gcd n 123)) () . range) (789 :: Int, 799)) -- ok , bench "f" (nf (foldr (\c -> deepseq (f c)) () . range) (789 :: Int, 799)) -- not ok ] , bgroup "range (rnf on accumulator)" [ bench "negate MyInt" (nf (foldr (\n -> (`deepseq` (MyInt (negate n)))) (MyInt minBound) . range) (789 :: Int, 799)) -- ok , bench "gcd" (nf (foldr (\n -> (`deepseq` (gcd n 123))) minBound . range) (789 :: Int, 799)) -- ok , bench "f" (nf (foldr (\c -> (`deepseq` (f c))) minBound . range) (789 :: Int, 799)) -- ok ] ] ```

Conclusion:

@harendra-kumar

Bodigrim commented 1 year ago

Interesting, thanks. Did you try to find out which part of -O2 triggers this behavior? Does it have something to do with -fdicts-strict?

A documentation improvent would be welcome.

I imagine noinline is a systematic way to fix, depending on what exactly you want to benchmark. Often you do want to inline f itself, you just do not want it to be pre-evaluated.

wismill commented 1 year ago

So, I removed the default -fdicts-strict (implied by -O2 anyway) and applied separately the following flags to cancel some optimisations from -O2:

So it is -fspec-constr that trigger the issue. But using -fno-spec-constr is a hammer, as some optimisations do work with respect to benchmark.

wismill commented 1 year ago

@Bodigrim I am stuck on this one. Any idea how to investigate further?

Bodigrim commented 1 year ago

Tactically you can try tasty-bench < 0.3.4.

Could you possibly share your unminimized motivational example? I'd like to understand how small a function should be to get completely eliminated.

wismill commented 1 year ago

@Bodigrim The code is located in unicode-data bench directory. It was first added in this PR. Since it has been reviewed by experienced Haskellers (including you 😉), I think the issue is not trivial.

I think the issue comes down to the following function:

{-# INLINE foldString #-}
foldString :: forall a. (NFData a) => (Char -> a) -> String -> ()
foldString f = foldr (deepseq . f) ()

-- foldString is used to benchmark a function over a list.
-- The (simplified) benchmark code is:
bench "xxx" (nf (foldString f cs))
-- where cs is a precomputated String

Instead it should probably be more like the following, in order to ensure f is always evaluated on each list item:

{-# INLINE foldString #-}
foldString :: forall a. (NFData a) => (Char -> a) -> a -> String -> a
foldString f = foldr (\c -> (`deepseq` f c))

The drawback is that one must provide a default value.

I think we should document tasty-bench on how to bench a function over a Foldable, avoiding pitfall such as depending on function size.

Bodigrim commented 1 year ago

That's all pretty annoying and AFAICT caused by my fix of https://github.com/Bodigrim/tasty-bench/issues/44.

I am open to improving documentation, but I'm not sure what to say beyond plain nf (map f . range) (lo, hi). What's the reason you invoke deepseq inside of nf?

A more fundamental fix is that tasty-bench can measure how long an empty loop takes. If a benchmark measurement is close to an empty loop, we can throw a warning that something's wrong.

wismill commented 1 year ago

I'm not sure what to say beyond plain nf (map f . range) (lo, hi). What's the reason you invoke deepseq inside of nf?

The issue with nf (map f . range) (lo, hi) in general is that the list is built.

In the original bench code, using foldr with range allowed to avoid this issue. I used deepseq to force each range item. This item may be a list or an enum member. But in our current code a list is actually built using env, because we filter out some Unicode code points and this should not be part of the benchmark. Still, nf (map f) cs allocates a new list. I think using foldr in some way should allow us to focus on the function being benchmarked.

A more fundamental fix is that tasty-bench can measure how long an empty loop takes. If a benchmark measurement is close to an empty loop, we can throw a warning that something's wrong.

You mean to check this systematically? That would be excellent to avoid this trap! Would it be complex to implement?

Bodigrim commented 1 year ago

Still, nf (map f) cs allocates a new list.

This sounds like a manifestation of https://github.com/Bodigrim/tasty-bench/issues/39. I wonder whether we should change the definition of nf as suggested in there.

You mean to check this systematically? That would be excellent to avoid this trap! Would it be complex to implement?

It should be doable. Each time when we run measure on an actual Benchmarkable, measure the same number of iterations for a dummy Benchmarkable which does nothing. If results are close (e. g., actual measurement is less than 3x of a dummy one), it's wonky and unreliable. This heuristic however might need some tuning, one has to experiment a bit.

Bodigrim commented 1 year ago

It should be doable.

Not really unfortunately. Loops in bgroup "range (rnf on function)" are not completely empty: they do iterate over range, just do nothing else.

Bodigrim commented 1 year ago

https://github.com/Bodigrim/tasty-bench/blob/8b9f5850d1396d4be3eff0dd882661ec6b2f85ca/src/Test/Tasty/Bench.hs#L1313-L1318

AFAIU what happens is that under certain aggressive optimization flags GHC is able to see that

If both conditions hold GHC just throws away _ <- evaluate (frc (f x)) and spins an empty loop.

Bodigrim commented 1 year ago

The issue with nf (map f . range) (lo, hi) in general is that the list is built.

Ideally this should just work. One problem is that unsaturated application of range will not inline. It's better to write nf (\x -> map f [lo..x]) hi. Another problem is that deepseq for lists does not fuse, this is tracked in https://github.com/haskell/deepseq/pull/99 now.

Bodigrim commented 6 months ago

I extended the documentation a bit in eb9aa0e, but struggle to give a better advice. Feel free to reopen if there are more suggestions / ideas.

wismill commented 3 months ago

@Bodigrim FYI: based on your insights here and in #56, I chose to use pinned ByteArray# and noinline:

Ideally we would have a detector for excessive inlining. AFAIK there are 3 directions:

Well, there is another possibility: control inlining with pragmas and phases control.

That’s a lot to investigate for me, but it seems worth in order to understand and control GHC optimizations framework. Any pointer would be welcome!

Bodigrim commented 3 months ago

Core plugins are able to inspect Core, but it's not clear to me what pattern to look for. Maybe just print Core for manual inspection?.. In such case inspection-testing would do.

If bcompareWithin can do the job, I'd rather not introduce additional mechanisms.

From your comment above, could we simulate that the function can raise an exception to avoid the empty loop?

It would be better to talk to someone more knowledgeable than me, I'm far from expert in GHC optimizations. But even if my analysis there was correct, introducing an inaccessible exception could affect measurements for benchmarks, unaffected by the problem.