Closed wismill closed 6 months ago
Interesting, thanks. Did you try to find out which part of -O2
triggers this behavior? Does it have something to do with -fdicts-strict
?
A documentation improvent would be welcome.
I imagine noinline
is a systematic way to fix, depending on what exactly you want to benchmark. Often you do want to inline f
itself, you just do not want it to be pre-evaluated.
So, I removed the default -fdicts-strict
(implied by -O2
anyway) and applied separately the following flags to cancel some optimisations from -O2
:
-fno-dicts-strict
: no significant change-fno-asm-shortcutting
: no significant change-fno-stg-lift-lams
: no significant change-fno-liberate-case
: no significant change-fno-fast-pap-calls
: no significant change-fno-spec-constr
: significant change: act as if -O1
was used.So it is -fspec-constr
that trigger the issue. But using -fno-spec-constr
is a hammer, as some optimisations do work with respect to benchmark.
@Bodigrim I am stuck on this one. Any idea how to investigate further?
Tactically you can try tasty-bench < 0.3.4
.
Could you possibly share your unminimized motivational example? I'd like to understand how small a function should be to get completely eliminated.
@Bodigrim The code is located in unicode-data
bench directory. It was first added in this PR. Since it has been reviewed by experienced Haskellers (including you 😉), I think the issue is not trivial.
I think the issue comes down to the following function:
{-# INLINE foldString #-}
foldString :: forall a. (NFData a) => (Char -> a) -> String -> ()
foldString f = foldr (deepseq . f) ()
-- foldString is used to benchmark a function over a list.
-- The (simplified) benchmark code is:
bench "xxx" (nf (foldString f cs))
-- where cs is a precomputated String
Instead it should probably be more like the following, in order to ensure f
is always evaluated on each list item:
{-# INLINE foldString #-}
foldString :: forall a. (NFData a) => (Char -> a) -> a -> String -> a
foldString f = foldr (\c -> (`deepseq` f c))
The drawback is that one must provide a default value.
I think we should document tasty-bench
on how to bench a function over a Foldable
, avoiding pitfall such as depending on function size.
That's all pretty annoying and AFAICT caused by my fix of https://github.com/Bodigrim/tasty-bench/issues/44.
I am open to improving documentation, but I'm not sure what to say beyond plain nf (map f . range) (lo, hi)
. What's the reason you invoke deepseq
inside of nf
?
A more fundamental fix is that tasty-bench
can measure how long an empty loop takes. If a benchmark measurement is close to an empty loop, we can throw a warning that something's wrong.
I'm not sure what to say beyond plain
nf (map f . range) (lo, hi)
. What's the reason you invokedeepseq
inside ofnf
?
The issue with nf (map f . range) (lo, hi)
in general is that the list is built.
In the original bench code, using foldr
with range
allowed to avoid this issue. I used deepseq
to force each range item. This item may be a list or an enum member. But in our current code a list is actually built using env
, because we filter out some Unicode code points and this should not be part of the benchmark. Still, nf (map f) cs
allocates a new list. I think using foldr
in some way should allow us to focus on the function being benchmarked.
A more fundamental fix is that tasty-bench can measure how long an empty loop takes. If a benchmark measurement is close to an empty loop, we can throw a warning that something's wrong.
You mean to check this systematically? That would be excellent to avoid this trap! Would it be complex to implement?
Still,
nf (map f) cs
allocates a new list.
This sounds like a manifestation of https://github.com/Bodigrim/tasty-bench/issues/39. I wonder whether we should change the definition of nf
as suggested in there.
You mean to check this systematically? That would be excellent to avoid this trap! Would it be complex to implement?
It should be doable. Each time when we run measure
on an actual Benchmarkable
, measure the same number of iterations for a dummy Benchmarkable
which does nothing. If results are close (e. g., actual measurement is less than 3x of a dummy one), it's wonky and unreliable. This heuristic however might need some tuning, one has to experiment a bit.
It should be doable.
Not really unfortunately. Loops in bgroup "range (rnf on function)"
are not completely empty: they do iterate over range
, just do nothing else.
AFAIU what happens is that under certain aggressive optimization flags GHC is able to see that
frc (f x)
cannot throw an exception,frc (f x)
nor f x
are referenced elsewhere, so no point in forcing the thunk.If both conditions hold GHC just throws away _ <- evaluate (frc (f x))
and spins an empty loop.
The issue with
nf (map f . range) (lo, hi)
in general is that the list is built.
Ideally this should just work. One problem is that unsaturated application of range
will not inline. It's better to write nf (\x -> map f [lo..x]) hi
. Another problem is that deepseq
for lists does not fuse, this is tracked in https://github.com/haskell/deepseq/pull/99 now.
I extended the documentation a bit in eb9aa0e, but struggle to give a better advice. Feel free to reopen if there are more suggestions / ideas.
@Bodigrim FYI: based on your insights here and in #56, I chose to use pinned ByteArray#
and noinline
:
ByteArray#
in benchmarks groups instead of creating lists, to reduce pressure on GC.noinline
to avoid excessive inlining. I went for using it as a big Hammer for all functions, instead of checking which function requires it. So it disadvantage some functions, but all in all this simplify maintenance, else we would have to check the requirement every time.Ideally we would have a detector for excessive inlining. AFAIK there are 3 directions:
tasty-bench
a reference function and a factor k
in addition to the real function to bench. That real function should run at least k
times slower than the ref function, similar to --fail-if-faster
. Wait, I just realize we have bcompareWithin
! But this seems like a hack that would require to maintain both loops in sync and add dummy bench to the results.text
to check fusion), so there may be a compiler plugin or something that may help here with inlining.Well, there is another possibility: control inlining with pragmas and phases control.
That’s a lot to investigate for me, but it seems worth in order to understand and control GHC optimizations framework. Any pointer would be welcome!
Core plugins are able to inspect Core, but it's not clear to me what pattern to look for. Maybe just print Core for manual inspection?.. In such case inspection-testing
would do.
If bcompareWithin
can do the job, I'd rather not introduce additional mechanisms.
From your comment above, could we simulate that the function can raise an exception to avoid the empty loop?
It would be better to talk to someone more knowledgeable than me, I'm far from expert in GHC optimizations. But even if my analysis there was correct, introducing an inaccessible exception could affect measurements for benchmarks, unaffected by the problem.
I realized that it is really easy to be fooled by excessive inlining.
The following file illustrates this using GHC 9.4.5:
-O1
, then every function is benched normally.-O2
, then some functions are optimized away (see comments). This is visible inCore
output.The functions are trivial, but I got this issue with less trivial functions in
unicode-data
.
```haskell #!/usr/bin/env cabal -- Make this executable and run it with ./Bench.hs {- cabal: build-depends: base >= 4.16 && < 4.19, deepseq >= 1.4 && < 1.5, tasty-bench >= 0.3.4 && < 0.4, ghc-options: -O2 -Wall -fdicts-strict -rtsopts -with-rtsopts=-A32m -fproc-alignment=64 -ddump-simpl -ddump-stg-final -ddump-cmm -ddump-to-file -dsuppress-idinfo -dsuppress-coercions -dsuppress-type-applications -} module Main where import Test.Tasty.Bench (defaultMain, bench, nf, bgroup) import Control.DeepSeq (NFData (..), deepseq) import Data.Ix (Ix(..)) newtype MyInt = MyInt Int instance NFData MyInt where {-# NOINLINE rnf #-} rnf (MyInt a) = rnf a {-# INLINE f #-} f :: Int -> Int f cp = if cp < 790 then negate cp else cp main :: IO () main = defaultMain [ bgroup "single value" [ bench "negate Int" (nf negate (789 :: Int)) -- not ok: rnf & negate too small , bench "negate MyInt" (nf (MyInt . negate) (789 :: Int)) -- ok: rnf not inlined , bench "negate maybe" (nf (Just . negate) (789 :: Int)) -- not ok: rnf & negate too small , bench "negate list" (nf ((:[]) . negate) (789 :: Int)) -- ok: rnf big enough , bench "gcd" (nf (gcd 123) (789 :: Int)) -- ok: gcd big enough , bench "f" (nf f (789 :: Int)) -- ok: f big enough ] , bgroup "range (rnf on function)" [ bench "negate MyInt" (nf (foldr (\n -> deepseq (MyInt (negate n))) () . range) (789 :: Int, 799)) -- not ok , bench "gcd" (nf (foldr (\n -> deepseq (gcd n 123)) () . range) (789 :: Int, 799)) -- ok , bench "f" (nf (foldr (\c -> deepseq (f c)) () . range) (789 :: Int, 799)) -- not ok ] , bgroup "range (rnf on accumulator)" [ bench "negate MyInt" (nf (foldr (\n -> (`deepseq` (MyInt (negate n)))) (MyInt minBound) . range) (789 :: Int, 799)) -- ok , bench "gcd" (nf (foldr (\n -> (`deepseq` (gcd n 123))) minBound . range) (789 :: Int, 799)) -- ok , bench "f" (nf (foldr (\c -> (`deepseq` (f c))) minBound . range) (789 :: Int, 799)) -- ok ] ] ```Bench.hs
Conclusion:
Core
output?@harendra-kumar