composewell / streamly

High performance, concurrent functional programming abstractions
https://streamly.composewell.com
Other
849 stars 63 forks source link

Document to users how to detect/fix fusion breakage #2747

Open shlok opened 1 month ago

shlok commented 1 month ago

Thanks as always for a great library. I came across a thing I'd like to discuss.

Example (my setup: streamly-0.10.1; streamly-core-0.2.2; GHC 9.4.8):

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Control.Monad
import Data.Function
import qualified Streamly.Data.Fold as F
import qualified Streamly.Data.Stream.Prelude as S
import Text.Read

main :: IO ()
main = do
  y :: Bool <- getLine >>= maybe (error "invalid input") return <$> readMaybe
  z :: Bool <- getLine >>= maybe (error "invalid input") return <$> readMaybe
  w :: Bool <- getLine >>= maybe (error "invalid input") return <$> readMaybe
  void $
    S.fromList @IO [1 :: Int ..]
      & S.take 10
      & S.mapM
        ( \i -> do
            print i
            return $ i + 77 -- (Feel free to ignore; used in additions below.)
        )
      & S.fold (fld y z w)

-- (Feel free to ignore; if we want we can search for e.g. "1076#" in the core dump.)
{-# INLINE fld #-}
fld :: (Monad m) => Bool -> Bool -> Bool -> F.Fold m Int Int
fld False False False = F.foldl' (\acc x -> acc + x + 111) 0 -- ;111+77=188
fld False False True = F.foldl' (\acc x -> acc + x + 222) 0 -- ;222+77=299
fld False True False = F.foldl' (\acc x -> acc + x + 333) 0 -- ;333+77=410
fld False True True = F.foldl' (\acc x -> acc + x + 444) 0 -- ;444+77=521
fld True False False = F.foldl' (\acc x -> acc + x + 555) 0 -- ;555+77=632
fld True False True = F.foldl' (\acc x -> acc + x + 777) 0 -- ;777+77=854
fld True True False = F.foldl' (\acc x -> acc + x + 888) 0 -- ;888+77=965
fld True True True = F.foldl' (\acc x -> acc + x + 999) 0 -- ;999+77=1076

Issue:

Proposed solution (feel free to provide better ones):

harendra-kumar commented 4 weeks ago

Did you use fusion-plugin? Almost all fusion issues are taken care of by the fusion-plugin and by inlining all the functions that are part of the fused loop. The compilation guide recommends using the plugin for fusion.

fusion-plugin has several verbosity levels and it reports all the cases where fusion breaks. It also has a feature to print outputs from each optimization pass in separate files to compare the outputs pass by pass, but this has been broken in recent compilers.

Some guidelines are provided in the optimization guide.

Let me know if your problem is not covered by these or if there are improvements required to the docs, or discoverability of the docs can be improved somehow.

shlok commented 1 week ago

@harendra-kumar Thanks; I just tried. It doesn't look like the plugin makes a difference (unless of course I'm somehow just using it wrong).

I used GHC 9.4.8 (to make sure fusion-plugin's core dump feature works) and used these compiler options as instructed:

-Wall
-O2
-fdicts-strict
-fmax-worker-args=16
-fspec-constr-recursive=16
-fplugin=Fusion.Plugin
-fplugin-opt=Fusion.Plugin:dump-core

This is how I build my original example (with the & S.take 10 line that caused fusion to break): cabal clean && cabal build > ./core-dumps.txt.

I still see lots of Steps in the final main function.

Next I add -funfolding-use-threshold=1000 to the compiler options, and build like this: cabal clean && cabal build > ./core-dumps-with-extra-compiler-option.txt

Now the Steps in the final main functions are gone.

So -funfolding-use-threshold=1000 seems to be needed, with or without the plugin.

I have attached the two files (only the final step to make the files smaller) so you can see the difference:

core-dumps.txt core-dumps-with-extra-compiler-option.txt