clash-lang / clash-compiler

Haskell to VHDL/Verilog/SystemVerilog compiler
https://clash-lang.org/
Other
1.44k stars 154 forks source link

Derived `Applicative` instance causes synthesis speed to blow up #2814

Open gergoerdi opened 1 month ago

gergoerdi commented 1 month ago

In the following Clash circuit, the MATRIX_APPLICATIVE_COERCE preprocessor flag enables an Applicative instance for the included Matrix type that is equivalent to what one would get with deriving via (Compose (Vec n) (Vec m)). With the flag disabled, the pure implementation is still the same as the derived one would be, but (<*>) is implemented directly as liftA2 (<*>).

With liftA2 (<*>), synthesis finishes in less than 4 seconds on my notebook. With the coerce-based implementation, it finishes in 1m46s. And this, of course, is a cut-down version of a much more complex circuit where the synthesis time goes from ~1 minute to something that I didn't even bother waiting out.

{-# LANGUAGE DerivingStrategies, DerivingVia, GeneralizedNewtypeDeriving #-}
{-# LANGUAGE UndecidableInstances #-}
{-# LANGUAGE InstanceSigs, StandaloneDeriving #-}
{-# LANGUAGE CPP #-}
{-# OPTIONS -Wunused-binds -Wunused-imports #-}
{-# OPTIONS -fconstraint-solver-iterations=5 #-}
module Sudoku where

#define MATRIX_APPLICATIVE_COERCE 0

import Clash.Prelude hiding (lift)
import Clash.Annotations.TH

import Data.Functor.Compose
import Data.Coerce (coerce)

newtype Matrix n m a = FromRows{ matrixRows :: Vec n (Vec m a) }
    deriving stock (Generic)
    deriving newtype (NFDataX, BitPack)
    deriving (Functor) via Compose (Vec n) (Vec m)
    -- deriving (Applicative) via Compose (Vec n) (Vec m)

instance (KnownNat n, KnownNat m) => Applicative (Matrix n m) where
    {-# INLINE pure #-}
    pure :: forall a. a -> Matrix n m a
    pure = coerce (pure @(Compose (Vec n) (Vec m)) @a)

    {-# INLINE (<*>) #-}
    (<*>) :: forall a b. Matrix n m (a -> b) -> Matrix n m a -> Matrix n m b
#if MATRIX_APPLICATIVE_COERCE
    (<*>) = coerce $ (<*>) @(Compose (Vec n) (Vec m)) @a @b
#else
    mf <*> mx = FromRows $ liftA2 (<*>) (matrixRows mf) (matrixRows mx)
#endif

toRowMajorOrder :: Matrix n m a -> Vec (n * m) a
toRowMajorOrder = concat . matrixRows

newtype Grid n m a = Grid{ getGrid :: Matrix n m (Matrix m n a) }
    deriving stock (Generic)
    deriving newtype (NFDataX)
    deriving (Functor, Applicative) via Compose (Matrix n m) (Matrix m n)

deriving newtype instance (KnownNat n, KnownNat m, BitPack a) => BitPack (Grid n m a)

flattenGrid :: Grid n m a -> Vec (n * m * m * n) a
flattenGrid = concat . toRowMajorOrder . fmap toRowMajorOrder . getGrid

headGrid :: forall n m a. (KnownNat n, KnownNat m, 1 <= n * m * m * n) => Grid n m a -> a
headGrid = head @(n * m * m * n - 1) . flattenGrid

type Solvable n m = (KnownNat n, KnownNat m, 1 <= n * m * m * n)

propagator
    :: forall n m dom. (Solvable n m, HiddenClockResetEnable dom)
    => SNat n -> SNat m -> Signal dom ()
propagator _ _ = headGrid units
  where
    units :: Grid n m (Signal dom ())
    units = pure unit
        <*> pure (pure ())
        <*> pure (pure ())
        <*> pure (pure ())
        <*> pure (pure ())

    unit
        :: Signal dom ()
        -> Signal dom ()
        -> Signal dom ()
        -> Signal dom ()
        -> Signal dom ()
    unit _ _ _ _ = cell
      where
        cell = register () cell

topEntity
    :: "CLK" ::: Clock System
    -> "RESET" ::: Reset System
    -> "OUT" ::: Signal System ()
topEntity clk rst = withClockResetEnable clk rst enableGen $
  propagator d2 d2

makeTopEntity 'topEntity
gergoerdi commented 1 month ago

Even simpler:

{-# LANGUAGE BlockArguments, ViewPatterns, MultiWayIf, RecordWildCards, ApplicativeDo #-}
{-# LANGUAGE DerivingStrategies, DerivingVia, GeneralizedNewtypeDeriving #-}
{-# LANGUAGE UndecidableInstances #-}
{-# LANGUAGE InstanceSigs, StandaloneDeriving #-}
{-# LANGUAGE CPP #-}
{-# OPTIONS -Wunused-binds -Wunused-imports #-}
{-# OPTIONS -fconstraint-solver-iterations=5 #-}
module Sudoku (topEntity) where

#define MATRIX_APPLICATIVE_COERCE 1

import Clash.Prelude hiding (lift)
import Clash.Annotations.TH

import Data.Functor.Compose
import Data.Coerce (coerce)

newtype Matrix n m a = FromRows{ matrixRows :: Vec n (Vec m a) }
    deriving stock (Generic)
    deriving newtype (NFDataX, BitPack)
    deriving (Functor) via Compose (Vec n) (Vec m)
    -- deriving (Applicative) via Compose (Vec n) (Vec m)

instance (KnownNat n, KnownNat m) => Applicative (Matrix n m) where
    {-# INLINE pure #-}
    pure :: forall a. a -> Matrix n m a
    pure = coerce (pure @(Compose (Vec n) (Vec m)) @a)

    {-# INLINE (<*>) #-}
    (<*>) :: forall a b. Matrix n m (a -> b) -> Matrix n m a -> Matrix n m b
#if MATRIX_APPLICATIVE_COERCE
    (<*>) = coerce $ (<*>) @(Compose (Vec n) (Vec m)) @a @b
#else
    mf <*> mx = FromRows $ zipWith (<*>) (matrixRows mf) (matrixRows mx)
#endif

headMatrix :: forall n m a. (KnownNat n, KnownNat m, 1 <= n, 1 <= m) => Matrix n m a -> a
headMatrix = head @(m - 1) . head @(n - 1) . matrixRows

type Solvable n m = (KnownNat n, KnownNat m, 1 <= n, 1 <= m)

propagator
    :: forall n m dom. (Solvable n m, HiddenClockResetEnable dom)
    => SNat n -> SNat m -> Signal dom ()
propagator _ _ = headMatrix . headMatrix . getCompose $ units
  where
    units :: Compose (Matrix n m) (Matrix m n) (Signal dom ())
    units = pure unit
        <*> pure (pure ())
        <*> pure (pure ())
        <*> pure (pure ())
        <*> pure (pure ())

    unit
        :: Signal dom ()
        -> Signal dom ()
        -> Signal dom ()
        -> Signal dom ()
        -> Signal dom ()
    unit _ _ _ _ = cell
      where
        cell = register () cell

topEntity
    :: "CLK" ::: Clock System
    -> "RESET" ::: Reset System
    -> "OUT" ::: Signal System ()
topEntity clk rst = withClockResetEnable clk rst enableGen $
  propagator d2 d2

makeTopEntity 'topEntity
christiaanb commented 1 month ago

More of an FYI: On GHC 9.10 both are equally fast. With the coerce based version actually requiring fewer transformations.

christiaanb commented 1 month ago

On GHC 9.4, commenting out https://github.com/clash-lang/clash-compiler/blob/3b755b900810c727833875b1a15222d69d37029e/clash-prelude/src/Clash/Sized/Vector.hs#L326-L328 also makes the coerce based version go through. Achieving a similar number of transformations, and similar speed, as we do on GHC 9.10. Note that on GHC 9.10 the above rule never fires.

christiaanb commented 1 month ago

Something to check: perhaps it's due to loss of sharing in the reduceNonRepPrim transformation.

gergoerdi commented 1 month ago

I would love to move on GHC 9.10, but I'm out of the loop on where Clash is on GHC compatibility. I thought 9.4 or 9.6 was the last working version?

DigitalBrains1 commented 1 month ago

We have a README.md in the root of the repo with a section GHC compatibility. Both Clash 1.8 and master support GHC 9.8. In fact, for master we also run CI for GHC 9.10, so it has some support. Perhaps we just forgot to update that table to list GHC 9.10.

It would appear the READMEs on the 1.8 branch and the v1.8.1 tag are outdated; that's unfortunate. But the one on the master branch is fairly reliable.

gergoerdi commented 1 month ago

Ah, right, it was the Cabal/Stack issue of not picking up the Nat plugins that was my issue earlier.