gelisam / hawk

Haskell text processor for the command-line
Apache License 2.0
362 stars 20 forks source link

Type-checked expression manipulation #106

Closed gelisam closed 9 years ago

gelisam commented 10 years ago

wrapExpr and applyExpr both construct alternate user expressions based on the original user expression. I remember we had discussed that there must be a safer way to do this, using phantom types or template haskell or whatever.

gelisam commented 10 years ago

Branch issue-106 adds a simple system of phantom types to ensure that the user expression is manipulated in a type-correct way. The approach is limited, however, by the fact that we don't know in advance what the type of the user expression will be: at first, it could be anything, then we inspect the command (Eval, Apply or Map) and we learn a bit more, then we examine the InputFormat and we learn the final details. I made a table (s is ByteString, a is some instance of Rows, and xx means the combination is disallowed):

-d, -D; -D''
-a [[s]] -> a [s] -> a
-m [s] -> a xx
-e a a
-d'' -D; -D''
-a [s] -> a s -> a
-m s -> a xx
-e a a

But that's not how type inference works in Haskell: we can't gradually add constraints to the type of an argument as the values are examined, instead the argument has a particular concrete type, and each use of that argument adds zero or more constraints to that type. We cannot case-split on the flags to change the constraints we put on an argument, because all the branches of the case-split will add to the constraints, not just the one which ends up being followed.

I think there might still be a way to pull it off by using continuation-passing style...

gelisam commented 10 years ago

I think there might still be a way to pull it off by using continuation-passing style...

I no longer think it's going to work in our case, but here's what I had in mind.

Consider a simpler case, with the following table of types:

--list --tree
--string [String] Tree String
--int [Int] Tree Int

For this simplified problem, we're not trying to apply a user expression to an input stream, but simply to verify if the user expression has any empty values, where the definition of "empty" varies depending on the table cell we're in. Let's start with an untyped version, similar to Hawk's current implementation:

checkExprEmptiness1 :: [String] -> String -> IO Bool
checkExprEmptiness1 flags expr = interpret' (printf "%s %s (%s)" _any _isEmpty expr) (as :: Bool)
  where
    _any | "--list" `elem` flags = "any"
    _any | "--tree" `elem` flags = "anyLeaf"
    _any | otherwise = error "need to use --list or --tree"
    _isEmpty | "--string" `elem` flags = "null"
    _isEmpty | "--int"    `elem` flags = "(==0)"
    _isEmpty | otherwise = error "need to use --string or --int"

If we try to use HaskellExpr to give a precise type to those strings representing haskell expressions, then we get a type error, even if we say that the user expression can have any type:

checkExprEmptiness2 :: [String] -> (forall e. HaskellExpr e) -> IO Bool
checkExprEmptiness2 flags eExpr = interpretExpr eCheckEmptiness
  where
    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = eAnyAnywhere `eAp` eIsEmpty `eAp` eExpr

    eAnyAnywhere :: HaskellExpr ((a -> Bool) -> f a -> Bool)
    eAnyAnywhere | "--list" `elem` flags = eAny      -- HaskellExpr ((a -> Bool) -> [a] -> Bool)
    eAnyAnywhere | "--tree" `elem` flags = eAnyLeaf  -- HaskellExpr ((a -> Bool) -> Tree a -> Bool)
    eAnyAnywhere | otherwise = error "need to use --list or --tree"

    eIsEmpty :: HaskellExpr (a -> Bool)
    eIsEmpty | "--string" `elem` flags = eNull            -- HaskellExpr ([a] -> Bool)
    eIsEmpty | "--int"    `elem` flags = eEq `eAp` eZero  -- HaskellExpr (Int -> Bool)
    eIsEmpty | otherwise = error "need to use --string or --int"

We get type errors because the type signatures for eAnyAnywhere and eIsEmpty claim to work for any types f and a instead of saying that an appropriate f and a will be determined at runtime. It sounds like existential types should be the answer, but I had to jump through hoops and even after that I couldn't get them to work. Something about "skolem" type variables...

Instead, let's use continuation-passing-style to precisely distinguish the part of the type which is computed by eAnyAnywhere and eIsEmpty from the part of the type which, from their point of view, could be anything. Conversely, from the point of view of eCheckEmptiness, the part of the type which is computed by the helper functions is the part which could be anything.

checkExprEmptiness3 :: [String] -> (forall e. HaskellExpr e) -> IO Bool
checkExprEmptiness3 flags eExpr = interpretExpr eCheckEmptiness
  where
    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = eAnyAnywhere $ \eAA ->
                      eIsEmpty $ \eIE ->
                      eAA `eAp` eIE `eAp` eExpr

    eAnyAnywhere :: (forall f. (forall a. HaskellExpr ((a -> Bool) -> f a -> Bool))
                            -> r)
                 -> r
    eAnyAnywhere cc | "--list" `elem` flags = cc eAny
                    | "--tree" `elem` flags = cc eAnyLeaf
                    | otherwise = error "need to use --list or --tree"

    eIsEmpty :: (forall a. HaskellExpr (a -> Bool)
                        -> r)
             -> r
    eIsEmpty cc | "--string" `elem` flags = cc eNull
                | "--int"    `elem` flags = cc (eEq `eAp` eZero)
                | otherwise = error "need to use --string or --int"

The types are getting quite hairy, but it works! The most tricky part is that eAnyAnywhere needs two nested foralls with different meanings: the first forall means that the continuation must be willing to accept any f, whereas the second forall means that the value which eAnyAnywhere sends to this continuation is polymorphic in a.

gelisam commented 10 years ago

The above example worked fine, but only because the type table was very structured. Let's consider a slightly more complicated example.

--list --tree
--string String [String] Tree String
--int Int [Int] Tree Int

This table is more complicated because the entries no longer all have the form f a. This was an important feature, because eAnyAnywhere was only affecting the type variable f, while eIsEmpty was only affecting the type variable a. Can we divide the work even if there isn't one type variable per axis?

In this particular case, we can, at the expense of making the types even more complicated.

    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = eAnyAnywhere $ \eAA ePreprocess ->
                      eIsEmpty $ \eIE ->
                      eAA `eAp` eIE `eAp` (ePreprocess eExpr)

    eAnyAnywhere :: (forall f. (forall a. HaskellExpr ((a -> Bool) -> f a -> Bool))
                            -> (forall a. (forall e. HaskellExpr e) -> HaskellExpr (f a))
                            -> r)
                 -> r
    eAnyAnywhere cc | "--list" `elem` flags = cc eAny (\e -> e)
                    | "--tree" `elem` flags = cc eAnyLeaf (\e -> e)
                    | otherwise = cc eAnyIdentity (\e -> eIdentity `eAp` e)

My solution is to add an extra preprocessing step which forces the user expression to have the form f a, by wrapping the user expression in an Identity if the f is missing, and otherwise leaving the user expression alone. I need to use explicit lambdas because id doesn't handle foralls very well.

gelisam commented 10 years ago

The next step towards supporting a type table as complex as Hawk's is to introduce illegal cells. For example:

--list --tree
Int [Int] Tree Int
--reverse xx [Int] Tree Int

I don't know how to handle this case yet, and in fact, I'm not sure I want to. Sometimes, it's better to have an untyped solution than to have a precisely-typed solution with incomprehensible types!

gelisam commented 10 years ago

Let's figure out the precisely-typed-solution-with-incomprehensible-types anyway. For Science!

First, here is an alternate solution to the previous problem. Instead of forcing all the solutions into the same f a shape, we now use an explicit datatype detailing the different possible shapes (f a and a) which isAnyAnywhere can return, and which eCheckEmptiness must therefore support.

data EmptinessChecker f
  = ContainerChecker (forall a. HaskellExpr ((a -> Bool) -> f a -> Bool))
  | ValueChecker     (forall a. HaskellExpr ((a -> Bool) ->   a -> Bool))

checkExprEmptiness5 :: [String] -> (forall e. HaskellExpr e) -> IO Bool
checkExprEmptiness5 flags eExpr = interpretExpr eCheckEmptiness
  where
    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = eIsEmpty $ \eIE ->
                      eAnyAnywhere $ \case
                        ContainerChecker eAA -> eAA `eAp` eIE `eAp` eExpr
                        ValueChecker     eAA -> eAA `eAp` eIE `eAp` eExpr

    eAnyAnywhere :: (forall f. EmptinessChecker f
                            -> r)
                 -> r
    eAnyAnywhere cc | "--list" `elem` flags = cc (ContainerChecker eAny)
                    | "--tree" `elem` flags = cc (ContainerChecker eAnyLeaf)
                    | otherwise = cc (ValueChecker eId)

    eIsEmpty :: (forall a. HaskellExpr (a -> Bool)
                        -> r)
             -> r
    eIsEmpty cc | "--string" `elem` flags = cc eNull
                | "--int"    `elem` flags = cc (eEq `eAp` eZero)
                | otherwise = error "need to use --string or --int"

The cases for ContainerChecker and ValueChecker look identical, but the types of eAA are different and thus eExpr is instantiated at different types.

With such explicit datatypes describing the different possibilities, it should be no problem to detect incompatible combinations.

gelisam commented 10 years ago

Suppose that both [] and Tree are instances of a typeclass named Reversible. We might want to add this constraint to the type signature of eAnyAnywhere, so that eCheckEmptiness can assume it.

eAnyAnywhere :: (forall f. Reversible f => EmptinessChecker f -> r)
             -> r
eAnyAnywhere cc | "--list" `elem` flags = cc (ContainerChecker eAny)
                | "--tree" `elem` flags = cc (ContainerChecker eAnyLeaf)
                | otherwise = cc (ValueChecker eId)

Unfortunately, this triggers an error on the last line, because f is ambiguous.

Interestingly, f was also ambiguous in the previous example, which type-checked just fine. I think ghc usually doesn't care about ambiguous type variables, but that when a type constraint is present, now it does have to care because it has to produce a concrete dictionary instance to implement the constraint.

gelisam commented 10 years ago

Existential types to the rescue! For some reason, they do work this time.

data ContainerChecker = forall f. Reversible f =>
  ContainerChecker (forall a. HaskellExpr ((a -> Bool) -> f a -> Bool))

data ValueChecker =
  ValueChecker     (forall a. HaskellExpr ((a -> Bool) ->   a -> Bool))

checkExprEmptiness7 :: [String] -> (forall e. HaskellExpr e) -> IO Bool
checkExprEmptiness7 flags eExpr = interpretExpr eCheckEmptiness
  where
    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = eIsEmpty $ \eIE ->
                      eAnyAnywhere $ \case
                        Left (ContainerChecker eAA) -> eAA `eAp` eIE `eAp` eExpr
                        Right (ValueChecker    eAA) -> eAA `eAp` eIE `eAp` eExpr

    eAnyAnywhere :: (Either ContainerChecker ValueChecker -> r)
                 -> r
    eAnyAnywhere cc | "--list" `elem` flags = cc (Left (ContainerChecker eAny))
                    | "--tree" `elem` flags = cc (Left (ContainerChecker eAnyLeaf))
                    | otherwise = cc (Right (ValueChecker eId))

    eIsEmpty :: (forall a. HaskellExpr (a -> Bool)
                        -> r)
             -> r
    eIsEmpty cc | "--string" `elem` flags = cc eNull
                | "--int"    `elem` flags = cc (eEq `eAp` eZero)
                | otherwise = error "need to use --string or --int"

Instead of attaching f to the EmptinessChecker datatype and letting it be ambiguous in the ValueChecker case, we split EmptinessChecker into two subtypes. Again, there are two distinct uses of the forall keyword: the forall f. Reversible f says that there exists a reversible type f which the ContainerChecker can check, while the forall a's say that the ContainerChecker and the ValueChecker are polymorphic in a.

Now that the forall's have been moved out of eAnyAnywhere's type signature, there is no longer any need for continuation-passing-style. We can simplify the types a little bit!

data ContainerChecker = forall f. Reversible f =>
  ContainerChecker (forall a. HaskellExpr ((a -> Bool) -> f a -> Bool))

data ValueChecker =
  ValueChecker     (forall a. HaskellExpr ((a -> Bool) ->   a -> Bool))

data IsEmpty = forall a.
  IsEmpty (HaskellExpr (a -> Bool))

checkExprEmptiness8 :: [String] -> (forall e. HaskellExpr e) -> IO Bool
checkExprEmptiness8 flags eExpr = interpretExpr eCheckEmptiness
  where
    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = case eIsEmpty of
      IsEmpty eIE -> case eAnyAnywhere of
        Left (ContainerChecker eAA) -> eAA `eAp` eIE `eAp` eExpr
        Right (ValueChecker    eAA) -> eAA `eAp` eIE `eAp` eExpr

    eAnyAnywhere :: Either ContainerChecker ValueChecker
    eAnyAnywhere | "--list" `elem` flags = Left (ContainerChecker eAny)
                 | "--tree" `elem` flags = Left (ContainerChecker eAnyLeaf)
                 | otherwise = Right (ValueChecker eId)

    eIsEmpty :: IsEmpty
    eIsEmpty | "--string" `elem` flags = IsEmpty eNull
             | "--int"    `elem` flags = IsEmpty (eEq `eAp` eZero)
             | otherwise = error "need to use --string or --int"

Now all the complexity is at the top, inside the datatype declarations. And it's not even that complicated: it's some polymorphic type, wrapped in a HaskellExpr type constructor. Some of the type variables are determined by the code which constructs the value, and some of the type variables are determined by the code which uses the value. The type variables for those two cases are bound by two forall's with different semantics.

Or maybe the types are still complicated, but I have spent so much time on this problem that I finally have an intuition for them :)

gelisam commented 10 years ago

We are finally ready to tackle the case of --reverse:

data UserExpression =
  UserExpression (forall e. HaskellExpr e)

data ContainerChecker = forall f. Reversible f =>
  ContainerChecker (forall a. HaskellExpr ((a -> Bool) -> f a -> Bool))

data ValueChecker =
  ValueChecker     (forall a. HaskellExpr ((a -> Bool) ->   a -> Bool))

data IsEmpty = forall a.
  IsEmpty (HaskellExpr (a -> Bool))

data ContainerTransformer =
  ContainerTransformer (forall f a. Reversible f =>
                        HaskellExpr (f a -> f a))

data ValueTransformer =
  ValueTransformer (forall a. HaskellExpr (a -> a))

checkExprEmptiness9 :: [String] -> UserExpression -> IO Bool
checkExprEmptiness9 flags eExpr = interpretExpr eCheckEmptiness
  where
    eCheckEmptiness :: HaskellExpr Bool
    eCheckEmptiness = case eExpr of
      UserExpression eUE -> case eIsEmpty of
        IsEmpty eIE -> case eAnyAnywhere of
          Left (ContainerChecker eAA) -> case eTransform of
            Left (ContainerTransformer eT) -> eAA `eAp` eIE `eAp` (eT `eAp` eUE)
            Right (ValueTransformer    eT) -> eAA `eAp` eIE `eAp` (eT `eAp` eUE)
          Right (ValueChecker    eAA) -> case eTransform of
            Left (ContainerTransformer eT) -> error "--reverse needs a container"
            Right (ValueTransformer    eT) -> eAA `eAp` eIE `eAp` (eT `eAp` eUE)

    eAnyAnywhere :: Either ContainerChecker ValueChecker
    eAnyAnywhere | "--list" `elem` flags = Left (ContainerChecker eAny)
                 | "--tree" `elem` flags = Left (ContainerChecker eAnyLeaf)
                 | otherwise = Right (ValueChecker eId)

    eTransform :: Either ContainerTransformer ValueTransformer
    eTransform | "--reverse" `elem` flags = Left (ContainerTransformer eGenericReverse)
               | otherwise                = Right (ValueTransformer eId)

    eIsEmpty :: IsEmpty
    eIsEmpty | "--string" `elem` flags = IsEmpty eNull
             | "--int"    `elem` flags = IsEmpty (eEq `eAp` eZero)
             | otherwise = error "need to use --string or --int"

Since we pattern-match on the four different cases, we can explicitly throw an error when invalid flags are combined. More importantly, if we do try to compute a value instead of throwing an error, we get a type error at compile time. After all, that's the whole point of using HaskellExpr instead of String!

gelisam commented 10 years ago

I think this is a duplicate of issue #40. Since this issue contains a lot more information than #40, should we close #40 instead?

melrief commented 10 years ago

Yes, let's continue here. I'm still busy with the thesis, I will read this thread when I have time. In the meanwhile, good luck :).

gelisam commented 10 years ago

Ok, so back to the problem at hand. To recap, we have an untyped expression which we want to manipulate in a type safe way. I'm having a lot of trouble doing so, but why? What makes this problem so difficult?

I thought it was because the type of the user expression depends on the flags, but that's not it. The untyped expression can simply be given the type forall e. HaskellExpr e, as above, meaning it can have any type we need it to have. Depending on the flags, we then specialize e to be the type we need, for example [[ByteString]] -> Int.

I then thought that the difficulty was caused by the interaction between the type restrictions imposed by the different flags. But that's not it either! If all the type restrictions are given in one place, then everything typechecks just fine:

-- using ordinary types instead of phantom types, for simplicity
process :: Action -> (forall e. e) -> InputFormat -> IO ()
process Eval  e _              = print $ (e :: ())
process Apply e BareString     = print $ (e :: String -> ()) bareString
process Apply e ListString     = print $ (e :: [String] -> ()) listString
process Apply e ListListString = print $ (e :: [[String]] -> ()) listListString
process Map   _ BareString     = error "Map is incompatible with BareString."
process Map   e ListString     = print $ map (e :: String -> ()) listString
process Map   e ListListString = print $ map (e :: [String] -> ()) listListString

The problem is that Hawk's codebase isn't as simple as the above anymore. The code which deals with the action flags is in System.Console.Hawk, far from the code which deals with the input format flags, in System.Console.Hawk.Runtime. This makes things more complicated, because the type restrictions now affect more than just one branch, they now need to be carried from one function to another, and for this they need to be represented in some form. This representation is what all the machinery in my previous comments have been about.

Maybe it would be simpler to refactor Hawk so that all the type restrictions would occur in the same function once again?

gelisam commented 10 years ago

Hmm, turns out the forall e. HaskellExpr e trick isn't as good as I thought.

I still wanted to try separating the action-flags part from the input-flags part, so I came up with the following intermediate representation for a user expression on which the action flags have already been applied.

data CanonicalExpression
    = ProcessList (forall a. [a] -> HawkOutput)
    | ProcessValue (forall a. a -> HawkOutput)

data HawkOutput = forall a. Rows a => HawkOutput a

With a few more tricks, I managed to implement a version of System.Console.Hawk.wrapExpr which produces a CanonicalExpression, and a version of System.Console.Hawk.Runtime.processTable which accepts them. But then I still got type errors at runtime :(

First, the user expression produces a value of some type satisfying the Rows constraint. In order to convert this user expression into a CanonicalExpression, I need to wrap its output inside a HawkOutput constructor. This is easy to do; but it's also easy to forget to do so.

go1 :: HaskellExpr (a -> HawkOutput)
go1 = eConst `eAp` (eHawkOutput `eAp` eExpr)

-- oops! forgot to apply eHawkOutput.
go2 :: HaskellExpr (a -> HawkOutput)
go2 = eConst `eAp` eExpr

Unfortunately, if eExpr has type forall e. HaskellExpr e, then go2 typechecks and go1 doesn't! It doesn't typecheck because the type at which e should be instantiated is ambiguous. This is easily fixed with a type annotation, but the fact that go2 typechecks when it shouldn't is harder to fix. It typechecks because e can be instantiated at e ~ HawkOutput.

The other problem with the forall e. HaskellExpr e representation is that the user expression doesn't really have all possible types, it has exactly one type, we just don't know what it is yet. If we just instantiate e once at one particular type, then things are fine. With the above CanonicalExpression representation, I tried to delay this decision until the input-flags information was in, but the forall a. a -> HawkOutput forced the expression to support every possible input types. I only discovered my mistake much later, with a runtime error saying that [[ByteString]] doesn't match a.

gelisam commented 10 years ago

Oh, I know what my mistake was! I added a new primitive to convert between (forall a. HaskellExpr (f a)) and HaskellExpr (forall a. f a), but perhaps this isn't a sound conversion after all.

gelisam commented 9 years ago

History rewrite warning: I had to rebase this branch on top of master because it wasn't compiling with GHC 7.8.

gelisam commented 9 years ago

I'm trying to remember the status of this issue. My comments seem to indicate that there is still a runtime error, but all the tests pass. Maybe I've fixed that bug already?

gelisam commented 9 years ago

The current behaviour (on master) does not match the table at the top of this issue. An empty record separator implies that we don't split the fields either. So the first table should read:

-d, -D; -D''
-a [[s]] -> a s -> a
-m [s] -> a xx
-e a a

Note that the upper-right cell is now s -> a instead of [s] -> a.

gelisam commented 9 years ago

Hmm, my comments mention a type called CanonicalExpression, but I don't see that in issue-106's history. Maybe I did all of the above work in some separate test file, which I never checked in?

Yup, looking at the commits for this branch, it looks like I've implemented the "simple system of phantom types" I mention at the beginning, but that the type of the user expression is always forall a. a. That is, its type is not further restricted as each flag is processed.

gelisam commented 9 years ago

Let's recap. This forall a. a representation for the user expression is insufficient because we might think that we're instantiating it at a particular concrete type, such as [[B.ByteString]] -> (), but in fact we forgot some step or another and the forall swallows the difference, leading to an error at runtime instead of compile time. The easiest way to resolve this would be to always use a type annotation to say at which type we intend to use the user expression, but there are a few complications.

First, the code which determines which type the user expression should have is distributed among many files, so there is no single point in the code at which enough information is known for the type annotation to be put. Second, the different decisions regarding that type interact with each other in ways which do not combine well at the type level. Third, some flag combinations are illegal, so there cannot be a single universally-quantified type which summarizes the results of earlier decisions, there must at least be one constructor for the failure case and one for the success case. Finally, the types I came up with to work around some of those problems are way too complicated.

So here's a tentative solution with very simple types! Let's represent the user expression by a record of Maybe fields, one for each concrete type the user expression could have. At first, all the fields would be filled in, and as each flag is processed, some of the possibilities would be set to Nothing. This way, each decision can be made independently, fmapping an implementation segment over those user expression types that they support, and setting the others to Nothing. Once all the flags are processed, if there is exactly one Just field left, then that's the one we should use, if there is more than one, then the flag combination was somehow ambiguous, and if there are none, then the flags were conflicting.

One limitation of this approach is that it cannot check at compile time whether there exist some flag combinations which are ambiguous (there shouldn't) or conflicting (there should).

gelisam commented 9 years ago

It worked!!

There were more subtleties than I expected, but the transformation of the user transformation is now totally type-safe, and all the tests pass. I'm still a bit concerned about the safety of the part (now called canonicalizeExpression) where the input is packed into a [[ByteString]] and unpacked on the runtime side, but that is more an issue about totality than type-safety.

I have a few more ideas for improving upon this, so I'm not merging into master yet, but I've worked (on and off) on this for so long now, I'm just glad I finally found a working solution :)

gelisam commented 9 years ago

My improvement ideas didn't work out, so I'll merge the code as is.

The improvement I was thinking of was to extend HaskellExpr to support manipulating multiple alternate expressions (of different types) at once, that is, to create a generic framework instead of UserExpr's one-time workaround.

I think it would be possible to build such a system, but using such a system would make our code harder to read. Instead of being explicit about what to do with all four cases, we'd have generic methods for dropping some of the cases and duplicating them, and those methods would be called in a point-free style which might obscure the intent. Furthermore, since the types of each of the cases would be distinct, users of the system would still have to specify four type-signatures, even if they're using libraries to hide the four implementations. And fmap could not be used to lift a polymorphic function into the system, each such function would need a special version just for the system.

So let's keep things the way they are!

gelisam commented 9 years ago

Hmm, this is embarrassing. Didn't cabal install --enable-tests used to also silently run the tests?

Anyway, it turns out that cabal test doesn't actually do anything on my machine (it says my version of cabal is wrong), and when I run them with cabal-1.18, I see that I actually broke the tests. Oops!

gelisam commented 9 years ago

Phew! I only broke tests by removing an "unused import" which was actually used by a doctest. The tests do pass now, which means that my UserExpr work was indeed correct.