augustss / MicroHs

Haskell implemented with combinators
Other
363 stars 25 forks source link

let, case, do and layout #21

Closed dmjio closed 9 months ago

dmjio commented 10 months ago

Single line let in statements don't seem to be getting their layout applied correctly.

It seems like this is the "parse-error(t)" case in the Haskell Report section 10.3, where the parser has to get involved in the lexing phase.

layout (t : ts) (m : ms) = } : (L (t : ts) ms) | if m /= 0 and parse-error(t)

Examples below

main :: IO ()
main = putStrLn $ intercalate " " [ showToken t | t <- lexTop "let x = 3 in x" ]
-- let { x = 3 in x } 

-- ^ should be "let { x = 3 } in x"
main = mapM_ print (lexTop "let x = 3 in x")

{- tokens

TIdent (1,1) [] "let"                                                                                                                                                                                                                         
TSpec (1,5) '{'                                                                                                                                                                                                                               
TIdent (1,5) [] "x"                                                                                                                                                                                                                           
TIdent (1,7) [] "="                                                                                                                                                                                                                           
TInt (1,9) 3                                                                                                                                                                                                                                  
TIdent (1,11) [] "in"                                                                                                                                                                                                                         
TIdent (1,14) [] "x"                                                                                                                                                                                                                          
TSpec (0,0) '}'   

-}

There's a similar story with same-line case

In GHC the below is valid

λ> (case 1 of 1 -> 1, 2)  
(1,2)

But in mhs it will lex as:

( case 1 of { 1 -> 1 , 1 ) } 

{- tokens

TSpec (1,1) '('                                                                                                                                                                                                                                                                       
TIdent (1,2) [] "case"                                                                                                                                                                                                                                                                
TInt (1,7) 1                                                                                                                                                                                                                                                                          
TIdent (1,9) [] "of"                                                                                                                                                                                                                                                                  
TSpec (1,12) '{'                                                                                                                                                                                                                                                                      
TInt (1,12) 1                                                                                                                                                                                                                                                                         
TIdent (1,14) [] "->"                                                                                                                                                                                                                                                                 
TInt (1,17) 1                                                                                                                                                                                                                                                                         
TSpec (1,18) ','                                                                                                                                                                                                                                                                      
TInt (1,20) 1                                                                                                                                                                                                                                                                         
TSpec (1,21) ')'                                                                                                                                                                                                                                                                      
TSpec (0,0) '}'

-}

do as well

> putStrLn $ intercalate " " [ showToken t | t <- lexTop "if True then do putStrLn \"hey\" else do pure ()" ] 
if True then do { putStrLn "hey" else do { pure ( ) } } 

{- tokens

TIdent (1,1) [] "if"                                                                                                                                                                                                                                                                  
TIdent (1,4) [] "True"                                                                                                                                                                                                                                                                
TIdent (1,9) [] "then"                                                                                                                                                                                                                                                                
TIdent (1,14) [] "do"                                                                                                                                                                                                                                                                 
TSpec (1,17) '{'                                                                                                                                                                                                                                                                      
TIdent (1,17) [] "putStrLn"                                                                                                                                                                                                                                                           
TString (1,26) "hey"                                                                                                                                                                                                                                                                  
TIdent (1,32) [] "else"                                                                                                                                                                                                                                                               
TIdent (1,37) [] "do"                                                                                                                                                                                                                                                                 
TSpec (1,40) '{'                                                                                                                                                                                                                                                                      
TIdent (1,40) [] "pure"                                                                                                                                                                                                                                                               
TSpec (1,45) '('                                                                                                                                                                                                                                                                      
TSpec (1,46) ')'                                                                                                                                                                                                                                                                      
TSpec (0,0) '}'                                                                                                                                                                                                                                                                       
TSpec (0,0) '}' 

-}
augustss commented 10 months ago

That's right. There's a small comment about it in the README. The missing thing is insertion of a '}' when there is a syntax error. The Haskell report describes what should be done.

If you can figure out a simple way of doing it, I'm all ears.

dmjio commented 10 months ago

The side condition parse-error(t) is to be interpreted as follows: if the tokens generated so far by L together with the next token t represent an invalid prefix of the Haskell grammar, and the tokens generated so far by L followed by the token “}” represent a valid prefix of the Haskell grammar, then parse-error(t) is true.

It seems like we'd need to maintain an explicit list of tokens consumed and run them all through MicroHs.Parse.parse to determine if we have an invalid prefix. This sounds like it could be pretty terrible for performance depending on the program. It might also require mutually recursive imports between Lex.hs and Parse.hs.

There is John Meacham's attempt at a new layout algorithm (the ALR) here: https://www.mail-archive.com/haskell-prime@haskell.org/msg01938.html. He attempts to make explicit the parse-error(t) cases, but notes at the end additional cases remain.

His main innovation seems to be tracking the keyword that began a layout context in the layout stack (along with the indentation level) to determine if popping a layout context from the stack is appropriate.

-- VLCurly is only inserted to get a column number of the first lexeme after
-- a layout starter since we don't keep full positions of every lexeme in Token
-- for clarity.
data Token =
    Token String
    | TokenVLCurly String !Int
    | TokenNL !Int
    deriving(Show)

data Context
    = NoLayout String String  -- what opened it and what we expect to close it.
    | Layout String !Int
    deriving(Show)

layout :: [Token] -> [Context] -> [Token]
layout (TokenNL n:Token "in":rs) (Layout "let" n':ls) = rbrace:Token "in":layout rs ls
layout (TokenNL n:Token s:rs) (Layout h n':ls)
    | s `elem` layoutContinuers = layout (Token s:rs) (Layout h (min n' n):ls)
layout (TokenNL n:rs) (Layout h n':ls)
    | n == n' = semi:layout rs (Layout h n':ls)
    | n > n' = layout rs (Layout h n':ls)
    | n < n' = rbrace:layout (TokenNL n:rs) ls
layout (TokenNL _:rs) ls = layout rs ls
layout (TokenVLCurly h n:rs) (Layout h' n':ls)
    | n > n' = lbrace:layout rs (Layout h n:Layout h' n':ls)
    | otherwise = lbrace : rbrace : layout rs (Layout h' n':ls)
layout (TokenVLCurly h n:rs) ls = lbrace:layout rs (Layout h n:ls)
layout (t@(Token s):rs) (dropLayouts -> (n,Just (b,e),ls)) | s == e
    = replicate n rbrace ++ t:layout rs ls
layout (t@(Token s):rs) ls | Just e <- lookup s layoutBrackets
    = t:layout rs (NoLayout s e:ls)
layout (t@(Token s):rs) ls@(Layout c _:_) |
    Just e <- lookup c conditionalBrackets >>= lookup s = t:layout rs (NoLayout s e:ls)
layout (t@(Token "in"):rs) ls = case ls of
    Layout "let" n:ls -> rbrace:t:layout rs ls
    ls -> t:layout rs ls
layout (t@(Token ","):rs) (Layout "let" _:NoLayout "|" e:ls) = rbrace:layout (t:rs) (NoLayout "|" e:ls)
layout (t@(Token "where"):rs) ls = case ls of
    Layout l n : rest | l `elem` ["do","of"]
        -> rbrace : t : layout rs rest -- 'where' closes 'do' and 'case' on equal indentation.
    _otherwise -> t : layout rs ls

layout (t:rs) ls = t:layout rs ls
layout [] (Layout _ n:ls) = rbrace:layout [] ls
layout [] [] = []
layout x y = error $ "unexpected layout: " ++ show (x,y)

-- unwind all pending layouts
dropLayouts :: [Context] -> (Int,Maybe (String,String),[Context])
dropLayouts cs = f 0 cs where
    f n [] = (n,Nothing,[])
    f n (NoLayout b e:ls) = (n,Just (b,e),ls)
    f n (Layout {}:ls) = f (n + 1) ls

semi = Token ";"
lbrace = Token "{"
rbrace = Token "}"

fsts = map fst
snds = map snd

layoutStarters   = ["where","let","of","do"]

-- these symbols will never close a layout.
layoutContinuers = ["|","->","=",";",","]

-- valid in all contexts
layoutBrackets = [
    ("case","of"),
    ("if","then"),
    ("then","else"),
    ("(",")"),
    ("[","]"),
    ("{","}")
    ]

conditionalBrackets = [
    ("of",[("|","->")]),
    ("let",[("|","=")]),
    ("[",[("|","]")])
    ]

^ I might try to implement this and see if it handles more cases

melted commented 10 months ago

For a quick and dirty solution, I wonder if not just treating the ending brace in a block as optional in the parser will net the same effect, since if it works it's OK according to the spec and if it doesn't work there will be a guaranteed error from somewhere else, though likely with a highly misleading syntax error. There is the problem that there is no way in the parser to see which of the braces were in the source rather than generated by layout (and thus should be explicitly terminated).

augustss commented 9 months ago

Yeah, I was think that a parsing hack might get us most of the way. It's very difficult to actually do it right. It don't think any Haskell compiler handles this case correctly: case x of 0 -> 1 == 2 == 3 This should be converted to case x of { 0 -> 1 == 2 } == 3, since == is not associative. But operator precedence resolution is not done during parsing; we need to resolve imports first to know the fixity.

augustss commented 9 months ago

The most annoying one to me is let ... in .... It would be easy to make the } optional before in. It's not so easy for case and do since they don't have a terminator. But I was thinking that maybe inserting a backtracking point at every place where there could have been a '}', but there wasn't could be feasible.

augustss commented 9 months ago

I've figured out how to do it. Coming very soon.

dmjio commented 9 months ago

Awesome !

dmjio commented 9 months ago

TokenMachine is very clever (w/ Pop and Next states), this should probably go in the next Haskell Report (with instructions to use it with the let, where, do, and case blocks).

augustss commented 9 months ago

It would be more natural to represent the state of the token machine as a pair of context stack and remaining tokens, rather than as a partially applied function. But that was much slower, like 15% increase in compilation time. So that's why it looks that way.

A drawback to this approach is that if you backtrack (e.g., looking at the same token more than once), the token machine will recompute the next token every time. There might be some clever way to cache it, but I didn't bother since this seems fast enough.