cdepillabout / pretty-simple

pretty-printer for Haskell data types that have a Show instance
https://hackage.haskell.org/package/pretty-simple
BSD 3-Clause "New" or "Revised" License

parser/printer could be more lazy (to handle prefix of infinite stream) #9

Closed · jwaldmann closed this issue 6 years ago

jwaldmann commented 7 years ago

Just an idea. It seems that the input is evaluated fully, before any output appears.

Try pPrint $ repeat (False,()) which shows nothing, while show $ repeat (False,()) will produce output.

At least in a ghci session, it's nice to see prefixes of (potentially) infinite streams appear as they are computed, as in filter (\q -> show q == reverse (show q)) $ map (^2) [1..] or whatever.

Now I see that this is a challenge for the parser (it wants nested parentheses, so it will always wait for the closing one?) and for the printer (layout combinators may be strict as well).
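
For instance, in a ghci session (a sketch of the behaviour described above):

> take 30 (show (repeat (False,())))
"[(False,()),(False,()),(False,"
> pPrint (repeat (False,()))
-- no output ever appears: the whole input is forced first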

cdepillabout commented 7 years ago

I think you're definitely right. It would be nice if pPrint could stream output the same way that the normal print function streams output.

However, like you say, I'm not sure how to do this in practice. The parser wants nested parentheses (in order for the output printer to print the right amount of indentation), so I'm not sure how to go about implementing this.

If I were to take a guess at it, I might start playing around with the streaming parsers like pipes-attoparsec or conduit-extra's Data.Conduit.Attoparsec module.

I'm pretty sure there are incremental streaming parsers for things like JSON and XML, so I think it should work in theory.
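
For reference, attoparsec's incremental interface looks roughly like this (just a sketch; note that incremental input is not necessarily the same thing as lazily produced output, which is what we ultimately want here):

import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

-- Parse a parenthesised chunk, feeding the input in two pieces; attoparsec
-- returns Partial when it needs more input.
incremental :: A.Result T.Text
incremental =
  case A.parse (A.char '(' *> A.takeWhile (/= ')') <* A.char ')') (T.pack "(12") of
    A.Partial k -> k (T.pack "3)")   -- supplying the rest yields Done "" "123"
    other       -> other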

If you (or anyone else) wants to take a shot at implementing this, I would be happy to help.

It might be easier to tackle this after implementing #16.

cdepillabout commented 6 years ago

Edward Kmett left a good comment about this on reddit:

One way to do it if you just wanted highlighting, rather than pretty printing, would be to simply apply a Haskell lexer rather than a full parser and highlight based solely on lexemes, and revert after the lexeme stream is illegal to just copying the source out. You might get some false highlighting at the start, but you'd be properly lazy.

Alternately you could give it an amount of 'fuel' to parse the original representation in full, and if it succeeds print prettily, falling back on the lexer approach to at least highlight. (If you make some other assumptions about Haskell's grammar like parens, braces, brackets are balanced, etc. you can still get some pretty printing formatting even in the lazy case, but it becomes pretty ad hoc.)
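
To sketch what the lexeme-only approach might look like (the token names here are hypothetical, not anything in pretty-simple today): a character-level tokenizer built with map is lazy for free, since it never has to wait for a closing delimiter.

data Token = TOpen Char | TClose Char | TComma | TQuote | TOther Char
  deriving (Show, Eq)

lexChars :: String -> [Token]
lexChars = map classify
  where
    classify c
      | c `elem` "([{" = TOpen c
      | c `elem` ")]}" = TClose c
      | c == ','       = TComma
      | c == '"'       = TQuote
      | otherwise      = TOther c

-- A prefix of an infinite Show output can be tokenized without forcing the rest:
--   take 4 (lexChars (show (repeat ()))) gives [TOpen '[',TOpen '(',TClose ')',TComma]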

andrew-lei commented 6 years ago

Would something like this work? https://gist.github.com/andrew-lei/565d7fa9e71059a59d979371c604562c

andrew-lei commented 6 years ago

I've just realised a couple of details are wrong (e.g. I've ignored legal Haskell operators, and this only lexes non-negative integers), but those should be fixable.

What to do about non-legal Haskell?

cdepillabout commented 6 years ago

@andrew-lei Thanks for the work on this. I think something like your code posted above is definitely on the right track!

What to do about non-legal Haskell?

pretty-simple doesn't really care about Haskell syntax, per se. It is really only interested in parens, braces, brackets, commas, and quotes. Right now, the only "non-legal" input to pretty-simple is something where braces, brackets, etc. are not balanced. However, ideally pretty-simple should still try to pretty-print these types of input.

There are three open issues related to this:

A version of "pretty-simple" that can stream output (like you're working on above) should ideally just ignore stray parens, braces, etc.

Issue https://github.com/cdepillabout/pretty-simple/issues/16 is about splitting pretty-simple's current parser into a lexer/parser. However, after thinking about it a little, it is possible that all we really need is a lexer. It might be possible to figure out the output indentation and color based entirely on lexed input instead of fully parsed input. I think your solution above is on the right track as far as this is concerned.
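
As a rough illustration of getting indentation from lexed input alone (reusing the hypothetical Token type from the sketch in my earlier comment, so this is not real pretty-simple code), we could tag each token with a running nesting depth, which is essentially all the printer needs to decide indentation:

annotateDepth :: [Token] -> [(Int, Token)]
annotateDepth = go 0
  where
    go _ []       = []
    go d (t : ts) = case t of
      TOpen _  -> (d, t)     : go (d + 1) ts
      TClose _ -> (d - 1, t) : go (d - 1) ts
      _        -> (d, t)     : go d ts

-- A stray closing delimiter just drives the depth negative instead of failing,
-- which fits the goal of tolerating unbalanced input.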

andrew-lei commented 6 years ago

pretty-simple doesn't really care about Haskell syntax, per se. It is really only interested in parens, braces, brackets, commas, and quotes. Right now, the only "non-legal" input to pretty-simple is something where braces, brackets, etc is not balanced. However, ideally pretty-simple should still try to pretty-print these types of input.

Yes, that's what I was thinking, but I was considering a few possible issues relating to lexing for a number.

1) It should be quite easy to write the lexer to capture a floating-point number in a single pass, but what happens if the lexer starts capturing a number, only to find something that would be non-standard Haskell? E.g. '123.1.1' or '123e5.5'

Since it would make the most sense to store numbers as strings (we aren't doing anything with the numbers anyway), perhaps it would make sense to store anything that starts being interpreted as a number as a string. Or we could stop lexing the rest as a number and store the rest under another constructor, e.g. 'SyntaxError', which could be highlighted in red just as an unmatched paren would be. Or numbers could be parsed strictly instead of lazily, and if something turns out not to be a number it gets read as Other or something, if we're not worried about some number whose string representation is potentially very long (which certainly seems unlikely, but I'm not sure it can be ruled out). In any case, if numbers are lexed lazily, I don't believe it would be possible to simply backtrack and say it isn't a number after all.

2) How exactly to handle negative numbers? The fact that Haskell does not allow a unary minus sign in a data constructor expression without it being wrapped in parentheses probably simplifies things a bit. Instead of interpreting Foo -1 as Foo (-1), Haskell interprets it as Foo - 1, i.e. (-) Foo 1, which is usually nonsensical. Meanwhile, Foo (- 1) is the same as Foo (-1). So perhaps one solution to weird edge cases of operators containing minus signs (e.g. Foo (.- 1)) is to just not bother treating the minus sign as an operator, and to always print a space before the number if an operator or constructor appeared before it? I believe this would always result in Haskell that is semantically the same.

cdepillabout commented 6 years ago

Currently pretty-simple doesn't do anything special with numbers at all, so anything you take the time to implement will be an improvement over the current state of affairs.

As far as I know, pretty-simple is only used for debugging. It is used for printing out large datatypes in a manner that is simple to read. I don't think we would lose any users if we don't handle floating-point numbers and negative numbers perfectly.

For an initial implementation, I think just printing out all numbers in green (and ignoring floating-point and negative numbers) would be sufficient.

That being said, I do appreciate you taking the time to think about this. If you want to tackle how to get floating-point and negative numbers working, that would be great! Let me respond to your questions inline:

It should be quite easy to write the lexer to capture a floating-point number in a single pass, but what happens if the lexer starts capturing a number, only to find something that would be non-standard Haskell? E.g. '123.1.1' or '123e5.5'

Since it would make the most sense to store numbers as strings (we aren't doing anything with the numbers anyway), perhaps it would make sense to store anything that starts being interpreted as a number as a string. Or we could stop lexing the rest as a number and store the rest under another constructor, e.g. 'SyntaxError', which could be highlighted in red just as an unmatched paren would be. Or numbers could be parsed strictly instead of lazily, and if something turns out not to be a number it gets read as Other or something, if we're not worried about some number whose string representation is potentially very long (which certainly seems unlikely, but I'm not sure it can be ruled out). In any case, if numbers are lexed lazily, I don't believe it would be possible to simply backtrack and say it isn't a number after all.

I think you're on the right track here. I don't know if we'd be able to parse lazily while also backtracking when something isn't a number.

I think you are right that it should be fine to store numbers as strings, since that is basically what we are treating them as.

I would be hesitant to treat poorly formatted numbers as a syntax error, since I can easily imagine the Show instance for a type like VersionNumber would output something like 1.2.3.4. This isn't a valid number, but it is definitely what the author of the Show instance intended.

Dates like UTCTime or Day (from the time package) are printed out as invalid numbers as well (I think), but they probably shouldn't be highlighted in red either.

How exactly to handle negative numbers? The fact that Haskell does not allow a unary minus sign in a data constructor expression without it being wrapped in parentheses probably simplifies things a bit. Instead of interpreting Foo -1 as Foo (-1), Haskell interprets it as Foo - 1, i.e. (-) Foo 1, which is usually nonsensical. Meanwhile, Foo (- 1) is the same as Foo (-1). So perhaps one solution to weird edge cases of operators containing minus signs (e.g. Foo (.- 1)) is to just not bother treating the minus sign as an operator, and to always print a space before the number if an operator or constructor appeared before it?

If you wanted to try implementing this, I would say go for it! It sounds like it might work as you describe :+1:

andrew-lei commented 6 years ago

Yes, you are right about UTCTime and Day. To complicate matters further, Day uses dashes between the parts of the date (which could interfere with the negative-number case, since a space after the dash would cause read to fail for Day; but since there is no dash in front, that probably doesn't matter), UTCTime has colons, and so on.

Is the following fully general? A number

  • starts from a digit 0-9
  • followed by any number of non-space, non-alpha (except for 'e'), non-bracket, non-quote characters
  • but at most one 'e' (GHC interprets Foo 123e5e as Foo 123e5 e)

The moment the lexer for a number encounters a space, non-'e' alpha, bracket, quote, or a second 'e', that becomes part of the next token.

Or should a number be parsed strictly/eagerly, and a failed number then goes as an 'Other'?

cdepillabout commented 6 years ago

Is the following fully general? A number

  • starts from a digit 0-9
  • followed by any number of non-space, non-alpha (except for 'e'), non-bracket, non-quote characters
  • but at most one 'e' (GHC interprets Foo 123e5e as Foo 123e5 e)

The moment the lexer for a number encounters a space, non-'e' alpha, bracket, quote, or a second 'e', that becomes part of the next token.

These rules sound good to me.

This way, a UTCTime (which has a Show instance that looks like 2018-07-18 01:29:49.847053512 UTC) would be almost fully highlighted, except for the UTC string on the end.

Or should a number be parsed strictly/eagerly, and a failed number then goes as an 'Other'?

This would be fine with me as well. For instance, if some Show instance produced a token like 1foo, then I guess this should just be printed as Other. It might look weird for the 1 to be considered a Number and the rest of the token (foo) to be considered Other.
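
A rough sketch of what a lexer following these rules might look like (a hypothetical helper, not code from pretty-simple; I've also assumed a comma ends the token, since pretty-simple treats commas as separators):

import Data.Char (isAlpha, isDigit)

lexNumber :: String -> Maybe (String, String)   -- (number token, rest of the input)
lexNumber (c : cs)
  | isDigit c = Just (c : tok, rest)
  where
    (tok, rest) = go False cs
    go _ [] = ([], [])
    go seenE (x : xs)
      | x == 'e' && not seenE = let (t, r) = go True xs  in (x : t, r)
      | stop x                = ([], x : xs)
      | otherwise             = let (t, r) = go seenE xs in (x : t, r)
    stop x = x == ' ' || x == ',' || isAlpha x || x `elem` "()[]{}\"'"
lexNumber _ = Nothing

-- lexNumber "123e5e bar" gives Just ("123e5", "e bar"): the second 'e' starts the next token.
-- Dashes, dots, and colons are kept, so "2018-07-18" lexes as a single number token.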

andrew-lei commented 6 years ago

OK, I have a working lazy parser in my fork of the repo. Proof of laziness:

> take 10 . unCommaSeparated . unBracket . head . expressionParse . show $ [1..]
[[Other "1"],[Other "2"],[Other "3"],[Other "4"],[Other "5"],[Other "6"],[Other "7"],[Other "8"],[Other "9"],[Other "10"]]

Almost all of the tests pass; I'll look into the ones that don't (see below). Unfortunately, pShow is still not lazy, so something in the render function must also be changed.

Failed test:

### Failure in src/Text/Pretty/Simple.hs:210: expression `pPrintOpt defaultOutputOptionsNoColor (1, (2, "foo\nbar\nbaz", 3))'
expected: ( 1
          ,
              ( 2
              , "foo
                bar
                baz"
              , 3
              )
          )
 but got: ( 1
          , 
              ( 2
              , "foo\nbar\nbaz" 
              , 3
              ) 
          ) 
Examples: 50  Tried: 47  Errors: 0  Failures: 1

cdepillabout commented 6 years ago

This looks great!

Feel free to fix up the tests as you see fit. If you need any help or have any questions with the tests let me know. I would be happy to help.

andrew-lei commented 6 years ago

The test is right; I believe I'm just not properly catching escaped characters (which I can work out later).

There seem to be several barriers to laziness on the output end. First, modificationsExprList relies on a function that ends up calling foldl over the expression list. I believe the problem is that foldl is tail recursive and so doesn't produce anything until it reaches the end of the list. The fix is pretty easy: just replace it with foldr.
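
(A generic ghci illustration of the difference, not pretty-simple's actual code:)

> take 3 (foldr (:) [] [1..])
[1,2,3]
> take 3 (foldl (flip (:)) [] [1..])
-- never returns: foldl has to traverse the whole list before producing anything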

I believe the other problems involve the state execution requiring a traversal to the end of the list, and the representation of the outputs as a strict Seq Output. Seq is required to be finite and strict, so lists (or something similar) must be used instead. I don't think there will be much of a performance hit from using lists, though, since lazy concatenation of lists is supposed to be O(1), with the O(n) cost only paid on a full traversal.

I'll have to think about how to deal with running the printer state, since the solution there isn't immediately obvious to me.

andrew-lei commented 6 years ago

I believe the way to handle the printer state is probably to move the output list out of the printer state and use traverse with evalState rather than traverse_ with execState. Functions, rather than having a return type of MonadState PrinterState m => m (), should have a return type of MonadState PrinterState m => m [Output] or MonadState PrinterState m => m Output. The former would probably be easier, but would require using concat after the traversal to join the resulting list of lists into a single list.
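
Here is a rough sketch of that shape over hypothetical simplified types, rather than the real PrinterState and Output (using the lazy State monad, which I believe is what would let the output stream):

import Control.Monad.State.Lazy (State, evalState, get, put)

data Piece  = POpen | PClose | PText String   -- stand-in for the real Expr
data Output = Output Int String               -- nesting level and text
  deriving Show

step :: Piece -> State Int [Output]
step POpen     = do n <- get; put (n + 1); pure [Output n "("]
step PClose    = do n <- get; put (n - 1); pure [Output (n - 1) ")"]
step (PText s) = do n <- get; pure [Output n s]

-- traverse + evalState returns the outputs instead of burying them in the state,
-- so the caller can consume them; concat joins the list of lists.
piecesToOutputs :: [Piece] -> [Output]
piecesToOutputs ps = concat (evalState (traverse step ps) 0)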

Also, is it possible to use the monadic parser combinator libraries lazily? In my solution I parsed it manually, but I suspect if I were more familiar with any of these libraries I could figure out a way to use them lazily. I haven't seen any examples of usage on infinite input.

cdepillabout commented 6 years ago

There seem to be several barriers to laziness on the output end. First, modificationsExprList relies on a function that ends up calling foldl over the expression list. I believe the problem is that foldl is tail recursive and so doesn't produce anything until it reaches the end of the list. The fix is pretty easy: just replace it with foldr.

I believe the other problems involve the state execution requiring a traversal to the end of the list, and the representation of the outputs as a strict Seq Output. Seq is required to be finite and strict, so lists (or something similar) must be used instead. I don't think there will be much of a performance hit from using lists, though, since lazy concatenation of lists is supposed to be O(1), with the O(n) cost only paid on a full traversal.

Your two proposed solutions here sound good to me.

I think it is more important to get lazy (streaming?) parsing/printing working than it is to make sure that pretty-simple is as fast as possible.

I believe the way to handle the printer state is probably to move the output list out of the printer state and use traverse with evalState rather than traverse_ with execState. Functions, rather than having a return type of MonadState PrinterState m => m (), should have a return type of MonadState PrinterState m => m [Output] or MonadState PrinterState m => m Output. The former would probably be easier, but would require using concat after the traversal to join the resulting list of lists into a single list.

This also sounds good to me. To be honest, I haven't really looked deeply at the output-printing Haskell code, so feel free to make any changes necessary to get this working! Even big sweeping changes are completely fine!

Also, is it possible to use the monadic parser combinator libraries lazily?

The problem with some of the parsing combinator libraries is their return type.

For instance, here is parsec's runParser (with the type changed a little to make it easier to understand):

runParser :: Parsec a -> Text -> Either ParseError a 

The problem here is the return type: the whole input has to be run through the Parsec a before runParser knows whether it can return a Right or a Left.

Your code from above is basically just returning a [Token]. There is no chance of the parser "failing".

I'm not sure what the best way to do this is.

Although now that I think about it, I haven't actually tried Parsec in this case. So I dunno, maybe it all just works out.

andrew-lei commented 6 years ago

Yes, that was my conclusion about Parsec as well, which was why I went the manual route instead. I think I did see something about a lazy many parsing combinator somewhere; I'll see if I can dig it up.

As I've mentioned, the StringLit parsing I wrote has some problems because it reads escaped characters as multiple characters. In my opinion, it definitely should read the literal character, which is the behaviour of the current parser. The question is what to do if an invalid string literal is read. For example, there is no such character as '\c'. If you print it the same regardless, something like "ab\c" risks confusion with "ab\\c". I think the StringLit constructor should be applied to a list of some type containing valid string-literal sequences interleaved with any invalid escape sequences, so the invalid escape sequences can be highlighted in red as an error. Of course, there's still the question of what to do in the no-colour mode.

This risks the same issue you were worried about with respect to numbers, namely Show instances doing something unusual, but probably fewer Show instances print quotes containing non-standard escape characters or lone backslashes than print something that starts with a number. At any rate, it is probably not a good idea to try to parse a string and then switch to something else if it fails to be a valid string literal, since strings can potentially be infinite in length.
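
Here is a sketch of the kind of split I mean, over a deliberately simplified set of escapes (real Haskell string literals allow many more, e.g. numeric escapes, so this is only illustrative). It stays lazy by emitting one piece per character or escape rather than merging runs:

data StringPiece = Valid String | BadEscape String
  deriving Show

splitEscapes :: String -> [StringPiece]
splitEscapes []         = []
splitEscapes ('\\' : c : rest)
  | c `elem` "\\\"'nrt" = Valid ['\\', c]     : splitEscapes rest
  | otherwise           = BadEscape ['\\', c] : splitEscapes rest
splitEscapes ['\\']     = [BadEscape "\\"]
splitEscapes (c : rest) = Valid [c]           : splitEscapes rest

-- splitEscapes "ab\\c" gives [Valid "a",Valid "b",BadEscape "\\c"], so the bad
-- escape can be highlighted in red while the rest is printed normally.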

cdepillabout commented 6 years ago

@andrew-lei Thanks for thinking more about this.

I'm not sure if I have a good answer here.

I had two thoughts:

  1. If we just decided to parse strings strictly, then we could more-or-less just keep the string-parsing code working as-is. We wouldn't be able to use pretty-simple to print out data structures with infinite-length strings, but I've never personally wanted to do that, so I guess I'm okay with that compromise.

  2. We could always just go with whatever is easiest to implement. If it allows pretty-simple to pretty-print lazily, then it is arguably better than the current situation. If we ever decide it doesn't work very well and want to go back and change it, we will probably be in a better situation than we are currently in.

However, I trust your judgement with this, so I'm definitely willing to go with whatever you deem is the best course of action.

andrew-lei commented 6 years ago

I think if we parsed string literals lazily, we could design a post-processing step, taking a lazy list of string-literal pieces and invalid escape sequences, that allows configuration for unusual cases. On the other hand, I don't think it would be possible to go the other way if we started out by parsing string literals strictly.

Incidentally, this is what I was thinking of doing for numbers, which is part of why I don't have a separate constructor for numbers in my current fork (the other reason being that I wanted to keep the tests correct).

We could also just go with leaving the escaping in. After all, show doesn't give you literal newlines, so it would be reasonable for pShow to do the same. And we could include an optional strict post-processing step for this as well, which would be pretty simple; just read :: String -> String.
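
For example, in ghci, read interprets the escapes and drops the surrounding quotes:

> putStrLn (read "\"foo\\nbar\"" :: String)
foo
bar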

cdepillabout commented 6 years ago

I think if we parsed string literals lazily, we could design a post-processing step, taking a lazy list of string-literal pieces and invalid escape sequences, that allows configuration for unusual cases. On the other hand, I don't think it would be possible to go the other way if we started out by parsing string literals strictly.

Yeah, this sounds good to me!

We could also just go with leaving the escaping in. After all, show doesn't give you literal newlines, so it would be reasonable for pShow to do the same. And we could include an optional strict post-processing step for this as well, which would be pretty simple; just read :: String -> String.

Ideally, I think I would like the default pShow to print out newlines in strings.

This is because pretty-simple is generally used to make large data structures easier to read. I think that actually printing the newlines in a String or Text would make them easier to read (as opposed to just printing the \n). For instance, if you pretty-printed a data structure that had a list of blog posts, I think it would be easiest to read and understand if the blog posts actually had newlines.

Thinking about the end-users, I think the benefit from actually printing the newlines in Strings is greater than the benefit of printing strings lazily. What do you think about this tradeoff?

However, I definitely agree that making this one of the configuration options would be really nice. Users that want it to work differently could easily do that!

andrew-lei commented 6 years ago

Well, andrew-lei@a8c00fb makes expressionsToOutputs lazy; unfortunately, about half the tests are broken. pShow is still not lazy yet, so the last composed function, render, must be holding it up. I'll try to fix the tests before I move on to that. Proof of laziness:

> take 100 . show . expressionsToOutputs . expressionParse . show $ repeat 'a'
"[Output {outputNestLevel = NestLevel {unNestLevel = 0}, outputOutputType = OutputStringLit \"aaaaaaaa"

andrew-lei commented 6 years ago

I believe runReader requires the Reader to be run to completion before returning anything, so it seems like it will not be suitable here, at least not together with the monadic fold.

Perhaps we can extract a function that returns the next bit, either as a MonadReader OutputOptions m => m Char or a MonadReader OutputOptions m => m Text. Then we could run runReader on that and fold over the results outside of the monad.

EDIT: Or perhaps keep the Builders and run the fold on the Builders outside of the monad.

andrew-lei commented 6 years ago

andrew-lei@b341085 makes render lazy

Proof of laziness:

> import Data.Text.Lazy as T
> T.take 100 . pShowNoColor $ [1..]
"[ 1\n, 2\n, 3\n, 4\n, 5\n, 6\n, 7\n, 8\n, 9\n, 10\n, 11\n, 12\n, 13\n, 14\n, 15\n, 16\n, 17\n, 18\n, 19\n, 20\n, 21\n, 22"

No additional tests have been broken; we're currently at 47/50. The StringLit issue still needs to be resolved. I'll probably be able to get that done either today or tomorrow, and submit a pull request then.

cdepillabout commented 6 years ago

That sounds great! Thanks again for working on this.

andrew-lei commented 6 years ago

So it's actually pretty simple to solve these last tests: throw in a read for String and append quotes to the start and end. This isn't lazy, however. I could lift the code from base (Text.Read.Lex) in order to do this, but I was thinking of ways to have this set up as a configurable option.

As I've mentioned, I think it would work well as part of a post-processing step that would allow a user to perform more general sorts of manipulation on the intermediate bits. So the post-processor should be a function of type either [Output] -> [Output] or [Expr] -> [Expr]. I'm leaning towards the latter because I think it would allow for more general manipulation of the structure (for instance, to hide everything in certain Parens).

I was thinking one way to do this would be to have a constructor that contains the parsed value and either the type information (string literal, number, datetime, etc.) or the colour.

Anyway, this would require a bit more of a redesign than is strictly necessary for laziness (plus, I think it would make sense to open it as a separate issue so we can discuss this further), so perhaps this should wait until later. In the meantime I'll just use readMaybe, defaulting on Nothing to the raw string without interpreting the escaping.
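
To make the idea concrete, here is a sketch of the sort of [Expr] -> [Expr] post-processor I have in mind, over a hypothetical simplified Expr rather than the real one (names and shapes are illustrative only):

data Expr
  = Parens   [Expr]
  | Brackets [Expr]
  | Other    String
  deriving Show

-- Collapse the contents of any Parens nested more than `limit` levels of parens deep.
hideDeepParens :: Int -> [Expr] -> [Expr]
hideDeepParens limit = map go
  where
    go (Parens es)
      | limit <= 0 = Parens [Other "..."]
      | otherwise  = Parens (hideDeepParens (limit - 1) es)
    go (Brackets es) = Brackets (hideDeepParens limit es)
    go other         = other

A user-supplied function like this could be slotted in between parsing and rendering, which is the post-processing hook I'm describing.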

cdepillabout commented 6 years ago

@andrew-lei

So it's actually pretty simple to solve these last tests...

I think I see what you did there ;-)

throw in a read for String and append quotes to the start and end. This isn't lazy, however. I could lift the code from base (Text.Read.Lex) in order to do this, but I was thinking of ways to have this set up as a configurable option.

As I've mentioned, I think it would work well as part of a post-processing step that would allow a user to perform more general sorts of manipulation on the intermediate bits. So the post-processor should be a function of type either [Output] -> [Output] or [Expr] -> [Expr]. I'm leaning towards the latter because I think it would allow for more general manipulation of the structure (for instance, to hide everything in certain Parens).

I think both of these ideas sound really good.

It would be great to have a post-processing step that allows a user to perform arbitrary manipulation on the output. This would allow pretty-simple to be used in quite a general manner to pretty-print Haskell types in many different ways.

Anyway, this would require a bit more of a redesign than is strictly necessary for laziness (plus, I think it would make sense to open it as a separate issue so we can discuss this further), so perhaps this should wait until later. In the meantime I'll just use readMaybe, defaulting on Nothing to the raw string without interpreting the escaping.

This sounds like a good plan. It would be great to get a release out with the work you've done so far!