AndreVanDelft / scala

The SubScript extension to the Scala programming language
http://www.subscript-lang.org/
12 stars 1 forks source link

Challenge: match Parboiled2 result value expressiveness #81

Open AndreVanDelft opened 9 years ago

AndreVanDelft commented 9 years ago

Parboiled2 is quite expressive for result values; with help of Shapeless.

E.g. the following lines are extracted from the CSV parser example:

  case class CsvFile(header: Option[Record], records: immutable.Seq[Record])
  case class Record(fields: immutable.Seq[String])

    val file = rule {
      OWS ~ optional(test(ctx.headerPresent) ~ header ~ NL) ~
        oneOrMore(record).separatedBy(NL) ~ optional(NL) ~ EOI ~> CsvFile
    }

    val header = rule { record }

    val record = rule { oneOrMore(field).separatedBy(ctx.fieldDelimiter) ~> Record }

    val NL = rule { optional('\r') ~ '\n' }

    val OWS = rule { zeroOrMore(' ') }

I have no idea how ~> CsvFile gets two parameters for the construction of a CsvFile; one from the optional header and one from the one-or-more records.

Anyway, it is a challenge to improve SubScript's expressiveness so that it would understand a similar parser specification. If no result values had been required the specification would simply be:

file = ows; if (ctx.headerPresent) header NL; record .. NL; .NL; EOI

However, result values are needed. The main challenge is to hand the two parameters to the constructor of CsvFile. Suppose we would write:

file = OWS 
       optional: [header^ NL] ^1
       oneOrMore: record, separatedBy: NL ^2
       NL; EOI
       ~~~> CsvFile

^1 and 2 would imply that the result of the enclosing script (not file here, but a lambda, because of the arrow), must be a tupel with 2 fields. This tuple is filled in 2 steps. It is then transferred using the arrow, and handed as 2 parameters to CsvFile. But as long as I don't understand the Parboiled2 code, I cannot be more specific here.

Scripts such as optional could probably be defined as:

script..

  success[T](v:T) = {: v :} (+)

  optional(r: Script[R]): Option[R] = success(None) + r^{Option(_)}

  zeroOrMore:separatedBy[R](r: Script[R], s: Script[_]): Seq[R] = {:Nil:}^; .; r^{$ :+ _}..s
   oneOrMore:separatedBy[R](r: Script[R], s: Script[_]): Seq[R] = {:Nil:}^;    r^{$ :+ _}..s

Note that r^{Option(_)} sets the result of the enclosing script to the result of the call to r, wrapped in an Option.

There would be more issues, but it would be good to have a proper explanation of the Parboiled2 code first.

BTW the script success may look a bit weird. It starts with a tiny code fragment that sets the result value. This code fragment behaves successfully, but success should act like a 1. Therefore the tiny code fragment is put in a sequential composition with (+) i.e. the 1 element. Maybe a shorter notation would be needed; then we may think of

  success[T](v:T) = {+ v +}

and

  failure(e:Throwable) = {- throw e -}
anatoliykmetyuk commented 9 years ago

I have no idea how ~> CsvFile gets two parameters for the construction of a CsvFile; one from the optional header and one from the one-or-more records.

This rule:

OWS ~ optional(test(ctx.headerPresent) ~ header ~ NL) ~
        oneOrMore(record).separatedBy(NL) ~ optional(NL) ~ EOI

must have a type of Option[Record] :: immutable.Seq[Record], where :: is a thing Shapeless constructs lists of types with. This is so, because only header and record are of type Rule1[_] there, everything else is of type Rule0. Their behaviour on successful match differs: Rule0 does nothing and Rule1[T] pushes the matched value of type T to the value stack. So we can't use the value matched by Rule0, but we can use value matched by Rule1[_] in future.

When two rules of type Rule1[A] and Rule1[B] are concatenated by the ~ operator, the resulting rule is of type Rule1[A :: B] - meaning it will push a list of values (A and B) to the stack.

I don't know how ~> is implemented internally, but the intuition is as follows. The lhs of the ~> is a rule and the rhs is a function that transforms the values outputted by this rule. The result is a rule of the same type as the rhs's output. On compile time, the ~> looks at the left-hand side rule's type - which is a list of types of the values that are pushed by it to the value stack. It'll then typecheck the right-hand side operand to be a function arguments of which are of same types as the values pushed to the stack by the left-hand side rule. Since rules are macros, operations with types must be easy. If everything's OK, ~> produces a new rule. This rule, on runtime, will invoke the ~>'s lhs rule to do a match for it, then pop its matched values from the value stack, transform them with its rhs function and push the result to the stack as its own result.

I don't think SubScript's implementation will be similar to Parboiled2. Parboiled2 parses linearly, hence it is possible to have a value stack and safely work with it. SubScript traverses its call graph in much more complicated way - so we can't use the value stack.

I think, if we're intended to use SubScript for parsing, we should create some parsing tests-challenges for it, and also a module that will turn SubScript into a parser. So that we can see how it performs as we add new features.