BranchTaken / Hemlock

Programming language
Other
30 stars 3 forks source link

Design/implement formatted printing #111

Closed jasone closed 2 years ago

jasone commented 3 years ago

The Format API is powerful, but quite difficult to use correctly. Both Format and Printf are quite complicated to implement from a type safety perspective, and format string ergonomics are less than optimal. Furthermore, the APIs make effect tracking extremely unwieldy. Design and implement a formatted printing API that cleanly supports custom formatting of inputs, effects tracking, file/stream/string output modes, ideally without excessive cognitive load on the programmer using the facility. This facility may best be implemented with explicit syntax support for format strings, in which case it won't be feasible to use in the bootstrap Basis library. But if no special syntax is required, favor implementing this in the bootstrap Basis in order to ease the OCaml -> Hemlock rewrite step of bootstrapping.

Tasks:

jasone commented 2 years ago

There are two parts to this design:

1) Formatted I/O API (mandatory) 2) Syntax for formatted strings (optional)

API

We need to support the possibility of effectless formatting, specifically so that we can use formmatters to create strings without causing side effects. In contrast, OCaml's Printf.fprintf %a/%t specifiers support formatted printing only via effectful functions. The API we want requires that state and behavior be encapsulated; first-class modules provide a clean solution. What follows is written in Hemlock so that the effects semantics are explicit, but the API can be stripped of effect typing and implemented in OCaml. The Fmt.Formatter module type provides the state/behavior encapsulation.

# Fmt.hmi ----------------------------------------------------------------------
type Formatter: Formatter (e: effect) =
    type t
    val fmt: string -> t >e-> t

Any given type can implement a fmt function similar to the following:

type t

# val to_string: t -> string

# val fmt >e: t -> Fmt.Formatter e >e-> Fmt.Formatter e
let fmt t (formatter >e: Fmt.Formatter e) =
    formatter |> Fmt.Formatter.fmt (to_string t)

The key point to note is that fmt has a parametric effect that depends on the formatter, but fmt is otherwise effectless. Thus fmt can be used in conjunction with a string/file/whatever formatter without regard for formatter effects.

Following is a sketch of the APIs needed to support formatted strings and I/O. The syntax design proposed thereafter depends upon this API.

# Fmt.hmi ----------------------------------------------------------------------
type Formatter: Formatter (e: effect) =
    type t
    val fmt: string -> t >e-> t

type just =
| Left
| Center
| Right

type sign =
| Implicit
| Explicit
| Space

type base =
| Bin
| Oct
| Dec
| Hex

type notation =
| ExpE
| ExpP
| Compact

val pad_default: codepoint
val just_default: just
val sign_default: sign
val alt_default: bool
val zpad_default: bool
val width_default: uns
val precision_default: uns
val base_default: base
val notation_default: notation

# Fmt.hm -----------------------------------------------------------------------
let pad_default = ' '
let just_default = Right
let sign_default = Implicit
let alt_default = false
let zpad_default = false
let width_default = 0
let precision_default = 2
let base_default = Dec
let notation_default = Compact

# Uns.hmi ----------------------------------------------------------------------
type uns

val to_string: ?pad:codepoint -> ?just:Fmt.just -> ?sign:Fmt.sign -> ?alt:bool
  -> ?zpad:bool -> ?width:uns -> ?base:Fmt.base
  -> uns -> string
val fmt >e: ?pad:codepoint -> ?just:Fmt.just -> ?sign:Fmt.sign -> ?alt:bool
  -> ?zpad:bool -> ?width:uns -> ?base:Fmt.base
  -> uns -> Fmt.Formatter e >e-> Fmt.Formatter e

# String.hmi -------------------------------------------------------------------
type string

val to_string: ?pad:codepoint -> ?just:Fmt.just -> ?alt:bool -> ?width:uns
  -> string -> string
val fmt >e: ?pad:codepoint -> ?just:Fmt.just -> ?alt:bool -> ?width:uns
  -> string -> Fmt.Formatter e >e-> Fmt.Formatter e

val Fmt:
    type Formatter: Formatter =
      Fmt.Formatter {|} with type t := t

    val empty: Formatter
    val finish: Formatter -> string

    val to_string >e: (Fmt.Formatter e -> Fmt.Formatter e) -> string

# String.hm --------------------------------------------------------------------
let to_string ?(pad=Fmt.pad_default) ?(just=Fmt.just_default)
  ?(alt:Fmt.alt_default) ?(width=Fmt.width_default) s =
    ...

let fmt (>e:effect) ?pad ?just ?alt ?width s (formatter:Fmt.Formatter e) =
    formatter |> Fmt.Formatter.fmt (to_string ?pad ?just ?alt ?width s)

let Fmt = {|
    type t = string list

    let empty = []

    let finish t =
        String.concat_rev t

    let to_string fmt_partial =
        empty |> fmt_partial |> finish
  |}

# File.hmi ---------------------------------------------------------------------
type file

val Fmt:
    type Formatter: Formatter =
      Fmt.Formatter os with type t := t

    val makeFormatter: file -> Formatter
    val flush: Formatter >os-> Formatter

Formatted strings

Assume the following lead-in code for all the subsequent formatted string syntax examples.

let print s =
    File.write_hlt (Bytes.of_string_slice (String.C.Slice.of_string s)) s File.stdout
let name = "Bob"
let age = 42
let children = ["Alice"; "Charlie"; "Megan"]
let children_fmt (>e:effect) children (formatter:Fmt.Formatter e) =
    let rec fn children formatter =
        match children with
        | [] -> ()
        | child_m :: child_n :: [] ->
            formatter
              |> String.fmt child_m
              |> String.fmt " and "
              |> String.fmt child_n
        | child :: children ->
            formatter
              |> String.fmt child
              |> String.fmt ", "
              |> fn children
    fn children formatter

Here's how the formatter API could be used to produce a string:

print (String.Fmt.empty
  |> String.fmt "Hello, "
  |> String.fmt name
  |> String.fmt ", your advanced (binary) age of "
  |> Uns.fmt ~alt:true ~base:Bin age
  |> String.fmt " is excessive.\n\
Next year you will be "
  |> Uns.fmt (succ age)
  |> String.fmt " years old.
But wow, your progeny - "
  |> children_fmt children
  |> String.fmt " - are cute."
  |> String.Fmt.finish
  )

# Prints:
# `|Hello, Bob, your advanced (binary) age of 0b11010 is excessive.
#  |Next year you will be 43 years old.
#  |But wow, your progeny - Alice, Charlie and Megan - are cute.
# `

Proposed syntax

OCaml's format specifiers have the unfortunate quality that they look like strings, but they are augmented with additional type information so that calls to printf-like functions can be statically typed. This requires rather complex types in the printf machinery, but the bigger problem is that the dual nature of strings becomes apparent if the application tries to use a format string reference rather than a string literal as the format specifier to printf. The syntax proposals that follow intentionally embed parameters directly into the string literals, so that it is syntactically impossible to disjoin a format specifier from its parameters.

print "Hello, %(name), your advanced (binary) age of %#bu(age) is excessive.\n\
Next year you will be %(succ age) years old.
But wow, your progeny - %(children_fmt children) - are cute."

This reads well, but it's quite tricky to parse, and it requires inferring types to inform desugaring. The parsing challenge arises because arbitrary code in %(...) can contain nested parens, strings, etc., and we cannot in general recognize the matching ) without parsing the code. That requires a nested expression parser context distinct from the outer parser, and parser nesting can recurse arbitrarily deep in the general case. Although this is a solvable parsing problem, it demands sophistication of external tooling, e.g. syntax highlighting in text editors, that is too onerous to seriously contemplate.

Consider what can be done just with the infix ^ string concatenation operator:

print ("Hello, "^name^", your advanced (binary) age of "
  ^(Uns.to_string ~alt:true ~base:Bin age)) ^" is excessive.\n\
Next year you will be "^(Uns.to_string (succ age))^" years old.
But wow, your progeny - "^(String.Fmt.to_string (children_fmt children))
  ^" - are cute.")

The code is quite succinct for name, but it gets pretty bulky otherwise. Still, this isn't a ridiculous option. Now consider the following syntax:

print "Hello, %s(^name^), your advanced (binary) age of %#bu(^age^) is excessive.\n\
Next year you will be %u(^succ age^) years old.
But wow, your progeny - %f(^children_fmt children^) - are cute."

This syntax can be desugared without type inference because the necessary type information is explicit in the format string, and this can be parsed without resorting to nested parsers, because the %(^...^) delimiters are distinct relative to every other syntactic construct in Hemlock. Furthermore the syntax evokes a synergy between formatted string syntax and string concatenation. In fact, the following mix is valid (note how name is concatenated into the result):

print ("Hello, "^name^", your advanced (binary) age of %#bu(^age^) is excessive.\n\
Next year you will be %u(^succ age^) years old.
But wow, your progeny - %f(^children_fmt children^) - are cute.")

Semantics

The formatted string syntax has some significant restrictions with regard to how it formats values:

Format specifiers

A format specifier can be embedded as %<specifier>(^...^). Some specifier options apply only to a subset of types, e.g. zero padding (<zpad>) only applies to numeric types.

%['<pad>'][<just>][<sign>][<alt>][<zpad>][<width>][.<precision>][<base>][<notation>][<type abbr>(^...^)

Note that some options like <zpad> cannot be used universally, e.g. as %0s(^"hello"^), and such malformed specifiers will cause compilation failure. Similarly any options specified to a partially applied formatter function will cause compilation failure unless the formatter function supports the options.

Examples:

"%s(^name^)"
"%#xu(^age^)"
"%u(^succ age^)"
"%<50f(^List.fmt String.fmt children^)"
"%<50f(^ List.fmt String.fmt (List.mapi children ~f:(fun i child ->
    "%u(^succ i^):%s(^child^)"
  ))^)"
"%f(^(List.fmt String.fmt) children^)"
"%#016xu(^some_uns^)"
"%'␠'<*(^page_width^)s(^title^)\n"
"%'␠'^s(^title^)\n"
"(%'*'98s(^""^))" # Length-100 "(**...**)" string.
"%#xu(^x^) > %#xu(^y^) -> %b(^x > y^)"
"%#bu(^x^) %#ou(^x^) %#u(^x^) %#xu(^x^)"
"\%" # Need to escape; could use "%%` instead, but "\%" feels more correct.
jasone commented 2 years ago

The above design went through two major revisions:

1) %a/%t were merged into a single case, and partially applied formatters were to be type-inferred in the same way as the supported atomic types. 2) Explicit type abbreviations were added to specifiers. This allows desugaring as early as the parsing phase, rather than relying upon type inference to determine which formatter functions to call in the desugared form.

jasone commented 2 years ago

I'm reducing the "implementation" scope of this issue to exclude desugaring and native Hemlock implementation, and creating a separate issue (#158) for refactoring the bootstrap Basis to remove the Format dependency. That leaves the following items:

[migrated to top-level description]

jasone commented 2 years ago

There are some subtleties with regard to indentation of code nested within interpolated strings.

I've explored some of the use cases for formatted strings using an Array cursor test. The first version below is the Format-based form; Fmt-based forms follow.

Format-based:

    let open Format
    let rec fn arr hd cursor tl =
        let index = Cursor.index cursor
        printf "index=%a" Uns.pp index
        printf ", container %a arr"
          Cmp.pp (cmp Uns.cmp (Cursor.container cursor) arr)
        let hd_cursor = Cursor.cmp hd cursor
        printf ", hd %a cursor" Cmp.pp hd_cursor
        let cursor_tl = Cursor.cmp cursor tl
        printf ", cursor %a tl" Cmp.pp cursor_tl
        let () = match hd_cursor with
            | Lt -> printf ", lget=%a" Uns.pp (Cursor.lget cursor)
            | Eq -> printf ", lget=_"
            | Gt -> not_reached ()
        let () = match cursor_tl with
            | Lt -> printf ", rget=%a" Uns.pp (Cursor.rget cursor)
            | Eq -> printf ", rget=_"
            | Gt -> not_reached ()
        printf "\n"

        let length = length arr
        assert (Cursor.(=) (Cursor.seek (Uns.to_sint index) hd) cursor)
        assert (Cursor.(=) hd (Cursor.seek (-(Uns.to_sint index)) cursor))
        assert (Cursor.(=) (Cursor.seek (Uns.to_sint (length - index)) cursor) tl)
        assert (Cursor.(=) cursor (Cursor.seek (-(Uns.to_sint (length - index))) tl))

        match cursor_tl with
        | Lt ->
            let cursor' = Cursor.succ cursor
            assert Cursor.(cursor = (pred cursor'))
            fn arr hd cursor' tl
        | Eq | Gt -> ()
    let arrs = [
        [||]
        [|0|]
        [|0; 1|]
        [|0; 1; 2|]
      ]
    printf "@[<h>"
    List.iter arrs ~f:(fun arr ->
        printf "--- %a ---\n" (pp Uns.pp) arr
        let hd = Cursor.hd arr
        fn arr hd hd (Cursor.tl arr)
      )
    printf "@]"

Fmt-based (dense):

    let rec fn arr hd cursor tl =
        let index = Cursor.index cursor
        let hd_cursor = Cursor.cmp hd cursor
        let cursor_tl = Cursor.cmp cursor tl
        let _ = File.Fmt.stdout |> String.fmt "\
index=%u(^index^), \
container %f(^Cmp.fmt (cmp Uns.cmp (Cursor.container cursor))^) arr, \
hd %f(^Cmp.fmt hd_cursor ^)cursor, \
cursor %f(^Cmp.fmt cursor_tl^) tl, \
lget=%s(^match hd_cursor with Lt -> "%u(^Cursor.lget cursor^)" | Eq -> "_" | Gt -> not_reached ()^), \
rget=%s(^match cursor_tl with Lt -> "%u(^Cursor.rget cursor^)" | Eq -> "_" | Gt -> not_reached ()^)\n"
        let length = length arr
        assert (Cursor.(=) (Cursor.seek (Uns.to_sint index) hd) cursor)
        assert (Cursor.(=) hd (Cursor.seek (-(Uns.to_sint index)) cursor))
        assert (Cursor.(=) (Cursor.seek (Uns.to_sint (length - index)) cursor) tl)
        assert (Cursor.(=) cursor (Cursor.seek (-(Uns.to_sint (length - index))) tl))

        match cursor_tl with
        | Lt ->
            let cursor' = Cursor.succ cursor
            assert Cursor.(cursor = (pred cursor'))
            fn arr hd cursor' tl
        | Eq | Gt -> ()
    let arrs = [
        [||]
        [|0|]
        [|0; 1|]
        [|0; 1; 2|]
      ]
    List.iter arrs ~f:(fun arr ->
        let _ = File.Fmt.stdout |> String.fmt "--- %f(^fmt Uns.fmt arr^) ---"
        let hd = Cursor.hd arr
        fn arr hd hd (Cursor.tl arr)
      )
    let _ = File.Fmt.(flush stdout)

Fmt-based snippet (heavy use of indentation):

        let _ = File.Fmt.stdout |> String.fmt
          "index=%u(^
            index
          ^), container %f(^
            Cmp.fmt (cmp Uns.cmp (Cursor.container cursor))
          ^) arr, hd %f(^\
            Cmp.fmt hd_cursor
          ^) cursor, cursor %f(^
            Cmp.fmt cursor_tl
          ^) tl, lget=%s(^
            match hd_cursor with
            | Lt -> "%u(^
                Cursor.lget cursor
              ^)"
            | Eq -> "_"
            | Gt -> not_reached ()
          ^), rget=%s(^
            match cursor_tl with
            | Lt -> "%u(^
                Cursor.rget cursor
              ^)"
            | Eq -> "_"
            | Gt -> not_reached ()
          ^)\n"

Fmt-based snippet (hybrid of formatted strings and direct formatter calls):

        let _ = File.Fmt.stdout
          |> String.fmt "index=%u(^index^), container "
          |> Cmp.fmt (cmp Uns.cmp (Cursor.container cursor))
          |> String.fmt " arr, hd %f(^Cmp.fmt hd_cursor^) cursor, cursor %f(^Cmp.fmt cursor_tl^) tl"
          |> String.fmt
            match hd_cursor with
            | Lt -> ", lget=%u(^Cursor.lget cursor^)"
            | Eq -> ", lget=_"
            | Gt -> not_reached ()
          |> String.fmt
            match cursor_tl with
            | Lt -> ", rget=%u(^Cursor.rget cursor^)"
            | Eq -> ", rget=_"
            | Gt -> not_reached ()
          |> String.fmt "\n"
jasone commented 2 years ago

Regarding tokenization of formatted strings, we're going to have to add some statefulness to the scanner, because there may be more than two components to each string. Furthermore, because formatted strings can nest, we need a stack of states. The following example shows how the tokens combine position {left,inner,right} and kind of nested code {width,precision,value}.

"...%*(^width^).*(^precision^)r(^value^)...%6.*(^precision^)r(^value2^)...%u(^value3^)..."
~~~~~~~~     ~~~~~~         ~~~~~     ~~~~~~~~~~~         ~~~~~      ~~~~~~~~~      ~~~~~~
lwfstring    pfstring       vfstring  ipfistring          vfstring   ivfstring      rfstring

Format string token sequences conform to paths through a deterministic finite automaton (DFA), with start/accept states and transitions as indicated.

jasone commented 2 years ago

A few tidbits related to recent progress:

jasone commented 2 years ago

Real formatting is a deep topic that has consumed quite a bit of time over the past several days. I have normalized formatting implemented now, though the decimal output does not promise the same precision/stability as does strfromd(3). I don't think this is a huge deal, since we provide bit-accurate base-2/8/16 formatting. That said, someday we may want to precisely match the formatting implemented in other languages. Doing so will require multi-word integer math, for which the nat type will be ideal (once it's integrated).

For future reference, the following documents how to convert to/from decimal format:

Correctly Rounded Binary-Decimal and Decimal-Binary Conversions
David M. Gay
AT&T Bell Labs
Numerical Analysis Manuscript 90-10
November 30, 1990
jasone commented 2 years ago

Bytes.Slice.fmt would be trivial to implement were Array.Slice available. Rather than hacking a solution, I'm going to tackle #90 as a prerequisite.

jasone commented 2 years ago

Yesterday I ran into a problem with the planned interface for Formattable. The original idea was to just have each type supply a fmt function and integrate that into Formatttable but that has two problems:

My initial plan was to add an xfmt for types which support optional formatting parameters, but the String.fmt vs String.xfmt situation violates POLA. After further thought I decided to have a pp function and a fmt function, where pp serves the "pretty printing" role for Identifiable types. That's going to require an intermediate renaming step to move the Format-based pp functions to xpp so that there is no name collision. On the bright side, it will be trivial to spot remaining uses of the old pretty printing code via git grep xpp.

jasone commented 2 years ago

It became clear while using ZInt.fmt to format the exponent for Realer.pp that pretty-printing the suffix needs to be controlled via a separate parameter. I added the pretty parameter, refactored the numeric types to use it, and enhanced String.fmt to additionally support formatting using the raw string syntax.

jasone commented 2 years ago

I ended up enhancing the compact-notation real formatting after dealing with some poor API ergonomics late in the test refactoring. This issue is now fully implemented aside from the scanner enhancements.

jasone commented 2 years ago

While reviewing the diffs I determined that the way I was using precision and notation to control whether trailing zeros are omitted was overly complex, and that the right solution is to add pmode (precision mode) which is either Limited or Fixed. I got that implemented, but during test refactoring I discovered that there are some previously unnoticed bugs in RadixPoint formatting for at least Hex. It's probably going to be a couple days until I can get that fixed.

jasone commented 2 years ago

I split the *fstring task into a separate issue (#217).