Closed jasone closed 2 years ago
There are two parts to this design:
1) Formatted I/O API (mandatory) 2) Syntax for formatted strings (optional)
We need to support the possibility of effectless formatting, specifically so that we can use formmatters to create strings without causing side effects. In contrast, OCaml's Printf.fprintf %a
/%t
specifiers support formatted printing only via effectful functions. The API we want requires that state and behavior be encapsulated; first-class modules provide a clean solution. What follows is written in Hemlock so that the effects semantics are explicit, but the API can be stripped of effect typing and implemented in OCaml.
The Fmt.Formatter
module type provides the state/behavior encapsulation.
# Fmt.hmi ----------------------------------------------------------------------
type Formatter: Formatter (e: effect) =
type t
val fmt: string -> t >e-> t
Any given type can implement a fmt
function similar to the following:
type t
# val to_string: t -> string
# val fmt >e: t -> Fmt.Formatter e >e-> Fmt.Formatter e
let fmt t (formatter >e: Fmt.Formatter e) =
formatter |> Fmt.Formatter.fmt (to_string t)
The key point to note is that fmt
has a parametric effect that depends on the formatter, but fmt
is otherwise effectless. Thus fmt
can be used in conjunction with a string/file/whatever formatter without regard for formatter effects.
Following is a sketch of the APIs needed to support formatted strings and I/O. The syntax design proposed thereafter depends upon this API.
# Fmt.hmi ----------------------------------------------------------------------
type Formatter: Formatter (e: effect) =
type t
val fmt: string -> t >e-> t
type just =
| Left
| Center
| Right
type sign =
| Implicit
| Explicit
| Space
type base =
| Bin
| Oct
| Dec
| Hex
type notation =
| ExpE
| ExpP
| Compact
val pad_default: codepoint
val just_default: just
val sign_default: sign
val alt_default: bool
val zpad_default: bool
val width_default: uns
val precision_default: uns
val base_default: base
val notation_default: notation
# Fmt.hm -----------------------------------------------------------------------
let pad_default = ' '
let just_default = Right
let sign_default = Implicit
let alt_default = false
let zpad_default = false
let width_default = 0
let precision_default = 2
let base_default = Dec
let notation_default = Compact
# Uns.hmi ----------------------------------------------------------------------
type uns
val to_string: ?pad:codepoint -> ?just:Fmt.just -> ?sign:Fmt.sign -> ?alt:bool
-> ?zpad:bool -> ?width:uns -> ?base:Fmt.base
-> uns -> string
val fmt >e: ?pad:codepoint -> ?just:Fmt.just -> ?sign:Fmt.sign -> ?alt:bool
-> ?zpad:bool -> ?width:uns -> ?base:Fmt.base
-> uns -> Fmt.Formatter e >e-> Fmt.Formatter e
# String.hmi -------------------------------------------------------------------
type string
val to_string: ?pad:codepoint -> ?just:Fmt.just -> ?alt:bool -> ?width:uns
-> string -> string
val fmt >e: ?pad:codepoint -> ?just:Fmt.just -> ?alt:bool -> ?width:uns
-> string -> Fmt.Formatter e >e-> Fmt.Formatter e
val Fmt:
type Formatter: Formatter =
Fmt.Formatter {|} with type t := t
val empty: Formatter
val finish: Formatter -> string
val to_string >e: (Fmt.Formatter e -> Fmt.Formatter e) -> string
# String.hm --------------------------------------------------------------------
let to_string ?(pad=Fmt.pad_default) ?(just=Fmt.just_default)
?(alt:Fmt.alt_default) ?(width=Fmt.width_default) s =
...
let fmt (>e:effect) ?pad ?just ?alt ?width s (formatter:Fmt.Formatter e) =
formatter |> Fmt.Formatter.fmt (to_string ?pad ?just ?alt ?width s)
let Fmt = {|
type t = string list
let empty = []
let finish t =
String.concat_rev t
let to_string fmt_partial =
empty |> fmt_partial |> finish
|}
# File.hmi ---------------------------------------------------------------------
type file
val Fmt:
type Formatter: Formatter =
Fmt.Formatter os with type t := t
val makeFormatter: file -> Formatter
val flush: Formatter >os-> Formatter
Assume the following lead-in code for all the subsequent formatted string syntax examples.
let print s =
File.write_hlt (Bytes.of_string_slice (String.C.Slice.of_string s)) s File.stdout
let name = "Bob"
let age = 42
let children = ["Alice"; "Charlie"; "Megan"]
let children_fmt (>e:effect) children (formatter:Fmt.Formatter e) =
let rec fn children formatter =
match children with
| [] -> ()
| child_m :: child_n :: [] ->
formatter
|> String.fmt child_m
|> String.fmt " and "
|> String.fmt child_n
| child :: children ->
formatter
|> String.fmt child
|> String.fmt ", "
|> fn children
fn children formatter
Here's how the formatter API could be used to produce a string:
print (String.Fmt.empty
|> String.fmt "Hello, "
|> String.fmt name
|> String.fmt ", your advanced (binary) age of "
|> Uns.fmt ~alt:true ~base:Bin age
|> String.fmt " is excessive.\n\
Next year you will be "
|> Uns.fmt (succ age)
|> String.fmt " years old.
But wow, your progeny - "
|> children_fmt children
|> String.fmt " - are cute."
|> String.Fmt.finish
)
# Prints:
# `|Hello, Bob, your advanced (binary) age of 0b11010 is excessive.
# |Next year you will be 43 years old.
# |But wow, your progeny - Alice, Charlie and Megan - are cute.
# `
OCaml's format specifiers have the unfortunate quality that they look like strings, but they are augmented with additional type information so that calls to printf
-like functions can be statically typed. This requires rather complex types in the printf
machinery, but the bigger problem is that the dual nature of strings becomes apparent if the application tries to use a format string reference rather than a string literal as the format specifier to printf
. The syntax proposals that follow intentionally embed parameters directly into the string literals, so that it is syntactically impossible to disjoin a format specifier from its parameters.
print "Hello, %(name), your advanced (binary) age of %#bu(age) is excessive.\n\
Next year you will be %(succ age) years old.
But wow, your progeny - %(children_fmt children) - are cute."
This reads well, but it's quite tricky to parse, and it requires inferring types to inform desugaring. The parsing challenge arises because arbitrary code in %(...)
can contain nested parens, strings, etc., and we cannot in general recognize the matching )
without parsing the code. That requires a nested expression parser context distinct from the outer parser, and parser nesting can recurse arbitrarily deep in the general case. Although this is a solvable parsing problem, it demands sophistication of external tooling, e.g. syntax highlighting in text editors, that is too onerous to seriously contemplate.
Consider what can be done just with the infix ^
string concatenation operator:
print ("Hello, "^name^", your advanced (binary) age of "
^(Uns.to_string ~alt:true ~base:Bin age)) ^" is excessive.\n\
Next year you will be "^(Uns.to_string (succ age))^" years old.
But wow, your progeny - "^(String.Fmt.to_string (children_fmt children))
^" - are cute.")
The code is quite succinct for name
, but it gets pretty bulky otherwise. Still, this isn't a ridiculous option. Now consider the following syntax:
print "Hello, %s(^name^), your advanced (binary) age of %#bu(^age^) is excessive.\n\
Next year you will be %u(^succ age^) years old.
But wow, your progeny - %f(^children_fmt children^) - are cute."
This syntax can be desugared without type inference because the necessary type information is explicit in the format string, and this can be parsed without resorting to nested parsers, because the %(^...^)
delimiters are distinct relative to every other syntactic construct in Hemlock. Furthermore the syntax evokes a synergy between formatted string syntax and string concatenation. In fact, the following mix is valid (note how name
is concatenated into the result):
print ("Hello, "^name^", your advanced (binary) age of %#bu(^age^) is excessive.\n\
Next year you will be %u(^succ age^) years old.
But wow, your progeny - %f(^children_fmt children^) - are cute.")
The formatted string syntax has some significant restrictions with regard to how it formats values:
fmt
functions are valid values. Support for common formatting options (e.g. padding and numerical base) differs by type:
bool
uns
, int
, u8
, nat
, real
, etc.)codepoint
string
(>e:effect) -> Fmt.Formatter e >e-> Fmt.Formatter e
(specified as e.g. %xf(^Uns.fmt ~alt:true 42^)
)List
, Map
, etc.) provide parametrized fmt
functions which can be composed and used via partial application and/or nesting.A format specifier can be embedded as %<specifier>(^...^)
. Some specifier options apply only to a subset of types, e.g. zero padding (<zpad>
) only applies to numeric types.
%['<pad>'][<just>][<sign>][<alt>][<zpad>][<width>][.<precision>][<base>][<notation>][<type abbr>(^...^)
'<pad>'
: Pad with specified codepoint (default: ' '
)<just>'
: Justify
<
: Left^
: Center>
: Right (default)<sign>
: Sign (default is to print sign only when negative)
+
: Explicit sign, even when positive_
: Space in place of sign if sign is non-negative<alt>
: #
enables alternate formatting
()
_
0b
prefix, groups of 80o
prefix, groups of 30x
prefix, groups of 4'c'
``some string``
<zpad>
: Pad numeric type with leading zeros. This is distinct from <pad>
because zero padding goes between the sign/prefix and the digits.<width>
: Width
42
: Fixed width in codepoints*
: Parametric width in codepoints, e.g. %*(^width^)u(^some_u^)
.<precision>
: Number of digits past the point
.2
: Fixed precision.*
: Parametric precision, e.g. %*(^width^).*(^precision^)r(^some_r^)
<base>
: Numerical base
b
: Binary (integer/real)o
: Octal (integer/real)d
: Decimal (integer/real) (default base)x
: Hexadecimal (integer/real)<notation>
: Real-specific notation
e
: Real in decimal exponential notationp
: Real in binary floating point notation (bp
, op
, xp
*)g
: Real in (non-)exponential notation, whichever is more compact (bg
, og
, dg
*, xg
)<type abbr>
: Type abbreviation
b
: bool
(%b(^...^)
is unambiguous with respect to e.g. binary-formatted uns
— %bu(^...^)
)u
, u64
, i8
, z
, n
, r
, etc.: Numeric type of corresponding literal suffixc
: codepoint
s
: string
f
: Partially applied formatter of type (>e:effect) -> Fmt.Formatter e >e-> Fmt.Formatter e
Note that some options like <zpad>
cannot be used universally, e.g. as %0s(^"hello"^)
, and such malformed specifiers will cause compilation failure. Similarly any options specified to a partially applied formatter function will cause compilation failure unless the formatter function supports the options.
Examples:
"%s(^name^)"
"%#xu(^age^)"
"%u(^succ age^)"
"%<50f(^List.fmt String.fmt children^)"
"%<50f(^ List.fmt String.fmt (List.mapi children ~f:(fun i child ->
"%u(^succ i^):%s(^child^)"
))^)"
"%f(^(List.fmt String.fmt) children^)"
"%#016xu(^some_uns^)"
"%'␠'<*(^page_width^)s(^title^)\n"
"%'␠'^s(^title^)\n"
"(%'*'98s(^""^))" # Length-100 "(**...**)" string.
"%#xu(^x^) > %#xu(^y^) -> %b(^x > y^)"
"%#bu(^x^) %#ou(^x^) %#u(^x^) %#xu(^x^)"
"\%" # Need to escape; could use "%%` instead, but "\%" feels more correct.
The above design went through two major revisions:
1) %a
/%t
were merged into a single case, and partially applied formatters were to be type-inferred in the same way as the supported atomic types.
2) Explicit type abbreviations were added to specifiers. This allows desugaring as early as the parsing phase, rather than relying upon type inference to determine which formatter functions to call in the desugared form.
I'm reducing the "implementation" scope of this issue to exclude desugaring and native Hemlock implementation, and creating a separate issue (#158) for refactoring the bootstrap Basis
to remove the Format
dependency. That leaves the following items:
[migrated to top-level description]
There are some subtleties with regard to indentation of code nested within interpolated strings.
I've explored some of the use cases for formatted strings using an Array
cursor test. The first version below is the Format
-based form; Fmt
-based forms follow.
Format
-based:
let open Format
let rec fn arr hd cursor tl =
let index = Cursor.index cursor
printf "index=%a" Uns.pp index
printf ", container %a arr"
Cmp.pp (cmp Uns.cmp (Cursor.container cursor) arr)
let hd_cursor = Cursor.cmp hd cursor
printf ", hd %a cursor" Cmp.pp hd_cursor
let cursor_tl = Cursor.cmp cursor tl
printf ", cursor %a tl" Cmp.pp cursor_tl
let () = match hd_cursor with
| Lt -> printf ", lget=%a" Uns.pp (Cursor.lget cursor)
| Eq -> printf ", lget=_"
| Gt -> not_reached ()
let () = match cursor_tl with
| Lt -> printf ", rget=%a" Uns.pp (Cursor.rget cursor)
| Eq -> printf ", rget=_"
| Gt -> not_reached ()
printf "\n"
let length = length arr
assert (Cursor.(=) (Cursor.seek (Uns.to_sint index) hd) cursor)
assert (Cursor.(=) hd (Cursor.seek (-(Uns.to_sint index)) cursor))
assert (Cursor.(=) (Cursor.seek (Uns.to_sint (length - index)) cursor) tl)
assert (Cursor.(=) cursor (Cursor.seek (-(Uns.to_sint (length - index))) tl))
match cursor_tl with
| Lt ->
let cursor' = Cursor.succ cursor
assert Cursor.(cursor = (pred cursor'))
fn arr hd cursor' tl
| Eq | Gt -> ()
let arrs = [
[||]
[|0|]
[|0; 1|]
[|0; 1; 2|]
]
printf "@[<h>"
List.iter arrs ~f:(fun arr ->
printf "--- %a ---\n" (pp Uns.pp) arr
let hd = Cursor.hd arr
fn arr hd hd (Cursor.tl arr)
)
printf "@]"
Fmt
-based (dense):
let rec fn arr hd cursor tl =
let index = Cursor.index cursor
let hd_cursor = Cursor.cmp hd cursor
let cursor_tl = Cursor.cmp cursor tl
let _ = File.Fmt.stdout |> String.fmt "\
index=%u(^index^), \
container %f(^Cmp.fmt (cmp Uns.cmp (Cursor.container cursor))^) arr, \
hd %f(^Cmp.fmt hd_cursor ^)cursor, \
cursor %f(^Cmp.fmt cursor_tl^) tl, \
lget=%s(^match hd_cursor with Lt -> "%u(^Cursor.lget cursor^)" | Eq -> "_" | Gt -> not_reached ()^), \
rget=%s(^match cursor_tl with Lt -> "%u(^Cursor.rget cursor^)" | Eq -> "_" | Gt -> not_reached ()^)\n"
let length = length arr
assert (Cursor.(=) (Cursor.seek (Uns.to_sint index) hd) cursor)
assert (Cursor.(=) hd (Cursor.seek (-(Uns.to_sint index)) cursor))
assert (Cursor.(=) (Cursor.seek (Uns.to_sint (length - index)) cursor) tl)
assert (Cursor.(=) cursor (Cursor.seek (-(Uns.to_sint (length - index))) tl))
match cursor_tl with
| Lt ->
let cursor' = Cursor.succ cursor
assert Cursor.(cursor = (pred cursor'))
fn arr hd cursor' tl
| Eq | Gt -> ()
let arrs = [
[||]
[|0|]
[|0; 1|]
[|0; 1; 2|]
]
List.iter arrs ~f:(fun arr ->
let _ = File.Fmt.stdout |> String.fmt "--- %f(^fmt Uns.fmt arr^) ---"
let hd = Cursor.hd arr
fn arr hd hd (Cursor.tl arr)
)
let _ = File.Fmt.(flush stdout)
Fmt
-based snippet (heavy use of indentation):
let _ = File.Fmt.stdout |> String.fmt
"index=%u(^
index
^), container %f(^
Cmp.fmt (cmp Uns.cmp (Cursor.container cursor))
^) arr, hd %f(^\
Cmp.fmt hd_cursor
^) cursor, cursor %f(^
Cmp.fmt cursor_tl
^) tl, lget=%s(^
match hd_cursor with
| Lt -> "%u(^
Cursor.lget cursor
^)"
| Eq -> "_"
| Gt -> not_reached ()
^), rget=%s(^
match cursor_tl with
| Lt -> "%u(^
Cursor.rget cursor
^)"
| Eq -> "_"
| Gt -> not_reached ()
^)\n"
Fmt
-based snippet (hybrid of formatted strings and direct formatter calls):
let _ = File.Fmt.stdout
|> String.fmt "index=%u(^index^), container "
|> Cmp.fmt (cmp Uns.cmp (Cursor.container cursor))
|> String.fmt " arr, hd %f(^Cmp.fmt hd_cursor^) cursor, cursor %f(^Cmp.fmt cursor_tl^) tl"
|> String.fmt
match hd_cursor with
| Lt -> ", lget=%u(^Cursor.lget cursor^)"
| Eq -> ", lget=_"
| Gt -> not_reached ()
|> String.fmt
match cursor_tl with
| Lt -> ", rget=%u(^Cursor.rget cursor^)"
| Eq -> ", rget=_"
| Gt -> not_reached ()
|> String.fmt "\n"
Regarding tokenization of formatted strings, we're going to have to add some statefulness to the scanner, because there may be more than two components to each string. Furthermore, because formatted strings can nest, we need a stack of states. The following example shows how the tokens combine position {left,inner,right} and kind of nested code {width,precision,value}.
"...%*(^width^).*(^precision^)r(^value^)...%6.*(^precision^)r(^value2^)...%u(^value3^)..."
~~~~~~~~ ~~~~~~ ~~~~~ ~~~~~~~~~~~ ~~~~~ ~~~~~~~~~ ~~~~~~
lwfstring pfstring vfstring ipfistring vfstring ivfstring rfstring
Format string token sequences conform to paths through a deterministic finite automaton (DFA), with start/accept states and transitions as indicated.
lvfstring
→ {iwfstring
, ipfstring
, ivfstring
, rfstring
}lpfstring
→{vfstring
}lwfstring
→ {pfstring
, vfstring
}iwfstring
→{pfstring
, vfstring
}ipfstring
→{vfstring
}ivfstring
→{iwfstring
, ipfstring
, ivfstring
, rfstring
}pfstring
→ {vfstring
}vfstring
→ {iwfstring
, ipfstring
, ivfstring
, rfstring
}rfstring
(accept)A few tidbits related to recent progress:
There were some minor inaccuracies in how I imagined the APIs working with regard to first-class modules. The crux of it is that the module signature needs to include an opaque type t
and a state
accessor.
(** Formatter module type, used both for string and file output. *)
module type Formatter = sig
type t
(** Abstract formatter type. *)
val state: t
(** Opaque state. *)
val fmt: string -> t -> t
(** [fmt s t] produces a new state which incorporates [s] into [t]. *)
val sync: t -> t synced
(** [sync t] synchronizes any pending formatting. String formatters produce [To_string string],
whereas file formatters flush any output buffers and produce [Synced t]. *)
end
Uns.fmt
generates over 200 KiB of expected output, which is too voluminous to commit to the repo, especially considering that several other types will have similarly large fmt
test output. @cevans87 suggested hashing the output, and making it possible to toggle verbose output for initial validation and later regression analysis."Some list: %#4f(^List.fmt Uns.fmt us^)"
to specify that the list's primary indentation is 4, i.e. that the elements should be indented 8 spaces. Nested formatters can of course specify progressively larger indentation, and parametric width (e.g. "%*(^indentation^)f(^...^)"
) can even be used in a composable fashion when needed.Real
formatting is a deep topic that has consumed quite a bit of time over the past several days. I have normalized formatting implemented now, though the decimal output does not promise the same precision/stability as does strfromd(3)
. I don't think this is a huge deal, since we provide bit-accurate base-2/8/16 formatting. That said, someday we may want to precisely match the formatting implemented in other languages. Doing so will require multi-word integer math, for which the nat
type will be ideal (once it's integrated).
For future reference, the following documents how to convert to/from decimal format:
Correctly Rounded Binary-Decimal and Decimal-Binary Conversions
David M. Gay
AT&T Bell Labs
Numerical Analysis Manuscript 90-10
November 30, 1990
Bytes.Slice.fmt
would be trivial to implement were Array.Slice
available. Rather than hacking a solution, I'm going to tackle #90 as a prerequisite.
Yesterday I ran into a problem with the planned interface for Formattable
. The original idea was to just have each type supply a fmt
function and integrate that into Formatttable
but that has two problems:
Uns.fmt
make the functions incompatible with the interface.String.pp
, have non-default parametrizations with respect to fmt
.My initial plan was to add an xfmt
for types which support optional formatting parameters, but the String.fmt
vs String.xfmt
situation violates POLA. After further thought I decided to have a pp
function and a fmt
function, where pp
serves the "pretty printing" role for Identifiable
types. That's going to require an intermediate renaming step to move the Format
-based pp
functions to xpp
so that there is no name collision. On the bright side, it will be trivial to spot remaining uses of the old pretty printing code via git grep xpp
.
It became clear while using ZInt.fmt
to format the exponent for Realer.pp
that pretty-printing the suffix needs to be controlled via a separate parameter. I added the pretty
parameter, refactored the numeric types to use it, and enhanced String.fmt
to additionally support formatting using the raw string syntax.
I ended up enhancing the compact-notation real
formatting after dealing with some poor API ergonomics late in the test refactoring. This issue is now fully implemented aside from the scanner enhancements.
While reviewing the diffs I determined that the way I was using precision
and notation
to control whether trailing zeros are omitted was overly complex, and that the right solution is to add pmode
(precision mode) which is either Limited
or Fixed
. I got that implemented, but during test refactoring I discovered that there are some previously unnoticed bugs in RadixPoint
formatting for at least Hex
. It's probably going to be a couple days until I can get that fixed.
I split the *fstring
task into a separate issue (#217).
The
Format
API is powerful, but quite difficult to use correctly. BothFormat
andPrintf
are quite complicated to implement from a type safety perspective, and format string ergonomics are less than optimal. Furthermore, the APIs make effect tracking extremely unwieldy. Design and implement a formatted printing API that cleanly supports custom formatting of inputs, effects tracking, file/stream/string output modes, ideally without excessive cognitive load on the programmer using the facility. This facility may best be implemented with explicit syntax support for format strings, in which case it won't be feasible to use in the bootstrapBasis
library. But if no special syntax is required, favor implementing this in the bootstrapBasis
in order to ease the OCaml -> Hemlock rewrite step of bootstrapping.Tasks:
\\%
protectionFmt
module.String.Fmt
.File.Fmt
.to_string
andfmt
for all atomic types supported by formatted strings.bool
Uns
Sint
Intnb
-based typesIntw
-based typesreal
codepoint
string
fmt
for container types.list
array
option
result
Ordmap.t
Ordset.t
Map.t
Q.t
Deq.t
Stream.t
fmt
for various types which providepp
.unit
Cmp.t
Sign.t
Real.Dir.t
Real.Class.t
Real.Parts.t
Bytes.Slice.t
Text.Cursor.t
Realer.t
fmt
to interfaces inFormattableIntf
.Format
-relatedpp*
symbols toxpp*
.fmt
inFormattableIntf
topp
.Fmt
rather thanFormat
.xpp
from interfaces inFormattableIntf
.Format
-relatedxpp*
pretty printer functionality.