Create eep-0062.md: String interpolation syntax

erlang / eep

Erlang Enhancement Proposals

http://www.erlang.org/erlang-enhancement-proposals/

264 stars 66 forks source link

Create eep-0062.md: String interpolation syntax #45

Closed TD5 closed 9 months ago

TD5 commented 1 year ago

Reference implementation: https://github.com/erlang/otp/pull/7343

josevalim commented 1 year ago

Hi @TD5! Awesome to see these discussions being carried forward (plus a reference implementation 😍).

Interpolation syntax

I agree #{...} is a confusing choice for Erlang. However, I dislike the choice of ~ for two reasons:

The opening (~) and closing terminators (~) are the same, which makes the string harder to parse for humans
Today, I can write this code: io:format(<<"José is ~.1f-years-old"/utf8>>, [98.7]). If I want to convert it to the new nicer syntax but continue using format, I now have to escape the ~ character

I would suggest going with ${...} or something similar. $ is already about characters in Erlang (so it is slightly related) and it seems to be a common syntax according to Rosetta Code.

What is interpolatable?

The proposal recognizes that there different audiences for strings: human and developers. However, because Erlang lacks an extensibility mechanism, those choices need to be made upfront. For example, imagine I have a "decimal" library for working with arbitrary precision number on base 10. Said values have a human representation in them, so what happens when I do:

bf"José has ~Decimal~EUR"

Decimal will most likely be a record. If we format it as a record, the value won't be useful for humans. Furthermore, formatting it as a record can be a security issue, because I can accidentally interpolate a composite value with sensitive information.

The other thing is that you can very well have a mixture of both human and developer representation in the same string. For example, take this code:

try
  Expr
catch
  Kind:Reason:Stack ->
    logger:info(bd"got ~Kind~ with ~Reason~\n~format_stacktrace(Stack)~")
end.

In the example above, Kind is an atom, so it doesn't matter, but I want Reason as a term and Stack as human. So I don't think the upfront delimiters are flexible enough?

Given that:

What can be interpolated is not customizable/extensible
We want to distinguish between user and developer
And Erlang already has an specification that addresses all of these cases in io:format

I would suggest to:

The default binary interpolation expects binaries exclusively
We should support formats within the new syntax as the way to pretty print other data types

There is a very similar discussion happening in Java and that's where they landed:

String table = FMT."""
    Description     Width    Height     Area
    %-12s\{zone[0].name}  %7.2f\{zone[0].width}  %7.2f\{zone[0].height}     %7.2f\{zone[0].area()}
    %-12s\{zone[1].name}  %7.2f\{zone[1].width}  %7.2f\{zone[1].height}     %7.2f\{zone[1].area()}
    %-12s\{zone[2].name}  %7.2f\{zone[2].width}  %7.2f\{zone[2].height}     %7.2f\{zone[2].area()}
    \{" ".repeat(28)} Total %7.2f\{zone[0].area() + zone[1].area() + zone[2].area()}
    """;

It is worth reading their proposal and motivations too, because I feel many of them apply to Erlang. If we want, we could clear it up a bit by using ~{...} for interpolation with an optional format:

b"Foo ~{ThisIsBinary}"
b"Integer ~b{...}"

The list modifiers

The list modifiers are tricky because, when you interpolate, are you interpolating an iolist, iodata, charlist, or chardata? And even if we pick one or the other, are we going to traverse the list to validate it? Or are we going to leave this up to users?

Elixir forces them to be a list of characters but that requires validating the list on every concatenation (expensive). At least they are hardly used in Elixir. So it may be worth considering if we really want a list modifier?

TD5 commented 1 year ago

Thanks @josevalim for your thoughts and feedback!

Syntax

The opening (\~) and closing terminators (\~) are the same, which makes the string harder to parse for humans

I am not entirely sure that's true - strings use double-quotes for opening and closing the syntax, but I don't think they're typically considered hard to read as a result.

Today, I can write this code: io:format(<<"José is ~.1f-years-old"/utf8>>, [98.7]). If I want to convert it to the new nicer syntax but continue using format, I now have to escape the ~ character

I definitely agree with this point, however, it presumably be true for any syntax we use, unless it's not currently a valid character in a string (which doesn't leave us many options!). For example, if we used ${...} for interpolation, now I'd need to escape by bash script binary io:format(<<"#!/bin/bash\nX = ${Y}"/utf8>>, [Var]).. I think this applies in principle to almost any syntax choice.

From this perspective, I think the precise syntax is largely a matter of taste, and a question of how each of us personally uses Erlang and what sort of strings we typically make. As such, I'd probably lean towards making the syntax whatever will make the OTP team most happy to merge the change.

What is interpolatable?

However, because Erlang lacks an extensibility mechanism, those choices need to be made upfront. [...] Decimal will most likely be a record. If we format it as a record, the value won't be useful for humans. Furthermore, formatting it as a record can be a security issue, because I can accidentally interpolate a composite value with sensitive information.

I agree with this. The proposal offers two suggestions:

Factor out the formatting into a dedicated function which can be tested and reused, e.g. bf"José has ~decimal:fmt(Decimal)~EUR". Functions already offer a means for abstraction and reuse, so to me it makes sense to leverage them where flexibility is required
- We don't rule out adding more advanced formatting à la io:format in the future, but since it's tricky to get formatting modifiers right, and what we have here is already a somewhat non-trivial addition to the language that users will need time to acclimatise to, land on idioms as a community, and so on, I suggest delaying deciding whether/how to do formatting modifiers to a subsequent EEP & implementation. I think, from a technical perspective, adding formatting modifiers later should the straight-forward enough.

List modifiers

The list modifiers are tricky because, when you interpolate, are you interpolating an iolist, iodata, charlist, or chardata?

If we follow my suggestion above and avoid custom formatting modifiers, then we end up in a bit of a sweet-spot and get a natural solution to this conundrum for free:

For bf"...", the "rule" is that we only accept values which are string-y: iolist, iodata, charlist, chardata, etc. In that case, we attempt to print them as a string, and throw an exception if it transpires the value actually isn't a valid string-y type
For bd"...", we always format the list without any magic, so if it's a nested list of characters, we print a nested list of characters and don't try to infer whether it's intended to be a string or not.

josevalim commented 1 year ago

I am not entirely sure that's true - strings use double-quotes for opening and closing the syntax, but I don't think they're typically considered hard to read as a result.

Good point! I guess once we get syntax highlighting, it is fine.

I definitely agree with this point, however, it presumably be true for any syntax we use

Correct, any choice we make will require escaping, but there are certainly choices that we can make that are less frequent in present and future Erlang code. My concern is exactly in picking up a character that already has a meaning in Erlang. For example, your choice of using ${...} is less likely to occur because it requires two characters, ${, in order to activate it. We could even pick ~{...} or ~~ and those are by definition less likely to occur in an string than a single ~.

I suggest delaying deciding whether/how to do formatting modifiers to a subsequent EEP & implementation.

I am afraid we can't fully postpone deciding it (but we can postpone implementing it). Let's suppose we ship this feature as is using ~...~ for escaping. Then, if we choose to support ~B{...} in the future for formatting (or whatever syntax) later on, we can potentially break any string using the combination above. So if we may want to support formatting, we should at least discuss possible syntaxes and reserve them, even if we don't precisely decide or implement them right now.

For bf"...", the "rule" is that we only accept values which are string-y: iolist, iodata, charlist, chardata, etc.

Unfortunately we cannot support both iolist+iodata and charlist+chardata. That's because the integers in a list are ambiguous: [233] will lead to different binaries depending on one vs the other. So you need to know which one you have and how you will encode it.

Plus there is another issue: if you have a iolist, iodata, charlist, chardata, then you most likely do not want to create a binary and instead write [b"my binary" | CharData] instead. This is somewhat an issue in Elixir, where iodata/chardata are not as prevalent as we would like because it is easy to fall into the habit of string interpolation / string concatenation.

I can think of two answers here:

Only accept binaries in interpolation, as per above
Related to #46: introduce two modifiers then: one for binaries b"..." (iodata/iolist driven) and another one for unicode u"..." (chardata/charlist driven)? But then does it mean we have uf, ud, bf, bd, lf, ld?

essen commented 1 year ago

I'm against interpolation in general, and highly against interpolation like done here. The core problem is that in Erlang we don't know whether functions have side effects or not and so it is not possible to restrict the functions called during interpolation to side-effect free functions. This can lead to all sorts of funny problems that end up very time consuming to debug. Therefore my time is better spent avoiding interpolation entirely and I tend to log only variables to avoid any issues.

Interpolation restricted to variables defined previously is far less contentious but this would make the proposal little more than a new formatter (that uses variables instead of implicitly numbered placeholders, but lacks formatting options) along with some syntax sugar on top.

I also question the usefulness of this considering we already have iolists. I can already give an iolist to logger:error and other functions. The interpolation doesn't bring much to the table on top of that since it doesn't know how to format variables if they're not a string or equivalent.

TD5 commented 1 year ago

@essen, thanks for your input!

it is not possible to restrict the functions called during interpolation to side-effect free functions. This can lead to all sorts of funny problems that end up very time consuming to debug

Could you elaborate on this? Logically, I see string interpolation as a short-hand for a function that builds the interpolated string, e.g.:

X = bf"My name is ~Name~ and my age is ~get_age()~"

is not too dissimilar to:

X = build_binary_format("My name is ", Name, " and my age is ", get_age())

And, indeed, the reference implementation provided largely works like that.

To me, then, string interpolation doesn't seem to change the debugging story, because we already have the issue of side-effects in arguments to function calls, etc.

I also question the usefulness of this considering we already have iolists.

I think I see what you mean here. I think the usefulness depends on how you use Erlang. For example, you may want to use binaries extensively for their memory-compactness, canonical representation (so equivalent strings are structurally equal, where as with iolists etc, the nesting can differ), etc., then interpolation is potentially very valuable. Performance profiles can also vary a lot: with the latest binary optimisations in OTP, and their use under-the-hood in the reference implementation for string interpolation, I think for certain workloads you could get very good performance for working with strings which might otherwise get quite clumsy and verbose.

The interpolation doesn't bring much to the table on top of that since it doesn't know how to format variables if they're not a string or equivalent

I am not sure what you mean here. If you don't have a string, you can use the bd"..." syntax to format the string in the usual Erlang-y way, and if you want to customise the rendering, you can use a function call. Importantly, though, the call would be inline with the string rather than separate, e.g.:

logger:error("~s Line: ~B, Column: ~B, ~ts", [Time, Line, Col, format_msg(Error)])

vs.

logger:error(bf"~Time~ Line: ~Line~, Column: ~Col~, ~format_msg(Error)~")

One real-world example I see a lot is a bug in the above code that looks like this:

logger:error("~s Line: ~B, Column: ~B, ~ts", [Time, Col, Line, format_msg(Error)])

Here, Line and Col have been accidentally swapped. The code runs, but gives subtly incorrect output. In the interpolated string situation, the bug is a bit easier to spot, imo:

logger:error(bf"~Time~ Line: ~Col~, Column: ~Line~, ~format_msg(Error)~")

I certainly think it's fair to say the value of the feature depends on your use cases, and perhaps it wouldn't particularly suit your needs, but hopefully it wouldn't hinder you, and the net benefit to the Erlang userbase would make it a worthwhile addition to the language.

TD5 commented 1 year ago

Regarding the precise syntax of ~...~ vs. ${...} - I am not super opinionated on the specifics, so I'd be happy to update the EEP and reference implementation if there was a clear consensus on an alternative. ${...} seems like a perfectly good choice to me.

Unfortunately we cannot support both iolist+iodata and charlist+chardata. That's because the integers in a list are ambiguous: [233] will lead to different binaries depending on one vs the other. So you need to know which one you have and how you will encode it.

Ah, what I meant around this is that for each formatting mode, we only ever interpret values in one way. For bf"...", it's as a list of unicode codepoints, and for bd"..." it's as a list of numbers (and if that interpretation fails, we throw).

In this case:

bf"~[233]~" -> "é"
bd"~[233]~" -> "[233]"

Plus there is another issue: if you have a iolist, iodata, charlist, chardata, then you most likely do not want to create a binary

I think this is another case where it depends on your usage. For my usage, I often do want a binary for reasons I discussed elsewhere in this thread (memory compactness, performance and using a single, simple type consistently everywhere). I think it's true that the existing stdlib consumes iolist etc conveniently (it'll accept any of them), but when it produces a value that you consume, it's tricker because ideally there would only be one representation to worry about. Instead, I see people calling lists:flatten(...) and similar on the result of stdlib calls.

essen commented 1 year ago

To me, then, string interpolation doesn't seem to change the debugging story, because we already have the issue of side-effects in arguments to function calls, etc.

Yes, but it allows hiding function calls (and especially functions with side effects) into otherwise innocuous log lines. You are not typically looking at the log lines in the code when debugging, and it can take a while to figure out that there lies the problem. If the function calls are in the log message itself, it's even easier to miss them.

I think I see what you mean here. I think the usefulness depends on how you use Erlang. For example, you may want to use binaries extensively for their memory-compactness, canonical representation (so equivalent strings are structurally equal, where as with iolists etc, the nesting can differ), etc., then interpolation is potentially very valuable. Performance profiles can also vary a lot: with the latest binary optimisations in OTP, and their use under-the-hood in the reference implementation for string interpolation, I think for certain workloads you could get very good performance for working with strings which might otherwise get quite clumsy and verbose.

I doubt that because in order to produce your one binary you have to produce the intermediate values that you will append to this binary, and therefore generate a bunch of garbage. This garbage will eventually get GC and that's where the performance cost will be. It's the same as just starting from an iolist and converting it to binary. Perhaps you can get tiny improvements here or there, but you could get them in a iolist_to_binary(InterpolationEquivalent) as well.

The interpolation doesn't bring much to the table on top of that since it doesn't know how to format variables if they're not a string or equivalent

I am not sure what you mean here.

I think you've showed in your examples what I meant. The format string was doing a type check and lacking a named variable (and could do better formatting), while the interpolation is using named variables (and embedded function calls) and lacking type checks. But you still have to call that format function yourself, within the interpolation string or not.

Note that I'm not against named variables in format strings, it's the one thing that's missing in the way Erlang does string formatting. But perhaps it would be better to improve format strings to allow using named variables (taken either from the scope or from a map/proplist).

I certainly think it's fair to say the value of the feature depends on your use cases, and perhaps it wouldn't particularly suit your needs, but hopefully it wouldn't hinder you, and the net benefit to the Erlang userbase would make it a worthwhile addition to the language.

Afraid there's no way it can't hinder me because I still have to read and debug code other people wrote.

TD5 commented 1 year ago

I doubt that because in order to produce your one binary you have to produce the intermediate values that you will append to this binary, and therefore generate a bunch of garbage

The reference implementation attempts to avoid this by leveraging the optimisation where a binary with one consumer is mutated in-place rather than allocating again. For example, see this section of code where we make sure to always append to a binary with a single owner.

It may be that my reference implementation doesn't make use of the optimisation correctly, but I think in principle, the design in this EEP does not necessarily require that the implementation generate a lot of garbage.

zuiderkwast commented 1 year ago

It's good if binary strings, multiline strings and interpolation can be orthogonal features, independent of each other. I wouldn't want one to imply another.

If we add b"" for binary strings and bf"" for binary with interpolation, we should also add f"" for regular strings (charlists) with interpolation. Even if backticks are chosen for binary strings, we could have the f prefix added to these too: f`My binary name is {name}` or with multiline strings e.g. fb"""...""".

A motivation for the f prefix on string literals is Python's f-strings f"My name is {name}" from PEP 489 which is exactly this. (IMO we can also follow Python's choice of curly braces for interpolation which looks nice and simple.)

If we add other features in the future, like regex literals, it can be combined with other prefixes like rf"^{name}$" for regex with interpolation.

TD5 commented 1 year ago

The reference implementation attempts to avoid this by leveraging the optimisation where a binary with one consumer is mutated in-place rather than allocating again. For example, see this section of code where we make sure to always append to a binary with a single owner.

I should probably add to this that, in my experience, iolists end up being quite slow and memory intensive since they produce a lot of garbage, especially when the binary elements are small, since the overhead of the references in each cons cell can be large relative to the actual binary data being stored. In contrast, binaries (especially when used in such a way that is amenable to OTP's new optimisations) can be lightning fast by virtue of avoiding a vast number of allocations, and garbage collected allocations in particular.

kikofernandez commented 1 year ago

Thanks for this contribution.

At the moment, this PR involves many small decisions that need to be consistent and well-thought, e.g., symbol for sigils, syntax, and its implementation may add complexity to the lexer and parser.

I do not think we can figure all these details immediately, before OTP-27. The OTP team is taking small steps figuring out these design decisions.

Thanks for your contribution. We will keep you posted in a timely manner

mikpe commented 11 months ago

I have some issues with this proposal.

In an interpolated string lf"foo~E~bar", is E scanned as a fragment of the enclosing string (or binary) or not? This affects whether nested string literals need quoting or not. From discussions elsewhere I guess the leading and trailing parts of the surrounding string are scanned essentially like opening and closing brackets containing some text, and E is scanned using the scanner's start state, but this PR doesn't actually say that.
It talks about permitting "user" or "developer" oriented formatting of terms, but it never defines what those actually are. It needs to define them in terms of pre-existing Erlang I/O formatting primitives, ideally io_lib:format.
Disallowing any but these two options is too naive. It'll force many use cases to have to wrap the interpolated expressions with local formatting calls, which effectively turns the top-level interpolated string to an expression concatenating string literals and locally constructed strings. We already have that in the language.
The notation for producing an interpolated binary using a magic marker and a string literal is odd. Why not just say f<<"foo ~E~ bar">>? I don't see the need for the l or b markers.
The meaning of an interpolated string (or binary) should just be an equivalent expression in the Erlang base language. This PR should IMO clearly specify what that is. (The implementation may want to use new or hidden primitives, but the model needs to only refer to the base language.)
The motivation for this solution is weak, only talking vaguely about "making constructing compound strings more readable". What exact problem does constructing compound strings have currently? What options are there for addressing that problem? Why was this particular solution chosen? For starters, I believe that named parameters in io_lib:format calls needs to be investigated.

RaimoNiskanen commented 11 months ago

It talks about permitting "user" or "developer" oriented formatting of terms, but it never defines what those actually are. It needs to define them in terms of pre-existing Erlang I/O formatting primitives, ideally io_lib:format.

It might be that what we have in io_lib:format does not fit the use case good enough. The format characters we have are for printing integer, float, char, string, or any term. This EEP wants to have a "user" format for the ones with a "simple" presentation format: integer, float, string, and atom.

It is possible to add helper functions in io_lib if we should find this distinction useful.

Disallowing any but these two options is too naive. It'll force many use cases to have to wrap the interpolated expressions with local formatting calls, which effectively turns the top-level interpolated string to an expression concatenating string literals and locally constructed strings. We already have that in the language.

Given the variables A, B, C, and the local formatting function s/1, we have:

["Lorem ipsum ", s(A), ", ", s(B), ", ", s(C), ";"]

There are a lot of "fly droppings" there. Compare to a simple Bash interpolation, assuming that we do not need a formatting helper:

"Lorem ipsum $A, $B, $C;"

And we can augment this with a notation for local formatting helper:

"Lorem ipsum ${A/s}, ${B/s}, ${C/s};"

Not having to write all the I/O list punctuation increases readability, in my opinion.

The notation for producing an interpolated binary using a magic marker and a string literal is odd. Why not just say f<<"foo ~E~ bar">>? I don't see the need for the l or b markers.

In the binary syntax a /utf8 marker is also needed, so compare f<<"foo~E~bar"/utf8>> to b"foo~E~bar". I think this boils down to getting a lean way to write binary string literals to make them the new go-to string type.

TD5 commented 11 months ago

I think this boils down to getting a lean way to write binary string literals to make them the new go-to string type.

This is a core motivation (and then we add corresponding syntax for list-strings, for easy backward compatibility). Binaries have the benefit of being compact (no need to have a pointer and a tag for each character; characters are laid out contiguously in memory). Your point about "fly droppings" it key, too, since string templates are, like it or not, very common in real-world code in my experience, and the syntactic noise from these things can quickly hide even trivial formatting errors. Moreover, the new syntax makes some errors essentially unrepresentable, for example, there's not really an equivalent of an io_lib:format call where the format string arity doesn't align with the argument list length.

kikofernandez commented 9 months ago

I have accepted the PR so that it shows in the EEP list.

Conversations may continue here if desired.