erlang / eep

Erlang Enhancement Proposals
http://www.erlang.org/erlang-enhancement-proposals/
264 stars 67 forks source link

Add EEP 66: Sigils for String Literals #53

Closed RaimoNiskanen closed 1 year ago

asabil commented 1 year ago

On the topic of extensibility, would it make sense to consider introducing string_transform similar to parse_transform?

RaimoNiskanen commented 1 year ago

@asabil: String transformation can sure be done in a parse transform. Can you sketch how you think a string_transform should work?

RaimoNiskanen commented 1 year ago

I rewrote quite a bit, new proposal:

RaimoNiskanen commented 1 year ago

I stepped back to Sigil Type based verbatim:ness and reduced the scope of Sigil Types and String Delimiters

josevalim commented 1 year ago

I like the reduced scope of the proposal a lot. I have also added a comment about verbatim strings and sigils. Btw, feel free to mark my previous comments as resolved, unfortunately I cannot. :)

RaimoNiskanen commented 1 year ago

I have compared the vanilla sigil ~ with future ~s and ~S, and explained why I think the ~r sigil should be an uncompiled regular expression.

asabil commented 1 year ago

@asabil: String transformation can sure be done in a parse transform. Can you sketch how you think a string_transform should work?

I haven't really given it much thought, but the idea would be to allow something like the code below.

-module(st_test).
-compile({string_transform, sql, sql_transform, []}).

-export([
    test/1
]).

test(Table, ID) ->
    ~sql~"SELECT FROM $[Table] WHERE id = ${ID}".
-module(sql_transform).
-export([
    string_transform/2
]).

-include_lib("syntax_tools/include/merl.hrl").

string_transform(StringNode, _Options) ->
    parse(erl_syntax:string_value(StringNode)),
    {Query, Params} = lists:foldl(fun
        ({param, #{name := ParamName}}, {QueryAcc, ParamAcc}) ->
            ParamValue = merl:var(binary_to_atom(ParamName)),
            {[?Q("<<\"?\">>") | QueryAcc], [?Q("_@ParamValue") | ParamAcc]};
        ({ident, #{name := IdentName}}, {QueryAcc, ParamAcc}) ->
            IdentValue = merl:var(binary_to_atom(ParamName)),
            {[?Q("sql:quote(_@IdentValue)") | QueryAcc], ParamAcc};
        (Node, {QueryAcc, ParamAcc}) when is_binary(Node) ->
            {[?Q("_@Node@") | QueryAcc], ParamAcc}
    end, Template),
    erl_syntax:tuple([
        erl_syntax:list(lists:reverse(Query)),
        erl_syntax:list(lists:reverse(Params))
    ]).

-spec parse(string()) -> [Token] where
     Token :: binary() | {param, #{name := binary()}} | {ident, #{name := binary()}}].
parse(String) ->
   []. %% TODO: implement parsing

Resulting in

1> st_test:test(user, 1).
{[<<"SELECT FROM ">>, <<"\"user\"">>, <<" WHERE id = ?">>], [1]}.

It's maybe nothing more than a simplified way of implementing a parse transform. Not sure how useful that would be?

essen commented 1 year ago

How do sigils play with concatenation? Can you do ~r"\W*((?i)" ++ escape(Word) ++ ~r"(?-i))\W*"? How does that play with potential features like pre compiling the regexps?

RaimoNiskanen commented 1 year ago

How do sigils play with concatenation? Can you do ~r"\W*((?i)" ++ escape(Word) ++ ~r"(?-i))\W*"?

If ~r produces a tuple, then that would not work. But for sigils that produce a string it would. Not for ~S, though since it produces a binary. But [~S"\W*((?i)", escape(Word), ~S"(?-i))\W*"] would be an unicode:charlist() with the right content.

Earlier, when i thought about concatenating sigils it seemed to me that ~x"abc"y ~x"def"y might be possible to combine into ~x"abcdef"y, just like strings are concatenated in the parser, but I guess that for some other future sigil even that is not safe. And with expressions in between the parser cannot do anything.

A regular expression sigil with variable interpolation could be useful in this case, something like: ~rv"\W*((?i)~{Word/escape}(?-i))\W*". That could not be pre-compiled at load time since it cannot be a literal. I guess it could expand to {re,"\\W*((?i)"++escape(Word)++"(?-i))\\W*",""} or {re,[<<"\\W*((?i)"/utf8>>, escape(Word), <<"(?-i))\\W"/utf8>>], ""}

essen commented 1 year ago

How do sigils play with concatenation? Can you do ~r"\W*((?i)" ++ escape(Word) ++ ~r"(?-i))\W*"?

If ~r produces a tuple, then that would not work. But for sigils that produce a string it would. Not for ~S, though since it produces a binary. But [~S"\W*((?i)", escape(Word), ~S"(?-i))\W*"] would be an unicode:charlist() with the right content.

Sounds like sigils are all inconsistent with each other and provide us with many pitfalls.

Earlier, when i thought about concatenating sigils it seemed to me that ~x"abc"y ~x"def"y might be possible to combine into ~x"abcdef"y, just like strings are concatenated in the parser, but I guess that for some other future sigil even that is not safe. And with expressions in between the parser cannot do anything.

It's fine if it's not the parser doing the work, it can be at runtime. But not being able to concatenate ~r sigils is not intuitive since what we see is a string (in the string module sense, so string:concat should work), while under the hood it's a tuple. I expect a few beginner errors.

A regular expression sigil with variable interpolation could be useful in this case, something like: ~rv"\W*((?i)~{Word/escape}(?-i))\W*". That could not be pre-compiled at load time since it cannot be a literal. I guess it could expand to {re,"\\W*((?i)"++escape(Word)++"(?-i))\\W*",""} or {re,[<<"\\W*((?i)"/utf8>>, escape(Word), <<"(?-i))\\W"/utf8>>], ""}

Not a fan of interpolation at all, coming from PHP long ago. Even more so in Erlang with all the Var1, Var2, Var3 shenanigans. That aside, variable interpolation will inevitably lead to some expressions needing additional escaping. But at the same time, if some expressions are written using ~r and others using ~S it's also a problem: now you'll have half the expressions using ~r and half using ~S. But AFAICT that's what it's going to be initially since there won't be interpolation. Not to mention the code debt.

Instead of straight up interpolation I would much prefer that you define the regular expression as ~rv"\W*((?i)~{word/escape}(?-i))\W*" but then have to provide word as argument to the function from re (or other module) as a map #{word => "foo"}. A bit like SQL parameterized queries, with all the pro and cons. This type of input also allows engines to potentially pre compile even if they don't have the variable yet. But this does mean some expressions will need additional escaping.

josevalim commented 1 year ago

I don't think it is adequate to say ~r"foo" is a string and it should behave as a string. In the same way that, 'foo' is not a string and 'foo' ++ 'bar' is not a valid operation. There may be some initial confusion but those should be quickly addressed by stating that different sigils will yield different data types with different operations for them.

Instead of straight up interpolation I would much prefer that you define the regular expression as ~rv"\W((?i)~{word/escape}(?-i))\W" but then have to provide word as argument to the function from re (or other module) as a map #{word => "foo"}.

That's how Python and (and somewhat Java) has historically done this but both are migrating to interpolation based syntax:

I am not advocating in favor of one or the other, but one could look at the on-going discussions for rationale, motivations, etc. Generally speaking there seems to be a push/preference towards interpolation.

essen commented 1 year ago

I don't think it is adequate to say ~r"foo" is a string and it should behave as a string. In the same way that, 'foo' is not a string and 'foo' ++ 'bar' is not a valid operation. There may be some initial confusion but those should be quickly addressed by stating that different sigils will yield different data types with different operations for them.

It's visually a string with ~r in front of it. A single character determines whether you can concatenate or not. And initially it would be the only sigil that you can't handle as a string. It would be more obvious if it was ~r/.../ for example. I don't expect a lot of confusion because regular expressions are rarely used, but this will crash a few programs.

I am not advocating in favor of one or the other, but one could look at the on-going discussions for rationale, motivations, etc. Generally speaking there seems to be a push/preference towards interpolation.

Yes they're generally convenient, especially when it comes to Web development. I suppose I would be fine with them if there was a switch to keep them disabled in 95% of modules. Then in the remaining 5% I can be extra careful about the environment of the functions using interpolation. Otherwise it's just too risky compared to a string templating approach. We could have both too, of course.

RaimoNiskanen commented 1 year ago

Sounds like sigils are all inconsistent with each other and provide us with many pitfalls.

The idea with sigils, as proposed, I'd say, is to transform a string into some term according to some rules. The sigil type defines the created type.

~S creates a binary so it cannot be treated as a string, e.g. ++ does not work. ~s -"- ~C explicitly creates a character list so that is a string, but that is an exception. ~c -"- ~r is suggested to create a 3-tuple.

And ~S"abc" might look like a string with an "operator" ~S prepended, such as not"abc" which is valid syntax and could be a boolean (but it badargs at runtime). My point is that it should not be surprising that an operator may produce a new type.

RaimoNiskanen commented 1 year ago

I changed to propose ~b and ~B to create binaries, and ~s and ~S to produce strings.

RaimoNiskanen commented 1 year ago

I went back to the original set of delimiters, after deciding that it is the sigil type only that affects the tokenizer. For regular expressions it sometimes looks good with / or |, for example.

RaimoNiskanen commented 1 year ago

An internal meeting disagreed with me, so for now we only keep the " string delimiters (and """), and also postpone the ~r sigil.