Closed RaimoNiskanen closed 1 year ago
@asabil: String transformation can sure be done in a parse transform. Can you sketch how you think a string_transform
should work?
I rewrote quite a bit, new proposal:
I stepped back to Sigil Type based verbatim:ness and reduced the scope of Sigil Types and String Delimiters
I like the reduced scope of the proposal a lot. I have also added a comment about verbatim strings and sigils. Btw, feel free to mark my previous comments as resolved, unfortunately I cannot. :)
I have compared the vanilla sigil ~
with future ~s
and ~S
, and explained why I think the ~r
sigil should be an uncompiled regular expression.
@asabil: String transformation can sure be done in a parse transform. Can you sketch how you think a
string_transform
should work?
I haven't really given it much thought, but the idea would be to allow something like the code below.
-module(st_test).
-compile({string_transform, sql, sql_transform, []}).
-export([
test/1
]).
test(Table, ID) ->
~sql~"SELECT FROM $[Table] WHERE id = ${ID}".
-module(sql_transform).
-export([
string_transform/2
]).
-include_lib("syntax_tools/include/merl.hrl").
string_transform(StringNode, _Options) ->
parse(erl_syntax:string_value(StringNode)),
{Query, Params} = lists:foldl(fun
({param, #{name := ParamName}}, {QueryAcc, ParamAcc}) ->
ParamValue = merl:var(binary_to_atom(ParamName)),
{[?Q("<<\"?\">>") | QueryAcc], [?Q("_@ParamValue") | ParamAcc]};
({ident, #{name := IdentName}}, {QueryAcc, ParamAcc}) ->
IdentValue = merl:var(binary_to_atom(ParamName)),
{[?Q("sql:quote(_@IdentValue)") | QueryAcc], ParamAcc};
(Node, {QueryAcc, ParamAcc}) when is_binary(Node) ->
{[?Q("_@Node@") | QueryAcc], ParamAcc}
end, Template),
erl_syntax:tuple([
erl_syntax:list(lists:reverse(Query)),
erl_syntax:list(lists:reverse(Params))
]).
-spec parse(string()) -> [Token] where
Token :: binary() | {param, #{name := binary()}} | {ident, #{name := binary()}}].
parse(String) ->
[]. %% TODO: implement parsing
Resulting in
1> st_test:test(user, 1).
{[<<"SELECT FROM ">>, <<"\"user\"">>, <<" WHERE id = ?">>], [1]}.
It's maybe nothing more than a simplified way of implementing a parse transform. Not sure how useful that would be?
How do sigils play with concatenation? Can you do ~r"\W*((?i)" ++ escape(Word) ++ ~r"(?-i))\W*"
? How does that play with potential features like pre compiling the regexps?
How do sigils play with concatenation? Can you do
~r"\W*((?i)" ++ escape(Word) ++ ~r"(?-i))\W*"
?
If ~r
produces a tuple, then that would not work. But for sigils that produce a string it would. Not for ~S
, though since it produces a binary. But [~S"\W*((?i)", escape(Word), ~S"(?-i))\W*"]
would be an unicode:charlist()
with the right content.
Earlier, when i thought about concatenating sigils it seemed to me that ~x"abc"y ~x"def"y
might be possible to combine into ~x"abcdef"y
, just like strings are concatenated in the parser, but I guess that for some other future sigil even that is not safe. And with expressions in between the parser cannot do anything.
A regular expression sigil with variable interpolation could be useful in this case, something like: ~rv"\W*((?i)~{Word/escape}(?-i))\W*"
.
That could not be pre-compiled at load time since it cannot be a literal. I guess it could expand to {re,"\\W*((?i)"++escape(Word)++"(?-i))\\W*",""}
or {re,[<<"\\W*((?i)"/utf8>>, escape(Word), <<"(?-i))\\W"/utf8>>], ""}
How do sigils play with concatenation? Can you do
~r"\W*((?i)" ++ escape(Word) ++ ~r"(?-i))\W*"
?If
~r
produces a tuple, then that would not work. But for sigils that produce a string it would. Not for~S
, though since it produces a binary. But[~S"\W*((?i)", escape(Word), ~S"(?-i))\W*"]
would be anunicode:charlist()
with the right content.
Sounds like sigils are all inconsistent with each other and provide us with many pitfalls.
Earlier, when i thought about concatenating sigils it seemed to me that
~x"abc"y ~x"def"y
might be possible to combine into~x"abcdef"y
, just like strings are concatenated in the parser, but I guess that for some other future sigil even that is not safe. And with expressions in between the parser cannot do anything.
It's fine if it's not the parser doing the work, it can be at runtime. But not being able to concatenate ~r
sigils is not intuitive since what we see is a string (in the string
module sense, so string:concat
should work), while under the hood it's a tuple. I expect a few beginner errors.
A regular expression sigil with variable interpolation could be useful in this case, something like:
~rv"\W*((?i)~{Word/escape}(?-i))\W*"
. That could not be pre-compiled at load time since it cannot be a literal. I guess it could expand to{re,"\\W*((?i)"++escape(Word)++"(?-i))\\W*",""}
or{re,[<<"\\W*((?i)"/utf8>>, escape(Word), <<"(?-i))\\W"/utf8>>], ""}
Not a fan of interpolation at all, coming from PHP long ago. Even more so in Erlang with all the Var1
, Var2
, Var3
shenanigans. That aside, variable interpolation will inevitably lead to some expressions needing additional escaping. But at the same time, if some expressions are written using ~r
and others using ~S
it's also a problem: now you'll have half the expressions using ~r
and half using ~S
. But AFAICT that's what it's going to be initially since there won't be interpolation. Not to mention the code debt.
Instead of straight up interpolation I would much prefer that you define the regular expression as ~rv"\W*((?i)~{word/escape}(?-i))\W*"
but then have to provide word
as argument to the function from re
(or other module) as a map #{word => "foo"}
. A bit like SQL parameterized queries, with all the pro and cons. This type of input also allows engines to potentially pre compile even if they don't have the variable yet. But this does mean some expressions will need additional escaping.
I don't think it is adequate to say ~r"foo"
is a string and it should behave as a string. In the same way that, 'foo'
is not a string and 'foo' ++ 'bar'
is not a valid operation. There may be some initial confusion but those should be quickly addressed by stating that different sigils will yield different data types with different operations for them.
Instead of straight up interpolation I would much prefer that you define the regular expression as ~rv"\W((?i)~{word/escape}(?-i))\W" but then have to provide word as argument to the function from re (or other module) as a map #{word => "foo"}.
That's how Python and (and somewhat Java) has historically done this but both are migrating to interpolation based syntax:
I am not advocating in favor of one or the other, but one could look at the on-going discussions for rationale, motivations, etc. Generally speaking there seems to be a push/preference towards interpolation.
I don't think it is adequate to say
~r"foo"
is a string and it should behave as a string. In the same way that,'foo'
is not a string and'foo' ++ 'bar'
is not a valid operation. There may be some initial confusion but those should be quickly addressed by stating that different sigils will yield different data types with different operations for them.
It's visually a string with ~r
in front of it. A single character determines whether you can concatenate or not. And initially it would be the only sigil that you can't handle as a string. It would be more obvious if it was ~r/.../
for example. I don't expect a lot of confusion because regular expressions are rarely used, but this will crash a few programs.
I am not advocating in favor of one or the other, but one could look at the on-going discussions for rationale, motivations, etc. Generally speaking there seems to be a push/preference towards interpolation.
Yes they're generally convenient, especially when it comes to Web development. I suppose I would be fine with them if there was a switch to keep them disabled in 95% of modules. Then in the remaining 5% I can be extra careful about the environment of the functions using interpolation. Otherwise it's just too risky compared to a string templating approach. We could have both too, of course.
Sounds like sigils are all inconsistent with each other and provide us with many pitfalls.
The idea with sigils, as proposed, I'd say, is to transform a string into some term according to some rules. The sigil type defines the created type.
~S
creates a binary so it cannot be treated as a string, e.g. ++
does not work.
~s
-"-
~C
explicitly creates a character list so that is a string, but that is an exception.
~c
-"-
~r
is suggested to create a 3-tuple.
And ~S"abc"
might look like a string with an "operator" ~S
prepended, such as not"abc"
which is valid syntax and could be a boolean (but it badargs at runtime). My point is that it should not be surprising that an operator may produce a new type.
I changed to propose ~b
and ~B
to create binaries, and ~s
and ~S
to produce strings.
I went back to the original set of delimiters, after deciding that it is the sigil type only that affects the tokenizer. For regular expressions it sometimes looks good with /
or |
, for example.
An internal meeting disagreed with me, so for now we only keep the "
string delimiters (and """
), and also postpone the ~r
sigil.
On the topic of extensibility, would it make sense to consider introducing
string_transform
similar toparse_transform
?