Closed p8nut closed 1 year ago
It looks like \e
is among the conditionally supported escape sequences. For more see: https://cplusplus.github.io/CWG/issues/505.html
Is there a strong need for this feature?
Yes I am currently working on a project that strongly need that feature.
My only other solution would be to escape the body before parsing by wave but seem to be a hacky solution and it won't fully fit my need.
Unfortunately those types of escape sequences are compiler-specific extensions. I suspect it would be a low priority (and require special code to enable/disable support) to go into mainline Wave.
If you are interested in implementing this yourself, I suggest changing the lexer definition here. This file is input to a tool called re2c; you can use it to generate a new cpp_re.inc
file in the same directory like this:
cd include/boost/wave/cpplexer/re2clex
re2c -b -o cpp_re.inc cpp.re
Then you can rebuild Wave and see if it works for you. Good luck!
Thank you i'll take a look at it. and keep you updated.
@hkaiser we could maybe put this in the "non-strict" lexer... any thoughts?
I found some more information about the \e
support in gcc here and in clang here
At least i think it could be interesting that an error could be raise when an invalid character like an unknown escaped character is found.
or maybe the parser can process any escaped printable character like gcc and clang do.
what do you think ?
Yeah, I'm not crazy about the "break up the string" result we currently have. Maybe an error would be better? Hopefully @hkaiser will weigh in.
I think, though, that \e
is not the same as e
- it produces an ESC character, if I understand correctly. So that would be special.
I retract that last comment. We just pass through the character and the backslash both when it's a valid control character.
@p8nut did you try making that change to Wave I suggested?
It looks like
\e
is among the conditionally supported escape sequences. For more see: https://cplusplus.github.io/CWG/issues/505.htmlIs there a strong need for this feature?
I think just generally accepting the strings that contain those 'conditionally-supported' characters would be ok. At least clang does that without diagnostics. The current logic in Wave was implemented before CWG 505.
Hello @jefftrull,
i try to change the escape EscapeSequence
in the cpp.re
, strict_cpp.re
, idl.re
where
diff --git a/include/boost/wave/cpplexer/re2clex/cpp.re b/include/boost/wave/cpplexer/re2clex/cpp.re
index 9839901..3567689 100644
--- a/include/boost/wave/cpplexer/re2clex/cpp.re
+++ b/include/boost/wave/cpplexer/re2clex/cpp.re
@@ -31,7 +31,7 @@ IntegerSuffix = [uU] [lL]? | [lL] [uU]?;
LongIntegerSuffix = [uU] ("ll" | "LL") | ("ll" | "LL") [uU]?;
MSLongIntegerSuffix = "u"? "i64";
Backslash = [\\] | "??/";
-EscapeSequence = Backslash ([abfnrtv?'"] | Backslash | "x" HexDigit+ | OctalDigit OctalDigit? OctalDigit?);
+EscapeSequence = Backslash ([\x20-\x7E] | Backslash | "x" HexDigit+ | OctalDigit OctalDigit? OctalDigit?);
HexQuad = HexDigit HexDigit HexDigit HexDigit;
UniversalChar = Backslash ("u" HexQuad | "U" HexQuad HexQuad);
Newline = "\r\n" | "\n" | "\r";
when i try with now it produce as expected the following output
[/tmp/./test.cpp:1:1]char
[/tmp/./test.cpp:1:5]
[/tmp/./test.cpp:1:6]*
[/tmp/./test.cpp:1:7]c
[/tmp/./test.cpp:1:8]
[/tmp/./test.cpp:1:9]=
[/tmp/./test.cpp:1:10]
[/tmp/./test.cpp:1:11]"test\e"
[/tmp/./test.cpp:1:19];
[/tmp/./test.cpp:1:20]
[/tmp/./test.cpp:2:1]
this solution is a little bit more too inclusive and escape any printable character which is maybe not what we want...
@hkaiser are you comfortable with this? Also, should we distinguish in this case between the "strict" and non-strict lexers?
The standard says that (https://eel.is/c++draft/lex.ccon#nt:conditional-escape-sequence-char):
conditional-escape-sequence-char:
any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char,
or the characters N, o, u, U, or x
So I think we shouldn't use [\x20-\x7E]
as a catch-all regex.
ok it is possible to negate the regexp which look like that now
[^\\01234567NouUx\x00-\x1F\x7F]
which is based on the eel.is definition.
but with that definition if there is an incomplete definition it break again
char *c = "test\N";
produce
[/tmp/test.cpp:1:1]char
[/tmp/test.cpp:1:5]
[/tmp/test.cpp:1:6]*
[/tmp/test.cpp:1:7]c
[/tmp/test.cpp:1:8]
[/tmp/test.cpp:1:9]=
[/tmp/test.cpp:1:10]
[/tmp/test.cpp:1:11]"
[/tmp/test.cpp:1:12]test
[/tmp/test.cpp:1:16]\
[/tmp/test.cpp:1:17]N
[/tmp/test.cpp:1:18]"
[/tmp/test.cpp:1:19];
[/tmp/test.cpp:1:20]
[/tmp/test.cpp:2:1]
maybe the eager match could be an option or when we match a stringlit we could use the eager match because inside a string lit (i think) there is no need to process octal and or any other sequence
Frankly, I'm not sure why \N
is not allowed. I'm not aware of any possibly conflicting escaped character sequence. We should conform to the Standard, however (there is usually reason to the madness ;-) ).
My only point is that when we encounter a stringlit it should not break it apart because there is an escape character inside
My only point is that when we encounter a stringlit it should not break it apart because there is an escape character inside
Should we report an error instead?
for my project, the best solution would be to get the stringlit
otherwise an exception could do.
(sorry for the long time to answer :-) )
\N
is excluded because it begins a "named universal character escape", new in C++23.
Currently our approach to many invalid things is to report them as a sequence of simpler tokens instead, so breaking up a string literal in the face of an erroneous escape sequence seems OK to me.
I propose that we:
slex
uses a Spirit Classic escape sequence parser, which we will need to fork.@p8nut can you make progress with your fork? I think we will miss the 1.81 cutoff.
\N
is excluded because it begins a "named universal character escape", new in C++23.
Right, I missed that. Thanks!
Upon reflection - and seeing what would need to change to support it - I decided to try implementing just \e
and \E
, which seem to be supported by both gcc and Clang. I couldn't find any others they both supported. Let me know what you think of the above PR and whether it would be enough. I think we could squeeze it in for 1.81.
During the parsing, it seem that Wave is not able to parse properly a string containing the escape character
\e
. Using a slightly modified quick_start exemple you can reproduce the following output. change the line 93 withstd::cout << "["<< current_position <<"]" << (*first).get_value() << std::endl;
to display each line and token on a different line.the following C file produce
C file:
Output
the following C file produce
C file:
Output
as you can see the
stringlit
is not detected anymore as the right type.is there any solution to handle that right ?