Closed etalian closed 2 years ago
Note, this is just explicitly failing on the unsupported code, it should be a small task to pick up.
Hey! Is this issue still open? If yes, I would like to give it a try. This is my first time working on open source. So any advice I can get on getting started with this would be welcome.
I have gone through the CONTRIBUTIONS and CODE_OF_CONDUCT.
I think @etalian is focused on #1971 so probably fine for you to grab, @vinsdragonis
So, after analyzing the files under /carbon-lang/common/
, I understood that we don't want the stack trace to be printed for the case where a Unicode escape sequence is encountered. The stack trace is printed when we call ExitingStream()
under check_internal.h
. And this was also printed in the given case since we are calling the same from CARBON_RAW_EXITING_STREAM()
under check.h
.
So here's my approach to fixing it. We can create a standalone function that behaves like the CARBON_FATAL()
method defined under check.h
, except that it directly prints the FATAL failure
message without having to print the stack trace only if necessary, print the message saying that escape sequence is not supported and then exits the program.
The function could go something like this.
#define CARBON_FATAL_UNSUPPORTED()
<< "FATAL failure at " << __FILE__ << ":" << __LINE__ << ": "
While this may not be the most efficient approach to patch up the problem, there are still a few other escape sequences and more features yet to be implemented. If there is a need to only print a message and exit at the time, this could prove helpful.
Please let me know if I could go ahead with it. And please excuse me if my approach doesn't seem up to the mark as is my first time trying open source. Also, do let me know if there need to be any additional modifications to my approach if it seems partially correct.
While that would likely make the error message less jarring, the issue here is that an error is raised in the first place. The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case. C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code. Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.
NOTE: in C++, a u
prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::string
s, as the current implementation does).
Agreed with @etalian -- while the message isn't ideal, the intent of this issue is to fix it by adding support, not just change the message. I think UTF-8 is the right solution too.
While that would likely make the error message less jarring, the issue here is that an error is raised in the first place. The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case. C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code. Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.
NOTE: in C++, a
u
prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them instd::string
s, as the current implementation does).
K, I will look into this.
Okay, so after a bit of research on the function mentioned by @etalian, an alternate idea came up. Why not consider writing a function which can manually convert the hex value to a UTF-8 string? If this is something that seems fine, please do let me know.
This is the proposed function that could be used for hex to UTF-8 conversions:
#include <iomanip>
auto stoh(std::string const& in)
{
std::ostringstream os;
for(unsigned char const& c : in)
{
os << std::hex << std::setprecision(1) << std::setw(1)
<< std::setfill('0') << static_cast<char>(c);
}
return os.str();
}
No problems if this is not what is needed. I can always go with the UTF-8 function mentioned by @etalian if that is what's expected.
For a quick description of how UTF-8 works: https://fasterthanli.me/articles/working-with-strings-in-rust#a-very-quick-utf-8-primer
Let's say, to reuse the same example in the article, that our Carbon string contains the Unicode literal \u{E9}
.
This value encodes the symbol é
.
The UTF-8 representation of this letter is two bytes, 0xC3
and 0xA9
(but in general it can be anything between 1 and 4 bytes, it always depends on the specific symbol).
So in this example, the conversion function would take the integer 0xE9
as input, and produce a string containing the two bytes 0xC3
and 0xA9
as output.
As another example, the emoji 😁 corresponds to the Unicode literal \u{1F601}
, which in UTF-8 is 4 bytes (0xf0
0x9f
0x98
0x81
).
Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.
Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.
Sure, I'm happy to collaborate with anyone willing to help 😄
I added a first draft of the new code to my fork at https://github.com/Pritjam/carbon-lang. @vinsdragonis , could you take a look and see what you think/make changes? Specifically, I added code to string_helpers.cpp (in the UnescapeStringLiteral switch-case block) and a test in string_helpers_test.cpp.
@Pritjam, sure I will have a look at this.
I have drafted test cases for the same. Unfortunately, this doesn't seem to recognize the escape sequence yet since we haven't parsed the string to get those literals.
We need to extract the literals in the ParseBlockStringLiteral()
function under string_helpers.cpp
. Then we can parse the literals individually.
While that didn't work, the code for the escape sequence seems logically correct to me, though I will need to confirm the same.
I think the reason the escape sequences in the new test case aren't being recognized is that they aren't formatted quite correctly. Looking at the docs for string literals (docs/design/lexical_conventions/string_literals.md), Unicode code points are specified with \u{HHHH...}, and the new test case omits these brackets.
As for the ParseBlockStringLiteral(), I don't think we need to extract the literals there, since that function calls UnescapeStringLiteral, so we just have to implement the Unicode literal parsing logic in UnescapeStringLiteral. This also makes sense because that's where the old code had a CARBON_FATAL() to show that Unicode wasn't implemented yet.
So I think we have 3 objectives we're trying to accomplish here:
I think with what we have right now, we accomplish the first two objectives. Next, you should add a test case or some other way of testing the functionality (which you pretty much have, we just need to get your test case working) and then we can go ahead and make a pull request of the entire commit chain. What do you think of that?
Yeah, it seems good to me. As for the test case, I have rectified it by sticking to the format \u{HHHH}. Do check it out and let me know if it seems right.
I saw the change you made in the unicode.carbon test. It's almost correct! The one other change is, the string you're comparing against (on line 15 of that file) is still the old string, "str". I'll go ahead and fix that real quick, and then I think we should be good to make a PR.
Also, do you know how to run the tests? It took me a bit to figure out as well, but it's super useful to be able to iteratively test your code. To run the unicode.carbon test, you have to run bazel test //explorer/testdata:string/unicode.carbon.test
. You can see that command contains the path //explorer/testdata:string/unicode.carbon.test
. It starts with //explorer/testdata
because that's a folder with a BUILD file in it, then specifies :string/unicode.carbon.test
to say which test to run.
To run the string_helpers_test.cpp
test, instead run bazel test common/string_helpers_test
. This specifies a test in the common
directory instead.
Alright, the code is good to go. We should be able to make a PR now. Do you want to go ahead and make that?
Sure, I'll do that
I think #2027 resolves this.
While writing doctests for
string_literals.md
, this came up: Input:Output: