etalian commented 2 years ago

While writing doctests for string_literals.md, this came up: Input:

var fret: String = "I would 'twere something that would fret the string,\n" +
                   "The master-cord on's \u{2764}\u{FE0F}!";

Output:

Stack trace:
 #0 0x000055c57065d50b backtrace .../llvm-project-14.0.6.src/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4277:13
 #1 0x000055c570a2779b llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:13
 #2 0x000055c5706eaa00 Carbon::Internal::ExitingStream::ExitingStream() /proc/self/cwd/./common/check_internal.h:35:3
 #3 0x000055c570925732 Carbon::UnescapeStringLiteral(llvm::StringRef, int, bool) /proc/self/cwd/common/string_helpers.cpp:85:11
 #4 0x000055c5708bd5e2 has_value .../llvm/14.0.6_1/bin/../include/c++/v1/optional:321:22
 #5 0x000055c5708bd5e2 operator bool .../llvm/14.0.6_1/bin/../include/c++/v1/optional:975:64
 #6 0x000055c5708bd5e2 operator==<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > .../llvm/14.0.6_1/bin/../include/c++/v1/optional:1277:31
 #7 0x000055c5708bd5e2 Carbon::ProcessSingleLineString(llvm::StringRef, Carbon::ParseAndLexContext&, unsigned long) /proc/self/cwd/explorer/syntax/lex_scan_helper.cpp:46:17
 #8 0x000055c5708ada92 yylex(Carbon::Arena*, void*, Carbon::ParseAndLexContext&) /proc/self/cwd/explorer/syntax/lexer.lpp:353:20
 #9 0x000055c5708c838a Carbon::Parser::parse() /proc/self/cwd/bazel-out/k8-fastbuild/bin/explorer/syntax/parser.cpp:0:38
#10 0x000055c5708c604d Carbon::Parser::operator()() /proc/self/cwd/bazel-out/k8-fastbuild/bin/explorer/syntax/parser.cpp:1098:5
#11 0x000055c5708a45e7 Carbon::ParseImpl(void*, Carbon::Arena*, std::__1::basic_string_view<char, std::__1::char_traits<char> >, bool) /proc/self/cwd/explorer/syntax/parse.cpp:31:60
#12 0x000055c5708a3503 Carbon::Parse(Carbon::Arena*, std::__1::basic_string_view<char, std::__1::char_traits<char> >, bool) /proc/self/cwd/explorer/syntax/parse.cpp:64:28
#13 0x000055c5706e316f index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:785:12
#14 0x000055c5706e316f index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1435:59
#15 0x000055c5706e316f __holds_alternative<1UL, Carbon::Error, Carbon::AST> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1461:14
#16 0x000055c5706e316f holds_alternative<Carbon::AST, Carbon::Error, Carbon::AST> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1467:10
#17 0x000055c5706e316f ok /proc/self/cwd/./common/error.h:91:36
#18 0x000055c5706e316f Carbon::Main(llvm::StringRef, int, char**) /proc/self/cwd/explorer/main.cpp:73:3
#19 0x000055c5706e214a index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:785:12
#20 0x000055c5706e214a index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1435:59
#21 0x000055c5706e214a __holds_alternative<1UL, Carbon::Error, Carbon::Success> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1461:14
#22 0x000055c5706e214a holds_alternative<Carbon::Success, Carbon::Error, Carbon::Success> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1467:10
#23 0x000055c5706e214a ok /proc/self/cwd/./common/error.h:91:36
#24 0x000055c5706e214a Carbon::ExplorerMain(llvm::StringRef, int, char**) /proc/self/cwd/explorer/main.cpp:91:69
#25 0x000055c5706e15a1 main /proc/self/cwd/explorer/main_bin.cpp:0:10
#26 0x00007f116997e0b3 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b3)
#27 0x000055c570621e0e _start (.../execroot/carbon/bazel-out/k8-fastbuild/bin/explorer/explorer+0x4e6e0e)
FATAL failure at common/string_helpers.cpp:85: \u is not yet supported in string literals

jonmeow commented 2 years ago

Note, this is just explicitly failing on the unsupported code, it should be a small task to pick up.

vinsdragonis commented 2 years ago

Hey! Is this issue still open? If yes, I would like to give it a try. This is my first time working on open source. So any advice I can get on getting started with this would be welcome.

I have gone through the CONTRIBUTIONS and CODE_OF_CONDUCT.

jonmeow commented 2 years ago

I think @etalian is focused on #1971 so probably fine for you to grab, @vinsdragonis

vinsdragonis commented 2 years ago

So, after analyzing the files under /carbon-lang/common/, I understood that we don't want the stack trace to be printed for the case where a Unicode escape sequence is encountered. The stack trace is printed when we call ExitingStream() under check_internal.h. And this was also printed in the given case since we are calling the same from CARBON_RAW_EXITING_STREAM() under check.h.

So here's my approach to fixing it. We can create a standalone function that behaves like the CARBON_FATAL() method defined under check.h, except that it directly prints the FATAL failure message without having to print the stack trace only if necessary, print the message saying that escape sequence is not supported and then exits the program.

The function could go something like this.

#define CARBON_FATAL_UNSUPPORTED()
    << "FATAL failure at " << __FILE__ << ":" << __LINE__ << ": "

While this may not be the most efficient approach to patch up the problem, there are still a few other escape sequences and more features yet to be implemented. If there is a need to only print a message and exit at the time, this could prove helpful.

Please let me know if I could go ahead with it. And please excuse me if my approach doesn't seem up to the mark as is my first time trying open source. Also, do let me know if there need to be any additional modifications to my approach if it seems partially correct.

etalian commented 2 years ago

While that would likely make the error message less jarring, the issue here is that an error is raised in the first place. The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case. C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code. Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.

NOTE: in C++, a u prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::strings, as the current implementation does).

jonmeow commented 2 years ago

Agreed with @etalian -- while the message isn't ideal, the intent of this issue is to fix it by adding support, not just change the message. I think UTF-8 is the right solution too.

vinsdragonis commented 2 years ago

While that would likely make the error message less jarring, the issue here is that an error is raised in the first place. The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case. C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code. Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.

NOTE: in C++, a u prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::strings, as the current implementation does).

K, I will look into this.

vinsdragonis commented 2 years ago

Okay, so after a bit of research on the function mentioned by @etalian, an alternate idea came up. Why not consider writing a function which can manually convert the hex value to a UTF-8 string? If this is something that seems fine, please do let me know.

vinsdragonis commented 2 years ago

This is the proposed function that could be used for hex to UTF-8 conversions:

#include <iomanip>

auto stoh(std::string const& in)
{
    std::ostringstream os;

    for(unsigned char const& c : in)
    {
        os << std::hex << std::setprecision(1) << std::setw(1)
           << std::setfill('0') << static_cast<char>(c);
    }

    return os.str();
}

No problems if this is not what is needed. I can always go with the UTF-8 function mentioned by @etalian if that is what's expected.

etalian commented 2 years ago

For a quick description of how UTF-8 works: https://fasterthanli.me/articles/working-with-strings-in-rust#a-very-quick-utf-8-primer Let's say, to reuse the same example in the article, that our Carbon string contains the Unicode literal \u{E9}. This value encodes the symbol é. The UTF-8 representation of this letter is two bytes, 0xC3 and 0xA9 (but in general it can be anything between 1 and 4 bytes, it always depends on the specific symbol). So in this example, the conversion function would take the integer 0xE9 as input, and produce a string containing the two bytes 0xC3 and 0xA9 as output.

As another example, the emoji 😁 corresponds to the Unicode literal \u{1F601}, which in UTF-8 is 4 bytes (0xf0 0x9f 0x98 0x81).

Pritjam commented 2 years ago

Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.

vinsdragonis commented 2 years ago

Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.

Sure, I'm happy to collaborate with anyone willing to help 😄

Pritjam commented 2 years ago

I added a first draft of the new code to my fork at https://github.com/Pritjam/carbon-lang. @vinsdragonis , could you take a look and see what you think/make changes? Specifically, I added code to string_helpers.cpp (in the UnescapeStringLiteral switch-case block) and a test in string_helpers_test.cpp.

vinsdragonis commented 2 years ago

@Pritjam, sure I will have a look at this.

vinsdragonis commented 2 years ago

I have drafted test cases for the same. Unfortunately, this doesn't seem to recognize the escape sequence yet since we haven't parsed the string to get those literals. We need to extract the literals in the ParseBlockStringLiteral() function under string_helpers.cpp. Then we can parse the literals individually.

While that didn't work, the code for the escape sequence seems logically correct to me, though I will need to confirm the same.

Pritjam commented 2 years ago

I think the reason the escape sequences in the new test case aren't being recognized is that they aren't formatted quite correctly. Looking at the docs for string literals (docs/design/lexical_conventions/string_literals.md), Unicode code points are specified with \u{HHHH...}, and the new test case omits these brackets.

As for the ParseBlockStringLiteral(), I don't think we need to extract the literals there, since that function calls UnescapeStringLiteral, so we just have to implement the Unicode literal parsing logic in UnescapeStringLiteral. This also makes sense because that's where the old code had a CARBON_FATAL() to show that Unicode wasn't implemented yet.

Pritjam commented 2 years ago

So I think we have 3 objectives we're trying to accomplish here:

Add the unicode functionality to the explorer
I want to have a commit incorporated as a contribution to Carbon
You want to have a commit incorporated as a contribution to Carbon

I think with what we have right now, we accomplish the first two objectives. Next, you should add a test case or some other way of testing the functionality (which you pretty much have, we just need to get your test case working) and then we can go ahead and make a pull request of the entire commit chain. What do you think of that?

vinsdragonis commented 2 years ago

Yeah, it seems good to me. As for the test case, I have rectified it by sticking to the format \u{HHHH}. Do check it out and let me know if it seems right.

Pritjam commented 2 years ago

I saw the change you made in the unicode.carbon test. It's almost correct! The one other change is, the string you're comparing against (on line 15 of that file) is still the old string, "str". I'll go ahead and fix that real quick, and then I think we should be good to make a PR.

Also, do you know how to run the tests? It took me a bit to figure out as well, but it's super useful to be able to iteratively test your code. To run the unicode.carbon test, you have to run bazel test //explorer/testdata:string/unicode.carbon.test. You can see that command contains the path //explorer/testdata:string/unicode.carbon.test. It starts with //explorer/testdata because that's a folder with a BUILD file in it, then specifies :string/unicode.carbon.test to say which test to run.

To run the string_helpers_test.cpp test, instead run bazel test common/string_helpers_test. This specifies a test in the common directory instead.

Pritjam commented 2 years ago

Alright, the code is good to go. We should be able to make a PR now. Do you want to go ahead and make that?

vinsdragonis commented 2 years ago

Sure, I'll do that

vinsdragonis commented 2 years ago

2027

jonmeow commented 2 years ago

I think #2027 resolves this.

carbon-language / carbon-lang

Support \u in string literals #1982

2027