boostorg / spirit

Boost.org spirit module
http://boost.org/libs/spirit
392 stars 161 forks source link

X3: no unicode string literals #496

Closed cmazakas closed 5 years ago

cmazakas commented 5 years ago

The following code fails to compile:

https://wandbox.org/permlink/sVVNLHWXWtvrhMKm

Looking at the source code for literal.hpp, it seems like Unicode is nowhere to be found but standard_wide is.

How does one work with UTF encoded string literals in X3?

djowel commented 5 years ago

Typically, use UTF8 and the UTF8 iterators in boost: http://tinyurl.com/y4z7xgme or some other means. That said, there's no reason not to support unicode. It's just that x3 predated these facilities. I'll work on it.

djowel commented 5 years ago

lit support for 32-bit unicode has been added (develop branch). Take note that X3 supports only full char32_t code points, hence the code changes:

#define BOOST_SPIRIT_X3_UNICODE

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support/char_encoding/unicode.hpp>

#include <boost/utility/string_view.hpp>

#include <iostream>
#include <utility>

namespace x3 = boost::spirit::x3;

using namespace boost::spirit::x3::unicode;

int
main()
{
   auto const input = boost::u32string_view(U"rawr");

   auto pos = input.begin();
   bool r = x3::parse(pos, input.end(), lit(U"rawr"));

   return 0;
}

Also, take note that there are more hidden unicode facilities in spirit (e.g. see https://github.com/boostorg/spirit/blob/develop/include/boost/spirit/home/support/char_encoding/unicode.hpp). And there's a lot! These need to be aded to the char class parsers. Currently, only the basics are provided: https://github.com/boostorg/spirit/blob/develop/include/boost/spirit/home/x3/char/char_class.hpp

There's no unicode for those yet. It would be great if you could contribute/add the missing pieces if you have time.

See https://github.com/boostorg/spirit/issues/499

cmazakas commented 5 years ago

Actually, I'm not opposed to contributing.

So, I inspected the source a bit more. I'm not sure what I'm missing that'd make:

BOOST_SPIRIT_X3_CHAR_CLASSES(unicode)

not Just Work.

It can be a little tough to collaborate through github issues though. Do you happen to be on the cpplang slack? I've tried IRC but it's kind of a ghost town.

djowel commented 5 years ago

Well, examples, tests to make sure everything is good; ... and ahem ahem... docs. As you can see a lot of the heavy lifting has already been implemented. Also it would be nice to provide more utf8 (or even utf16) conversion examples. utf8 is still the most optimal as a front-end, which then converts to char32_t code points that spirit can process directly.

There is a Spirit mailing list, actually, and also a Boost-devel mailing list. For now, we can probably just use github issues and PRs as it is right now.

cmazakas commented 5 years ago

Oh, docs! I can write those!

And yeah, the test suite could be a bit more fleshed out. I can try helping with that too.

In that case, I have like a million questions then. But it'll probably be the most time-efficient if I write some examples with documentation and then have you review that and we can build from there.

djowel commented 5 years ago

Awesome! I'm glad I asked. Actually, beyond the char classes, there's a lot of docs that need to be written :-)