boostorg / spirit

Boost.org spirit module
http://boost.org/libs/spirit

Since boost 1.72 spirit::unicode::char_ fails to parse non-ASCII #678

Open timo-schluessler opened 3 years ago

timo-schluessler commented 3 years ago

Sample code to reproduce the issue:

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream> // for std::cout (missing in the original snippet)
#include <string>   // for std::string

int main()
{
   typedef std::string::const_iterator iterator_type;
   namespace qi = boost::spirit::qi;
   namespace unicode = boost::spirit::unicode;

   std::string input("\"Test ⏳\"");
   qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

   iterator_type iter = input.begin();
   iterator_type end = input.end();
   std::string output;
   bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);

   if (r && iter == end)
      std::cout << "successfully parsed " << input << " to " << output << std::endl;
   else
      std::cout << "failed to parse " << input << std::endl;

   return 0;
}

Thanks to sehe, who bisected the issue down to commit 16159fb. Maybe this behavior is intentional; if so, I simply don't understand the purpose and meaning of spirit::unicode and BOOST_SPIRIT_UNICODE.

Kojoley commented 3 years ago
  1. The code contains implementation-defined behavior ([lex/1.1]). Please rewrite it without Unicode characters in the source code.
  2. Do you intentionally use std::string with Unicode? That will encode it with the execution character set.
  3. '"' is not a unicode parser; that is probably the root cause of your issue.
timo-schluessler commented 3 years ago
  1. Sorry, I didn't know this was implementation-defined. In the real program the string is read from a file which is encoded in UTF-8. Please find the updated example below.
  2. Yes. I would like to process the UTF-8 as plain 8-bit chars, since the file format itself (i.e. the control characters) is made up exclusively of ASCII characters, like the " in the example. The fields/values in the format, though, may contain any UTF-8 character. (And because of the way UTF-8 encodes code points, no single byte can be falsely interpreted as a valid ASCII control char; see the byte-level sketch after this list.)
  3. Do you have an idea how to fix this?
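
A minimal illustration of that UTF-8 property (an editorial sketch, not part of the original report; assumes the input is valid UTF-8): every byte of a multi-byte sequence has its high bit set, so it can never collide with an ASCII delimiter such as the quote.

#include <cassert>
#include <string>

int main()
{
   std::string hourglass = "\xe2\x8f\xb3"; // U+23F3 HOURGLASS WITH FLOWING SAND as UTF-8
   for (unsigned char c : hourglass)
      assert(c >= 0x80);                   // never in the ASCII range 0x00-0x7F
   assert((unsigned char)'"' == 0x22);     // so a 0x22 byte is always a real quote
   return 0;
}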
#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream> // for std::cout (missing in the original snippet)
#include <string>   // for std::string

int main()
{
   typedef std::string::const_iterator iterator_type;
   namespace qi = boost::spirit::qi;
   namespace unicode = boost::spirit::unicode;

   std::string input("\"Test \xe2\x8f\xb3\"");
   qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

   iterator_type iter = input.begin();
   iterator_type end = input.end();
   std::string output;
   bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);

   if (r && iter == end)
      std::cout << "successfully parsed " << input << " to " << output << std::endl;
   else
      std::cout << "failed to parse " << input << std::endl;

   return 0;
}

Edit: Typo.

Kojoley commented 3 years ago

Yes. I would like to process the UTF-8 as plain 8-bit chars, since the file format itself (i.e. the control characters) is made up exclusively of ASCII characters, like the " in the example. The fields/values in the format, though, may contain any UTF-8 character. (And because of the way UTF-8 encodes code points, no single byte can be falsely interpreted as a valid ASCII control char.)

This sounds like a duplicate of #675.

Do you have an idea how to fix this?

Use a UTF-8 to UTF-32 conversion iterator; if that does not help, try replacing '"' with unicode::lit('"').
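
For reference, here is a minimal, untested sketch of that first suggestion, assuming the boost::u8_to_u32_iterator adaptor from Boost.Regex's unicode_iterator.hpp header; the bare '"' literals are kept as in the original example and may still need attention, as discussed above.

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>
#include <string>

int main()
{
   namespace qi = boost::spirit::qi;
   namespace unicode = boost::spirit::unicode;

   std::string input("\"Test \xe2\x8f\xb3\"");

   // decode the UTF-8 byte stream into 32-bit code points on the fly
   typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator_type;
   iterator_type first(input.begin()), last(input.end());

   // the attribute is now a sequence of code points, not bytes
   qi::rule<iterator_type, std::u32string(), unicode::space_type> quoted_string
      = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

   std::u32string output;
   bool r = qi::phrase_parse(first, last, quoted_string, unicode::space, output);
   std::cout << (r && first == last ? "parse succeeded" : "parse failed") << std::endl;
   return 0;
}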

3dyd commented 3 years ago

I suppose this is not a duplicate. While the reason is basically the same (and it is well described here), mine is about character classification checks, whereas this one happens before that, when both Qi and X3 implicitly convert a value of signed type to boost::uint32_t while calling unicode::ischar: https://github.com/boostorg/spirit/blob/db8bdf3d718472149b9fdb7f56f8ab2e002748ed/include/boost/spirit/home/support/char_encoding/unicode.hpp#L41-L46

(a 0xE2 char becomes 0xFFFFFFE2 as uint32_t)
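
The conversion is easy to see in isolation (a tiny editorial sketch; assumes a platform where plain char is signed, as on x86):

#include <boost/cstdint.hpp>
#include <cassert>

int main()
{
   char c = '\xE2';            // value -30 where plain char is signed
   boost::uint32_t u = c;      // sign-extended to 0xFFFFFFE2
   assert(u == 0xFFFFFFE2u);
   unsigned char uc = c;       // 0xE2: converting through unsigned char avoids it
   assert(uc == 0xE2);
   return 0;
}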

For example, standard::ischar accounts for the possibility of signed source values this way: https://github.com/boostorg/spirit/blob/db8bdf3d718472149b9fdb7f56f8ab2e002748ed/include/boost/spirit/home/support/char_encoding/standard.hpp#L36-L42

@timo-schluessler your example would work if you used the standard encoding instead of unicode (simply change unicode to standard).

Also, as it turns out, my issue with the standard encoding has already been fixed for Qi (6821c820). So in boost 1.76+ (where this fix has landed) you can also use qi::standard::alpha and other character classification parsers to match ASCII markup (assuming the default C locale is used).

Kojoley commented 3 years ago

Mixing unicode with non-unicode parsers, using unicode parsers on non-unicode input, or non-unicode parsers on unicode input is not supported. I have researched the problem and prototyped a solution (most recently when reviewing #655/#649), but there are behavior choices with no 'one size fits all' answer.

timo-schluessler commented 3 years ago

Thanks for your replies. unicode::lit() does not exist, but using standard instead of unicode fixes my issue. And I think I now get the idea: if the grammar only uses ASCII, I use standard. If the grammar itself contained special characters, I would have to use unicode and also work with wide characters. Does that make sense?
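
For readers landing here, this is what the working variant of the earlier example looks like (an editorial sketch of 3dyd's suggestion; assumes Boost 1.76+ per the comment above). The UTF-8 payload passes through as opaque 8-bit chars because only the ASCII delimiters are interpreted:

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
   typedef std::string::const_iterator iterator_type;
   namespace qi = boost::spirit::qi;
   namespace standard = boost::spirit::standard;

   std::string input("\"Test \xe2\x8f\xb3\"");
   qi::rule<iterator_type, std::string(), standard::space_type> quoted_string
      = qi::lexeme['"' >> +(standard::char_ - '"') >> '"'];

   iterator_type iter = input.begin();
   iterator_type end = input.end();
   std::string output;
   bool r = qi::phrase_parse(iter, end, quoted_string, standard::space, output);

   if (r && iter == end)
      std::cout << "successfully parsed " << input << " to " << output << std::endl;
   else
      std::cout << "failed to parse " << input << std::endl;
   return 0;
}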

Trigve commented 3 years ago

I'm also hit by this while upgrading to boost 1.76.0.

I'm still a bit puzzled after reading the documentation, as there is minimal information about the character encoding namespaces. What's the difference between standard, standard_wide and unicode?

The way I see it, standard uses char and standard_wide uses wchar_t. But the parsed string could be in any encoding when using standard?

Kojoley commented 3 years ago

I'm still a bit puzzled after reading the documentation, as there is minimal information about the character encoding namespaces. What's the difference between standard, standard_wide and unicode?

Obviously the difference is encoding.

The way I see it, standard uses char and standard_wide uses wchar_t. But the parsed string could be in any encoding when using standard?

hhaoao commented 2 years ago

How do I print a unicode _attr?

 error: invalid operands to binary expression ('std::ostream' (aka 'basic_ostream<char>') and 'std::vector<char32_t>')
#define BOOST_SPIRIT_X3_UNICODE
#include <boost/spirit/home/x3.hpp>
#include <iostream>

namespace x3 = boost::spirit::x3;

// x3::unicode::char_ yields 32-bit code points, so the rule's attribute is
// std::vector<char32_t>, which std::ostream cannot print directly.
auto f1 = [](auto& ctx){ std::cout << x3::_attr(ctx) << std::endl; };

x3::rule<class tree, ast::tree> const tree = "tree"; // ast::tree: user's AST type, defined elsewhere

auto const tree_def =
    x3::lexeme[+(x3::unicode::char_ - x3::eol)][f1]
    >> x3::int_
    ;
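
One possible way to print such an attribute (an editorial sketch, not an official X3 answer; assumes the boost::u32_to_u8_iterator adaptor from Boost.Regex's unicode_iterator.hpp) is to re-encode the UTF-32 code points to UTF-8 before streaming:

#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>

void print_utf32(std::vector<char32_t> const& attr)
{
   typedef boost::u32_to_u8_iterator<std::vector<char32_t>::const_iterator> conv;
   std::string utf8(conv(attr.begin()), conv(attr.end()));
   std::cout << utf8 << std::endl;   // assumes a UTF-8 capable terminal
}

int main()
{
   std::vector<char32_t> attr = {U'T', U'e', U's', U't', U' ', U'\u23F3'};
   print_utf32(attr);
   return 0;
}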
tdauth commented 4 days ago

Any updates on this issue? Will this be fixed, or is there any workaround? I use code like this:

typedef std::istreambuf_iterator<byte> IteratorType;
typedef boost::spirit::multi_pass<IteratorType> ForwardIteratorType;

ForwardIteratorType first = boost::spirit::make_default_multi_pass(IteratorType(istream));
ForwardIteratorType last;

// used for backtracking and more detailed error output
namespace classic = boost::spirit::classic;
typedef classic::position_iterator2<ForwardIteratorType> PositionIteratorType;
PositionIteratorType position_begin(first, last);
PositionIteratorType position_end;

try
{
    if (!client::parse(position_begin, position_end, this->sections()))
    {
        throw Exception(_("Parsing error."));
    }
}

...

typedef char byte;
typedef std::basic_istream<byte> InputStream;

Does this mean I now have to change my char into some UTF-8 type to have a basic_istream for UTF-8? While the design decision might make sense, it breaks older code.

saki7 commented 1 day ago

@tdauth I would suggest porting your code to X3 and using the char-related parsers provided in the correct namespaces, as @Kojoley mentioned: https://github.com/boostorg/spirit/issues/678#issuecomment-854764090

Also: (quoting from https://github.com/boostorg/spirit/issues/678#issuecomment-846622462)

Mixing unicode with non-unicode parsers, using unicode parsers on non-unicode input, or non-unicode parsers on unicode input is not supported.

I feel that the problem in this issue is thoroughly explained here, and I think there is no actual issue in Spirit's implementation.

As a sidenote: