**timo-schluessler** opened this issue 3 years ago (status: Open)
Is it supposed to work to parse a `std::string` with `unicode`? That will encode it with the execution character set.

```cpp
#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream> // needed for std::cout (missing in the original snippet)
#include <string>

int main()
{
    typedef std::string::const_iterator iterator_type;
    namespace qi = boost::spirit::qi;
    namespace unicode = boost::spirit::unicode;

    std::string input("\"Test \xe2\x8f\xb3\""); // "Test ⏳" (U+23F3) in UTF-8
    qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string
        = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

    iterator_type iter = input.begin();
    iterator_type end = input.end();
    std::string output;
    bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);
    if (r && iter == end)
        std::cout << "successfully parsed " << input << " to " << output << std::endl;
    else
        std::cout << "failed to parse " << input << std::endl;
    return 0;
}
```
Edit: Typo.
Yes. I would like to process the UTF-8 as plain 8-bit chars, since the file format itself (that is, its control characters, like the `"` in the example) is made up exclusively of ASCII characters. The fields/values in the format, though, may contain any UTF-8 character. (And because of the way UTF-8 encodes unicode characters, no single byte of a multi-byte sequence could be falsely interpreted as a valid ASCII control char.)
This sounds like a duplicate of #675.
Do you have an idea how to fix this?
Use a UTF-8 to UTF-32 conversion iterator; if that does not help, try replacing `'"'` with `unicode::lit('"')`.
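A sketch of the conversion-iterator suggestion, assuming `boost::u8_to_u32_iterator` from Boost.Regex's `unicode_iterator.hpp` header is available; note the attribute becomes UTF-32:

```cpp
#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <boost/regex/pending/unicode_iterator.hpp> // boost::u8_to_u32_iterator
#include <iostream>
#include <string>

int main()
{
    namespace qi = boost::spirit::qi;
    namespace unicode = boost::spirit::unicode;

    // Decode the UTF-8 bytes into UTF-32 code points on the fly, so
    // unicode::ischar and friends see full code points, not raw bytes.
    typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator_type;

    std::string input("\"Test \xe2\x8f\xb3\"");
    iterator_type iter(input.begin()), end(input.end());

    qi::rule<iterator_type, std::u32string(), unicode::space_type> quoted_string
        = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

    std::u32string output; // the attribute is now a sequence of char32_t
    bool r = qi::phrase_parse(iter, end, quoted_string, unicode::space, output);
    std::cout << (r && iter == end ? "parsed ok" : "failed to parse") << std::endl;
    return 0;
}
```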
I suppose this is not a duplicate. While the reason is basically the same (and it is well described here), mine is about character classification checks, and this one happens before that, when both Qi and X3 implicitly convert a value of signed type to `boost::uint32_t` while calling `unicode::ischar`:

https://github.com/boostorg/spirit/blob/db8bdf3d718472149b9fdb7f56f8ab2e002748ed/include/boost/spirit/home/support/char_encoding/unicode.hpp#L41-L46

(the `char` 0xE2 becomes the `uint32_t` 0xFFFFFFE2)
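A minimal demonstration of the sign extension, independent of Spirit (assumes a platform where plain `char` is signed):

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    char c = '\xE2'; // first byte of the UTF-8 encoding of U+23F3
    std::uint32_t bad  = c;                             // sign-extends: 0xFFFFFFE2
    std::uint32_t good = static_cast<unsigned char>(c); // the usual fix: 0x000000E2
    std::cout << std::hex << bad << ' ' << good << std::endl;
    return 0;
}
```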
For example, `standard::ischar` accounts for the possibility of signed source values this way:

https://github.com/boostorg/spirit/blob/db8bdf3d718472149b9fdb7f56f8ab2e002748ed/include/boost/spirit/home/support/char_encoding/standard.hpp#L36-L42
@timo-schluessler your example would work if you used the `standard` encoding instead of `unicode` (simply change `unicode` to `standard`).

Also, as it turns out, my issue with the `standard` encoding has already been fixed for Qi (6821c820). So, in boost 1.76+ (where this fix has landed) you can also use `qi::standard::alpha` and other character classification parsers to match ASCII markup (assuming that the default C locale is used).
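Applied to the opening example, the suggested change looks roughly like this (a sketch, assuming only the encoding namespace needs to change):

```cpp
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
    namespace qi = boost::spirit::qi;
    namespace standard = boost::spirit::standard;
    typedef std::string::const_iterator iterator_type;

    std::string input("\"Test \xe2\x8f\xb3\"");

    // standard::char_ matches plain 8-bit chars; the UTF-8 bytes pass through untouched.
    qi::rule<iterator_type, std::string(), standard::space_type> quoted_string
        = qi::lexeme['"' >> +(standard::char_ - '"') >> '"'];

    iterator_type iter = input.begin(), end = input.end();
    std::string output;
    bool r = qi::phrase_parse(iter, end, quoted_string, standard::space, output);
    std::cout << (r && iter == end ? "parsed: " + output : "failed") << std::endl;
    return 0;
}
```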
Mixing unicode with non-unicode parsers, using unicode parsers on non-unicode input, or non-unicode parsers on unicode input is not supported. I had been researching the problem and prototyping a solution (most recently when reviewing #655/#649), but there are behavior choices with no one-size-fits-all answer.
Thanks for your replies. `unicode::lit()` does not exist, but using `standard` instead of `unicode` fixes my issue. And I somewhat get the sense of it: if the grammar only uses ASCII, I use `standard`. If the grammar contained special characters, I would have to use `unicode` and also work with wide characters. Does that make sense?
I'm also hit by this while upgrading to boost 1.76.0.
I'm still a bit puzzled after reading the documentation, as there is minimal info about the character encoding namespaces. What's the difference between `standard`, `standard_wide` and `unicode`?

How I see it is that `standard` uses `char` and `standard_wide` uses `wchar_t`. But the string parsed could be in any encoding when using `standard`?
> I'm still a bit puzzled after reading the documentation, as there is minimal info about the character encoding namespaces. What's the difference between `standard`, `standard_wide` and `unicode`?

Obviously the difference is encoding.

> How I see it is that `standard` uses `char` and `standard_wide` uses `wchar_t`. But the string parsed could be in any encoding when using `standard`?
- `standard` uses the standard classification functions from the `ctype.h` header; these use the global C locale, and the locale defines the encoding (J.4 Locale-specific behavior).
- `standard_wide` uses the standard classification functions from the `wctype.h` header; the encoding is implementation-defined unless the `__STDC_ISO_10646__` macro is defined (J.3.4 Characters).
- `unicode` is UTF-32.
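To illustrate the locale dependence of the `ctype.h` classification that `standard` relies on (a sketch; the named locale is an assumption and may be absent on a given system):

```cpp
#include <cctype>
#include <clocale>
#include <iostream>

int main()
{
    unsigned char c = 0xE4; // 'ä' in ISO-8859-1

    std::setlocale(LC_ALL, "C");
    std::cout << (std::isalpha(c) ? "alpha" : "not alpha") << std::endl; // "not alpha"

    // With an ISO-8859-1 locale (hypothetical name, if installed), 0xE4 is a letter:
    if (std::setlocale(LC_ALL, "de_DE.ISO-8859-1"))
        std::cout << (std::isalpha(c) ? "alpha" : "not alpha") << std::endl; // "alpha"
    return 0;
}
```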
How to print a unicode `_attr`? I get:

```
error: invalid operands to binary expression ('std::ostream' (aka 'basic_ostream<char>') and 'std::vector<char32_t>')
```
```cpp
#define BOOST_SPIRIT_X3_UNICODE
#include <boost/spirit/home/x3.hpp>
#include <iostream>

// (fragment: assumes using-directives for x3 and a user-defined ast::tree)
auto f1 = [](auto& ctx){ std::cout << _attr(ctx) << std::endl; };

x3::rule<class tree, ast::tree> const tree = "tree";
auto const tree_def =
    lexeme[+(char_ - eol)][f1]
    >> int_
    ;
```
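With `BOOST_SPIRIT_X3_UNICODE`, `char_` produces UTF-32 code points, so the attribute here is a `std::vector<char32_t>`, which `std::ostream` cannot stream directly. One way to print it is to encode it back to UTF-8 first. A minimal sketch with a hand-rolled encoder (the helper name is made up; Spirit also ships a `to_utf8` utility in `boost/spirit/home/x3/support/utility/utf8.hpp` that can serve the same purpose, if your Boost version provides it):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Minimal UTF-32 -> UTF-8 encoder (no validation), enough to print an attribute.
std::string utf32_to_utf8(const std::vector<char32_t>& in)
{
    std::string out;
    for (char32_t c : in)
    {
        if (c < 0x80)
            out += static_cast<char>(c);
        else if (c < 0x800)
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

int main()
{
    std::vector<char32_t> attr{U'T', U'e', U's', U't', U' ', U'\u23F3'};
    std::cout << utf32_to_utf8(attr) << std::endl; // prints "Test ⏳"
    return 0;
}
```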
Any updates on this issue? Will this be fixed or is there any workaround? I use code like this:
```cpp
typedef std::istreambuf_iterator<byte> IteratorType;
typedef boost::spirit::multi_pass<IteratorType> ForwardIteratorType;
ForwardIteratorType first = boost::spirit::make_default_multi_pass(IteratorType(istream));
ForwardIteratorType last;

// used for backtracking and more detailed error output
namespace classic = boost::spirit::classic;
typedef classic::position_iterator2<ForwardIteratorType> PositionIteratorType;
PositionIteratorType position_begin(first, last);
PositionIteratorType position_end;

try
{
    if (!client::parse(position_begin, position_end, this->sections()))
    {
        throw Exception(_("Parsing error."));
    }
}
...
```

where

```cpp
typedef char byte;
typedef std::basic_istream<byte> InputStream;
```
Does this mean I have to change my `char` into some UTF-8 type now to have a `basic_istream` for UTF-8? While the design decision might make sense, it breaks older code.
@tdauth I would suggest porting your code to X3 and using the char-related parsers provided in the correct namespaces, as @Kojoley mentioned: https://github.com/boostorg/spirit/issues/678#issuecomment-854764090
Also (quoting from https://github.com/boostorg/spirit/issues/678#issuecomment-846622462):

> Mixing unicode with non-unicode parsers, using unicode parsers on non-unicode input, or non-unicode parsers on unicode input is not supported.
I feel that the problem in this issue is thoroughly explained here, and I think that there's no actual issue in Spirit's implementation.
As a sidenote: my own rules use `std::u32string::const_iterator`, and they're working fine without any encoding-related issues.

So for UTF-8 files I could simply use `std::u8string` and `char8_t` instead of `std::string` and `char` in Boost Spirit 2? I don't know yet how to migrate to X3, so I am looking for the easiest solution.
@tdauth Yes, you should pass unicode iterators to Spirit. Again, I would strongly recommend using X3, since Spirit.Qi is no longer actively maintained. Feel free to open a new issue with your specific code, if you have further questions.
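A sketch of that approach with UTF-32 input, along the lines of the sidenote above (untested; for UTF-8 files, a UTF-8 to UTF-32 conversion iterator as shown earlier avoids re-encoding the whole input up front):

```cpp
#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
    namespace qi = boost::spirit::qi;
    namespace unicode = boost::spirit::unicode;
    typedef std::u32string::const_iterator iterator_type;

    std::u32string input = U"\"Test \u23F3\""; // already UTF-32, no decoding needed

    qi::rule<iterator_type, std::u32string(), unicode::space_type> quoted_string
        = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

    iterator_type iter = input.begin(), end = input.end();
    std::u32string output;
    bool r = qi::phrase_parse(iter, end, quoted_string, unicode::space, output);
    std::cout << (r && iter == end ? "parsed ok" : "failed to parse") << std::endl;
    return 0;
}
```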
Sample code to reproduce the issue:
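(A minimal sketch along the lines of the opening example, assuming the same failure mode: the UTF-8 bytes are sign-extended before reaching `unicode::ischar`.)

```cpp
#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
    namespace qi = boost::spirit::qi;
    namespace unicode = boost::spirit::unicode;

    std::string input("\xe2\x8f\xb3"); // U+23F3 as raw UTF-8 bytes
    std::string::const_iterator iter = input.begin(), end = input.end();

    // Fails because each byte is sign-extended before unicode::ischar sees it.
    bool r = qi::parse(iter, end, +unicode::char_);
    std::cout << (r && iter == end ? "parsed" : "failed") << std::endl;
    return 0;
}
```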
Thanks to sehe, who bisected the issue down to commit 16159fb. Maybe this behavior is intentional; then I simply don't get the use and meaning of `spirit::unicode` and `BOOST_SPIRIT_UNICODE`.