Open manpages opened 10 years ago
This is happening because the "any" rule is implemented like this (in pegged/peg.d):
ParseTree any(ParseTree p)
{
    if (p.end < p.input.length)
        return ParseTree("any", true, [p.input[p.end..p.end+1]], p.input, p.end, p.end+1);
    else
        return ParseTree("any", false, ["any char"], p.input, p.end, p.end);
}
Since it slices the string by raw bytes, no Unicode decoding happens. popFront & Co. need to be used here to be UTF-8 correct. However, judging by this snippet, Pegged in general does not seem ready to handle Unicode, so much more may need to be fixed.
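To illustrate the difference, here is a minimal sketch of an "any" rule that consumes one whole code point instead of one byte. It uses std.utf.stride to find the length of the UTF-8 sequence at the current position. Tree and anyUtf8 are hypothetical names for this example only; Tree is a simplified stand-in, not Pegged's actual ParseTree, and I have not checked whether this works at compile time.

```d
import std.utf : stride;

// Hypothetical, simplified stand-in for Pegged's ParseTree (assumption, not the real type).
struct Tree
{
    string name;
    bool successful;
    string[] matches;
    string input;
    size_t begin, end;
}

// Hypothetical UTF-8-aware variant of "any": matches one code point, not one byte.
Tree anyUtf8(Tree p)
{
    if (p.end < p.input.length)
    {
        // Number of UTF-8 code units in the code point that starts at p.end.
        immutable n = stride(p.input, p.end);
        return Tree("any", true, [p.input[p.end .. p.end + n]], p.input, p.end, p.end + n);
    }
    return Tree("any", false, ["any char"], p.input, p.end, p.end);
}

unittest
{
    // "¥" occupies two bytes in UTF-8, so the match ends at index 2.
    auto r = anyUtf8(Tree("", true, null, "¥1", 0, 0));
    assert(r.matches == ["¥"]);
    assert(r.end == 2);
}
```

Note that stride throws a UTFException if p.end does not point at the start of a valid sequence, so a real fix would have to decide how to report malformed input.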
@PhilippeSigaud what is the current state of Pegged with regard to Unicode?
What really worries me is that this seems to work:
import std.encoding;
Latin1String yen;
transcode("¥", yen);
letterseqsTester.assertSimilar(cast(string)yen, `__letterseqs->Latin1`);
This indicates that Pegged indeed treats string types as if they were Latin1-encoded.
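A small sketch of why the Latin1 version happens to work: in Latin1 the yen sign is a single byte, so a one-byte slice covers the whole character, while in UTF-8 it takes two bytes and a one-byte slice cuts it in half. utf8Bytes and latin1Bytes are hypothetical helper names used only for this example.

```d
import std.encoding : Latin1String, transcode;
import std.string : representation;

// Raw bytes of a D string, which is UTF-8 encoded.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

// Raw bytes of the same text after transcoding to Latin1.
immutable(ubyte)[] latin1Bytes(string s)
{
    Latin1String l;
    transcode(s, l);
    return cast(immutable(ubyte)[]) l;
}

unittest
{
    assert(utf8Bytes("¥") == [0xC2, 0xA5]); // two code units in UTF-8
    assert(latin1Bytes("¥") == [0xA5]);     // one byte in Latin1
}
```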
I made no effort to be compatible with Unicode, yes.
First because at the time I was writing Pegged (2-3 years ago), the D community itself was not so sure about how to treat Unicode (ranges of chars, of dchars, etc). Second, because I had to stay as far as possible from certain constructs that did not work at compile-time at the time. I think popFront() was one of them...
I'm OK with changing things now if Unicode functions work at CT in D. What should I do with any?
I am afraid it may need a full review, not just patching a few functions. Pretty much any place where you slice a string type is likely to be broken for UTF-8.
Also, parsing non-Unicode input (like Latin1) should be supported too (right now Pegged accepts only string). I will need to take a more detailed look at how text is processed internally to make a sound proposal.
On Wed, Aug 27, 2014 at 7:11 PM, Михаил Страшун notifications@github.com wrote:
I am afraid it may need a full review, not just patching a few functions. Pretty much any place where you slice a string type is likely to be broken for UTF-8.
Right, I understand. I'd have to look at the code again. Maybe I slice the input only in the terminal nodes (a few functions), but then maybe not.
Also, parsing non-Unicode input (like Latin1) should be supported too (right now Pegged accepts only string). I will need to take a more detailed look at how text is processed internally to make a sound proposal.
I plan to do a rewrite / code cleanup in the coming months. I already have a GLL engine that's OK but still far too slow, and I would like to add LALR(1) (or GLR) to Pegged.
May I propose you continue your excellent work on DMD/Phobos for now? I'll call you again when I'm ready to add Unicode.
I'll leave this issue open, as it's an important missing feature.
I am happy to accept your proposal :) Just ping me when some advice / review on encoding stuff is needed.
Not sure if bug, so marked it as "RFC".
Consider the following (almost reduced) example:
Attempting to run the unit test results in the following failure:
What's the problem with this approach? D supports Unicode in string literals, doesn't it? I'm very confused by this behaviour.