RFC: Unicode support

manpages commented 10 years ago

Not sure if bug, so marked it as "RFC".

Consider the following (almost reduced) example:

import pegged.grammar;
import pegged.tester.grammartester;

enum erlang = `
Erlang:
  Atom           <~ ( alpha ( [A-Za-z0-9@_] )* )
  Latin1         <- .

  ## Test helper
  __letterseqs      <- ( Atom    eoi )
                     / ( ^Latin1  eoi )
`;

mixin(grammar(erlang));

unittest {
  auto letterseqsTester = new GrammarTester!(Erlang, "__letterseqs");
  letterseqsTester.assertSimilar(`¥`, `__letterseqs->Latin1`);
}

Attempting to run the unit test results in the following failure:

core.exception.AssertError@../../../.dub/packages/pegged-master/pegged/tester/grammartester.d(51): Erlang failed to parse (left-hand-side).  Details:
Erlang.__letterseqs (failure)
 +-Erlang.Latin1 [0, 1][x"C2"c]
 |  +-any [0, 1][x"C2"c]
 +-eoi failure at line 0, col 1, after " expected end of input, but got "�"

What's the problem with this approach? D supports Unicode in string literals, doesn't it? I'm very confused by this behaviour.

mihails-strasuns commented 10 years ago

This is happening because the "any" rule is implemented like this (in pegged/peg.d):

ParseTree any(ParseTree p)
{
    if (p.end < p.input.length)
        return ParseTree("any", true, [p.input[p.end..p.end+1]], p.input, p.end, p.end+1);
    else
       return ParseTree("any", false, ["any char"], p.input, p.end, p.end);
}

As it does raw string byte slicing, no Unicode decoding happens. popFront & Co. need to be used here to be UTF-8 correct. However, judging by this snippet, Pegged in general does not seem ready to handle Unicode, so much more may need to be fixed.

@PhilippeSigaud what is the current state of Pegged with regard to Unicode?
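
To illustrate the difference between slicing one byte and slicing one code point, here is a minimal sketch on a plain string. It is not Pegged code: `anyCodePoint` is a hypothetical helper, and it assumes `std.utf.stride` is acceptable in the contexts where Pegged runs.

```d
import std.utf : stride;

/// Hypothetical helper (not part of Pegged): returns the slice covering
/// exactly one code point starting at byte offset `pos`, or null at end of input.
string anyCodePoint(string input, size_t pos)
{
    if (pos >= input.length)
        return null;
    // stride gives the number of UTF-8 code units in the code point at `pos`
    immutable len = stride(input, pos);
    return input[pos .. pos + len];
}
```

With this, `"¥"[0 .. 1]` is the lone byte `"\xC2"` (the value shown in the failure above), while `anyCodePoint("¥", 0)` is the full two-byte `"¥"`.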

mihails-strasuns commented 10 years ago

What really worries me is that this seems to work:

import std.encoding;
Latin1String yen;
transcode("¥", yen);
letterseqsTester.assertSimilar(cast(string)yen, `__letterseqs->Latin1`);

This indicates that Pegged indeed treats string types as if they were Latin-1 encoded.
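
A short sketch of why the transcoded input gets through: in UTF-8 the yen sign takes two code units, while in Latin-1 it is a single byte, so a one-byte slice happens to cover the whole character. `toLatin1` is a hypothetical wrapper around `std.encoding.transcode`, added only for illustration.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: transcodes a UTF-8 string to Latin-1.
Latin1String toLatin1(string s)
{
    Latin1String result;
    transcode(s, result);
    return result;
}

unittest
{
    assert("¥".length == 2);                       // UTF-8: bytes C2 A5
    assert(toLatin1("¥").length == 1);             // Latin-1: one byte
    assert(cast(ubyte) toLatin1("¥")[0] == 0xA5);  // that byte is A5
}
```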

PhilippeSigaud commented 10 years ago

I made no effort to be compatible with Unicode, yes.

First because at the time I was writing Pegged (2-3 years ago), the D community itself was not so sure about how to treat Unicode (ranges of chars, of dchars, etc). Second, because I had to stay as far as possible from certain constructs that did not work at compile-time at the time. I think popFront() was one of them...

I'm OK with changing things now if Unicode functions work at CT in D. What should I do with any?

mihails-strasuns commented 10 years ago

I am afraid it may need a full review, not just patching a few functions. Pretty much any place where you slice a string type is likely to be broken for UTF-8.

Also, parsing non-Unicode input (like Latin1) should be supported (right now Pegged accepts only string). I will need to take a more detailed look at how text is processed internally to make a sound proposal.
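
As a rough illustration of what "safe" slice boundaries look like, the sketch below walks a string with `std.utf.decode` and records the byte offset at which each code point starts. `codePointStarts` is a hypothetical helper, not something Pegged provides; a slice that begins or ends at any other offset cuts a character in half.

```d
import std.utf : decode;

/// Hypothetical helper: byte offsets at which each code point of `input` starts.
size_t[] codePointStarts(string input)
{
    size_t[] starts;
    size_t i = 0;
    while (i < input.length)
    {
        starts ~= i;
        decode(input, i); // advances i past one full code point
    }
    return starts;
}
```

For `"a¥b"` the valid boundaries are 0, 1 and 3; offset 2 falls inside the two-byte encoding of `¥`.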

PhilippeSigaud commented 10 years ago

On Wed, Aug 27, 2014 at 7:11 PM, Михаил Страшун notifications@github.com wrote:

> I am afraid it may need a full review, not just patching a few functions. Pretty much any place where you slice a string type is likely to be broken for UTF-8.

Right, I understand. I'd have to look at the code again. Maybe I slice the input only in the terminal nodes (a few functions), but then maybe not.

> Also, parsing non-Unicode input (like Latin1) should be supported (right now Pegged accepts only string). I will need to take a more detailed look at how text is processed internally to make a sound proposal.

I plan to do a rewrite / code cleaning in the coming months. I already have a GLL engine that's OK but still far too slow and would like to add LALR(1) (or GLR) to Pegged.

May I propose you continue your excellent work on DMD/Phobos for now? I'll call you again when I'm ready to add Unicode.

I'll leave this issue open, as it's an important missing feature.

mihails-strasuns commented 10 years ago

I am happy to accept your proposal :) Just ping me when some advice / review on encoding stuff is needed.

dlang-community / Pegged

RFC: Unicode support #140