Capture regular expression groups when lexing.

alex / rply

An attempt to port David Beazley's PLY to RPython, and give it a cooler API.

BSD 3-Clause "New" or "Revised" License


Capture regular expression groups when lexing. #27

Open amcgregor opened 10 years ago

amcgregor commented 10 years ago

Certain token constructs represent wrapped elements, such as text enclosed in quotes. For these, the parser step would need to pre-process the token to remove the quotes and identify flags (in the case of Python-style prefixed strings, anyway). Why do the work twice?

The attached changes add slots and update __repr__ implementations where needed, and include a test for the "quoted string" case, demonstrating use. Documentation is also updated to clearly demonstrate the "quoted string" use case and update the presented object repr output.

alex commented 10 years ago

Thanks for contributing! Is the primary motivation here performance?

I'm not sure it's possible to make this compatible with the RPython version, so that will require some thinking before this can land.

amcgregor commented 10 years ago

I've been investigating the translation failure; it was my understanding that container types (lists, tuples, etc.) must be type-homogeneous internally (groups are always tuples of strings) and that None was an allowed exception to this (as there may be no groups, None is a possible value instead of a tuple). I may be missing something, though, as I'm very new to RPython.

The primary motivator is de-duplication of work. The regex during the tokenization step will already be capturing the groups, but the tokenizer just throws that information away. In the string parsing example the key elements needed (Python-style string flags and the contents of the quoted string) would need to be re-extracted in the parser.
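A minimal sketch of the "quoted string" case described above, written against the standard library `re` module rather than rply itself. The `Token` class, the `lex_string` helper, and the `STRING_RE` pattern are hypothetical names chosen for illustration; they are not rply's actual API or the code in this pull request.

```python
import re

# Hypothetical rule: optional Python-style prefix flags, then a
# double-quoted body that may contain backslash escapes.
STRING_RE = re.compile(r'([rRbBuU]*)"((?:[^"\\]|\\.)*)"')


class Token(object):
    """Illustrative token type; NOT rply's real Token class."""

    def __init__(self, name, value, groups):
        self.name = name      # token type, e.g. "STRING"
        self.value = value    # full matched text, quotes included
        self.groups = groups  # captured groups kept from the lexer's match


def lex_string(source, pos=0):
    """Match one string literal at pos and keep the regex groups."""
    match = STRING_RE.match(source, pos)
    if match is None:
        return None
    # The regex already separated the flags from the body, so store
    # them instead of discarding them and re-parsing later.
    return Token("STRING", match.group(0), list(match.groups()))


token = lex_string('r"abc"')
flags, body = token.groups
```

With the groups stored on the token, a parser production could read `flags` and `body` directly instead of slicing the quotes off `token.value` a second time.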

alex commented 10 years ago

Lists need to be homogeneous internally; tuples are allowed to be heterogeneous, but must always be the same length (and have the same types at the same positions). If they're all strings, maybe using lists here makes sense?
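One way to read the list suggestion, sketched in plain CPython: always hand back a list of strings, whatever the number of groups. The helper name `groups_as_list` is hypothetical, and using an empty list and empty strings in place of None is an assumption made for illustration. This has not been run through the RPython translator.

```python
import re


def groups_as_list(match):
    """Return a match's captured groups as a list containing only str."""
    # CPython reports an optional group that did not participate as
    # None; replace it with "" so every element has the same type.
    return [g if g is not None else "" for g in match.groups()]


no_groups = groups_as_list(re.match(r"abc", "abc"))
optional_group = groups_as_list(re.match(r"(a)?(b)", "b"))
both_groups = groups_as_list(re.match(r"(a)?(b)", "ab"))
```

A rule without groups then yields an empty list, so callers never have to branch on None.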

On Mon, May 5, 2014 at 10:32 AM, Alice Zoë Bevan–McGregor wrote:

I've been investigating the translation failure; it was my understanding that container types (lists, tuples, etc.) must be type-homogeneous internally (groups are always tuples of strings) and that None was an allowed exception to this (as there may be no groups, None is a possible value instead of a tuple). I may be missing something, though, as I'm very new to RPython.


amcgregor commented 10 years ago

Indeed, lists would make more sense now that I know tuples are even weirder than I expected. ;) Let me patch and see if this fixes the test failure locally.

amcgregor commented 10 years ago

So, I've converted the regex group storage to use a list; however, this has not corrected the somewhat mystifying translation error I'm getting:

E           AnnotatorError: 
E           
E           signature mismatch: __init__() takes exactly 4 arguments (3 given)
E           
E           
E           Occurred processing the following simple_call:
E                 (AttributeError getting at the binding!)
E               v3 = simple_call(v0, v1, v2)
E           
E           In <FunctionGraph of (rply.lexer:34)LexerStream.next at 0x10a3cb088>:
E           Happened at file /Users/amcgregor/Documents/Clueless/tmp/rply/rply/lexer.py line 43
E           
E           ==>             match = rule.matches(self.s, self.idx)
E                           if match:
E           
E           Known variable annotations:
E            v0 = SomeBuiltin(analyser=<rpython.tool.descriptor.InstanceMethod object at 0x000000010a4b44f0>, methodname='matches', s_self=SomeRule())
E            v1 = SomeString(no_nul=True)
E            v2 = SomeInteger(const=0, knowntype=int, nonneg=True, unsigned=False)

rule.matches() isn't an __init__ call. :/

alex commented 10 years ago

I think the answer is that the code in the "if rpython" section at the bottom of lexergenerator.py needs to be expanded to add the additional details on Matches. I haven't investigated exactly what needs adding, though (I'm travelling ATM; will be more available from Wednesday on).