Regexp / Line Aware Improvements

GoogleCodeExporter commented 9 years ago

I'm using this issue to collect together various possible changes related to
regexps and line-aware parsing.

I don't promise to do everything, of course, but at least I won't miss
things by accident if they are listed here.

Original issue reported on code.google.com by acooke....@gmail.com on 23 Nov 2009 at 11:46

GoogleCodeExporter commented 9 years ago

Line aware source doesn't support lines(!) -
http://groups.google.com/group/lepl/browse_thread/thread/a5a813d10f979e14

This looks like a no-brainer - the data until the next EOL should be passed to the
Python regexp.

Original comment by acooke....@gmail.com on 23 Nov 2009 at 11:48
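
A minimal sketch of the idea in plain Python, not LEPL's actual code. The helper
name match_within_line is hypothetical; it only shows how a standard re match can
be limited to the text before the next newline (treating end of input as the end
of the last line).

```python
import re

def match_within_line(pattern, text, pos=0):
    """Match `pattern` at `pos`, seeing only the text up to the next newline."""
    end = text.find('\n', pos)
    if end == -1:
        end = len(text)  # no newline left: the line ends at the end of the input
    # Pattern.match(string, pos, endpos) behaves as if the string stopped at endpos.
    return re.compile(pattern).match(text, pos, end)

# [^:]+ would normally run across the newline; here it stops at the end of the line.
m = match_within_line(r'[^:]+', 'first line\nsecond line')
print(m.group(0))  # first line
```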

GoogleCodeExporter commented 9 years ago

User-defined regexps for line-aware parsing should automatically exclude [^$\n\r]
from "." (and from implicit open ranges via [^....]).

Original comment by acooke....@gmail.com on 23 Nov 2009 at 11:50
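
To illustrate why open ranges need this, here is how Python's own re module
behaves (LEPL's regexp engine is separate, so this is only an analogy). The
helper exclude_line_breaks is hypothetical and only handles simple classes of
the form [^...].

```python
import re

# A negated class matches a newline unless the newline is listed explicitly.
print(re.match(r'[^a]', '\n') is not None)      # True
print(re.match(r'[^a\n\r]', '\n') is not None)  # False

def exclude_line_breaks(negated_class):
    """Hypothetical helper: add \\n and \\r to a simple class like [^abc]."""
    if not (negated_class.startswith('[^') and negated_class.endswith(']')):
        raise ValueError('expected a class of the form [^...]')
    return negated_class[:-1] + r'\n\r]'

print(exclude_line_breaks('[^a]'))  # [^a\n\r]
```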

GoogleCodeExporter commented 9 years ago

Better syntax for ^ and $.  These are similar-but-not-quite-identical-to the standard
regexp notation.  I think it's best to dump them and go with lepl-specific syntax.  A
possibility is (*....) which is similar to Python's (?....).  This could also be used
for labelling, and could be parsed, making the str methods for regexp classes
self-consistent.

Original comment by acooke....@gmail.com on 23 Nov 2009 at 11:52

GoogleCodeExporter commented 9 years ago

Eos (Eof) should be considered EOL - a line should end at the end of the file/input,
even if there's no newline (or whatever is currently used).

Original comment by acooke....@gmail.com on 23 Nov 2009 at 11:53

GoogleCodeExporter commented 9 years ago

Possible bug - seems to be something odd about "*" in this post -
http://groups.google.com/group/lepl/msg/15b78d0191d5f5b5?dmode=source

Original comment by acooke....@gmail.com on 23 Nov 2009 at 11:54

GoogleCodeExporter commented 9 years ago

I am starting to think this may be quite difficult (it seems to amount to emulating a
broken legacy implementation of regexps!), but it would be nice if we supported
Perl/Python's non-greedy alternatives in regexps.

For example (a|ac)c applied to "acc" should match "ac".

Need to be careful here - I assume this means some matches with nested alternatives
will fail.  Check exactly how Python/Perl behave.

Original comment by acooke....@gmail.com on 23 Nov 2009 at 11:57
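
For reference, this is what Python's re module does with that example. It shows
only Python's behaviour (Perl is not checked here) and says nothing about how
LEPL would emulate it.

```python
import re

# Alternatives are tried left to right; "a" is taken first and the trailing
# "c" then matches, so the overall match is "ac" rather than the longest "acc".
m = re.match(r'(a|ac)c', 'acc')
print(m.group(0), m.group(1))  # ac a

# If the whole input must match, Python backtracks into the second alternative.
m = re.fullmatch(r'(a|ac)c', 'acc')
print(m.group(0), m.group(1))  # acc ac
```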

GoogleCodeExporter commented 9 years ago

Support for non-token line-aware parsing.  There's an example in the docs, but it
won't work with Extend (ie across lines).  This may not be reasonably possible, in
which case look for alternative support (eg matching line break explicitly).

Original comment by acooke....@gmail.com on 23 Nov 2009 at 9:26

GoogleCodeExporter commented 9 years ago

Line-aware parsing + Empty() bug:
-------------------------------------------------------
from lepl import *
introduce = ~Token(':')
word = Token(Word(Lower()))
statement = Delayed()
simple = BLine(word[:])
empty = BLine(Empty())
block = BLine(word[:] & introduce) & Block(statement[:])
statement += (simple | empty | block) > list
parser = statement[:].string_parser(LineAwareConfiguration(block_policy=2))

result = parser('worda\nwordb:\n  wordc:\n    wordd')
# as expected, we got [[u'worda'], [u'wordb', [u'wordc', [u'wordd']]]]
# but
result = parser('worda\nwordb:\n\n  wordc:\n    wordd')
# returns unexpected [[u'worda'], [u'wordb'], []]

Original comment by aachu...@gmail.com on 27 Nov 2009 at 3:06

GoogleCodeExporter commented 9 years ago

More info on the above (Empty()).

This is actually normal behaviour.  What's happening is that blocks do not continue
over empty lines.  So the input data do not match the grammar.  If the lines after
the blank line had no space to the left, then they should match (as a new, zero
indented block).

However, we do clearly need some way to include blank lines in blocks - this was also
raised on the mailing list.  In fact we probably want to be able to support three
different cases:

 - An empty line means you must start again at the left (as now)
 - An empty line means that you continue with the current indent (in this case, how
   do you end a block?)
 - Both the above, depending on context (ie choose whichever fits the indent of the
   line after the blank)

And related to this, what about comment blocks that might have an arbitrary indent?

Original comment by acooke....@gmail.com on 27 Nov 2009 at 3:26
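
A rough sketch of the three cases in plain Python, independent of LEPL. The
function blank_line_indents and the policy names 'reset', 'continue' and
'context' are made up for illustration, and the 'context' rule is just one
possible reading of "whichever fits the indent of the line after the blank".

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(' '))

def blank_line_indents(lines, policy):
    """Return one indent per line; `policy` decides what a blank line gets.

    'reset'    - a blank line counts as indent 0 (start again at the left)
    'continue' - a blank line keeps the indent of the previous line
    'context'  - 'continue' if the next non-blank line is indented, else 'reset'
    """
    result = []
    for i, line in enumerate(lines):
        if line.strip():
            result.append(indent_of(line))
            continue
        previous = result[-1] if result else 0
        # Indent of the next non-blank line, or 0 if there is none.
        following = next((indent_of(l) for l in lines[i + 1:] if l.strip()), 0)
        if policy == 'reset':
            result.append(0)
        elif policy == 'continue':
            result.append(previous)
        elif policy == 'context':
            result.append(previous if following > 0 else 0)
        else:
            raise ValueError('unknown policy: %r' % policy)
    return result

lines = 'a:\n  b\n\n  c\nd'.split('\n')
print(blank_line_indents(lines, 'reset'))     # [0, 2, 0, 2, 0]
print(blank_line_indents(lines, 'continue'))  # [0, 2, 2, 2, 0]
```

With 'continue' alone a block can only end at a non-blank line with a smaller
indent, which is the question raised in the second case above.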

GoogleCodeExporter commented 9 years ago

More on the above - it may simply be a case of documenting how to use Line() rather
than BLine() (see Andrey's email around 28 Nov).

Original comment by acooke....@gmail.com on 28 Nov 2009 at 1:30

GoogleCodeExporter commented 9 years ago

OK, I've fixed the majority of these in 3.3.3.

What I haven't done is (1) emulate the non-greedy choice in regexp or (2) provide a
better way to do offside parsing without tokens (I do now warn more clearly in the
manual that tokens are necessary).

Original comment by acooke....@gmail.com on 10 Dec 2009 at 12:17

Changed state: Fixed

brianray / lepl

Regexp / Line Aware Improvements #15