Forever-Young / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

Allow duplicate names of groups #87

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,
Currently, duplicate group names are not allowed. For example, this code raises 
an exception because group "a" is defined twice:

>>> regex.match(r'(?<a>here)? or (?<a>here)?', "here or here")
error: duplicate group

I suspect this design is a legacy of the standard 're' module, which didn't 
allow a group to hold multiple values, so it was natural to reject duplicate 
group names too. But the 'regex' module can capture repeated values, so it 
would be natural to also accept duplicate group names and merge the values 
extracted from all same-named groups into one list.
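
For reference, a minimal sketch of the standard-library behaviour mentioned 
above. It only checks whether 're' accepts a pattern; the helper name 
`compiles` and the group names `a1`/`a2` are made up for this illustration, and 
're' spells named groups as (?P<name>...).

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# The same shape as the pattern in this report, with "a" defined twice.
duplicate_ok = compiles(r'(?P<a>here)? or (?P<a>here)?')
# A variant with distinct (hypothetical) names for the two groups.
distinct_ok = compiles(r'(?P<a1>here)? or (?P<a2>here)?')

print(duplicate_ok, distinct_ok)
```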

This enhancement would allow parsing loose formats, where a given value may 
appear in any of several different places in the text and the regex must have a 
group in each of these places. Usually, we would expect only one place to match 
(the groups are optional, as in the regex above), but we can't say in advance 
which one and, for convenience, we'd like to use the same name for all of them 
to avoid manually merging several groups afterwards. In other use cases, more 
than one group may match and we want to extract all the matched values as a 
single list.

I think this enhancement would fit very well with the concept of repeated 
captures that's already present in 'regex'.

Do any other regex implementations have something like this?
I don't know.

Original issue reported on code.google.com by mwojn...@gmail.com on 23 Jan 2013 at 6:41

GoogleCodeExporter commented 9 years ago
Wouldn't the formats be alternatives, e.g. "(?<found>this)|(?<found>that)"?

The possibility is already covered; the groups are mutually exclusive.

Original comment by re...@mrabarnett.plus.com on 23 Jan 2013 at 7:02

GoogleCodeExporter commented 9 years ago
Alternation works well for different value patterns, but not for different 
locations. Example: web scraping a complex page where the same value (say, the 
price of a product) can appear in 3 different places, depending on the type of 
product:

"(?<price1>\d+)? some-stuff (?<price2>\d+)? other-stuff (?<price3>\d+)?"

Because these are different *locations* in the text, not different patterns, and 
the static parts ("some-stuff") must be present in the middle to position the 
groups correctly within the text, alternation can't be used here (or would be 
very awkward, with the static parts copy-pasted several times). Besides, we want 
to extract other properties too, not only the price, and want to use a single 
regex for all of this, without making 3 variants of the entire regex and without 
manually labelling the fields 'price1', 'price2', 'price3' and then merging them.
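
The manual approach described above, with separately numbered groups merged 
afterwards, might look like the following sketch. It uses the standard 're' 
module with its (?P<name>...) syntax; the helper name `extract_prices` and the 
sample inputs are hypothetical.

```python
import re

# One optional, separately named group per possible location of the price.
PATTERN = re.compile(
    r'(?P<price1>\d+)? some-stuff (?P<price2>\d+)? other-stuff (?P<price3>\d+)?'
)

PRICE_GROUPS = ('price1', 'price2', 'price3')


def extract_prices(text: str) -> list[str]:
    """Merge whatever the three price groups captured into one list."""
    match = PATTERN.match(text)
    if match is None:
        return []
    # Groups that did not participate in the match are reported as None.
    return [match.group(name) for name in PRICE_GROUPS
            if match.group(name) is not None]


print(extract_prices(' some-stuff 42 other-stuff '))
print(extract_prices('10 some-stuff  other-stuff 30'))
```

This merging step is the bookkeeping that a single shared group name would 
make unnecessary.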

Original comment by mwojn...@gmail.com on 24 Jan 2013 at 12:11

GoogleCodeExporter commented 9 years ago
The regex module tries to be compatible with the re module, whose documentation 
says: """Group names must be valid Python identifiers, and each group name must 
be defined only once within a regular expression""".

The regex module relaxes that a little by allowing them multiple times if 
they're mutually exclusive, but I'm not sure whether they should be allowed in 
the version 0 ('compatible') behaviour.

Perhaps only in version 1 ('enhanced') behaviour?

I'll need to think about it and see whether it would have any adverse 
side-effects.

For the record, Perl allows it.

Original comment by re...@mrabarnett.plus.com on 24 Jan 2013 at 2:36

GoogleCodeExporter commented 9 years ago
OK, thanks, for my needs V1 would be fine.
In case you consider adding it in V0, note that although this change is not 
strictly compatible with 're', it does NOT break any existing code, because it 
only relaxes the constraints on valid patterns: any pattern valid in 're' would 
still be valid in 'regex' and behave *exactly* the same, with no change in the 
result; only some additional patterns would now be accepted.

Original comment by mwojn...@gmail.com on 24 Jan 2013 at 11:35

GoogleCodeExporter commented 9 years ago
It's true that it wouldn't break any existing code, so there'd be no harm in 
having it work in V0 too.

Original comment by re...@mrabarnett.plus.com on 24 Jan 2013 at 2:05

GoogleCodeExporter commented 9 years ago
Duplicate group names are allowed in regex 0.1.20130124.
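
A small sketch of how the merged captures can be inspected, assuming the 
third-party 'regex' package is installed in a release that accepts duplicate 
names. The pattern and sample strings are made up for this illustration; 
captures() and group() are existing methods of regex match objects.

```python
import regex  # third-party package (pip install regex), not the stdlib re

# Two optional groups that share the name "a", as in the original report.
PATTERN = regex.compile(r'(?<a>\w+)? or (?<a>\w+)?')

both = PATTERN.match("first or second")   # both groups participate
only_second = PATTERN.match(" or second")  # only the second group participates

both_captures = both.captures("a")           # all captures made under "a"
both_last = both.group("a")                  # the most recent capture
second_captures = only_second.captures("a")

print(both_captures, both_last, second_captures)
```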

Original comment by re...@mrabarnett.plus.com on 24 Jan 2013 at 8:31

GoogleCodeExporter commented 9 years ago
There is a minor issue when a group is nested inside another group with the same 
name: the inner group overrides the value matched by the outer group, so the 
result contains 2 copies of the inner value. For example:

>>> match = regex.match(r'(?<x>a(?<x>b))', "ab")
>>> match.capturesdict()
{'x': ['b', 'b']}

Original comment by mwojn...@gmail.com on 26 Jan 2013 at 12:09

GoogleCodeExporter commented 9 years ago
Fixed in regex 0.1.20130126.

Original comment by re...@mrabarnett.plus.com on 26 Jan 2013 at 11:39