Wouldn't the formats be alternatives, e.g. "(?<found>this)|(?<found>that)"?
The possibility is already covered; the groups are mutually exclusive.
Original comment by re...@mrabarnett.plus.com
on 23 Jan 2013 at 7:02
Alternation works well for different value patterns, but not for different
locations. Example: web scraping a complex page where the same value (say, the
price of a product) can appear in 3 different places, depending on the type of
product:
"(?<price1>\d+)? some-stuff (?<price2>\d+)? other-stuff (?<price3>\d+)?"
Because these are different *locations* in the text, not different patterns, and
the static parts ("some-stuff") must be present in between to position the
groups correctly within the whole text, alternation can't be used here (or would
be very awkward, with the static parts copy-pasted several times). Besides, we
want to extract other properties too, not only the price, and want to use a
single regex for all of this, without making 3 variants of the entire regex and
without manually labelling fields 'price1', 'price2', 'price3' and then merging
them.
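For illustration, the manual workaround described above might look roughly like this with the standard 're' module. This is only a sketch: the page layout, the sample strings and the helper name extract_price are made up for the example, and 're' needs the (?P<name>...) spelling for named groups.

```python
import re

# Hypothetical layout: the price can sit in any one of three slots, so each
# slot needs its own uniquely numbered group name.
PATTERN = re.compile(
    r"(?P<price1>\d+)? some-stuff (?P<price2>\d+)? other-stuff (?P<price3>\d+)?"
)


def extract_price(text):
    """Return the value of the first price slot that matched, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Manual merging step: check every numbered group in turn.
    for name in ("price1", "price2", "price3"):
        value = match.group(name)
        if value is not None:
            return value
    return None
```

Every additional field that can appear in several places needs the same numbering and merging boilerplate, which is what a single repeated group name would avoid.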
Original comment by mwojn...@gmail.com
on 24 Jan 2013 at 12:11
The regex module tries to be compatible with the re module, whose documentation
says: """Group names must be valid Python identifiers, and each group name must
be defined only once within a regular expression""".
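A quick way to see that restriction in the standard 're' module (a minimal sketch; the helper name compiles_in_re is made up for the example, and 're' spells named groups as (?P<name>...)):

```python
import re


def compiles_in_re(pattern):
    """Return True if the standard 're' module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # 're' raises an error when a group name is defined twice.
        return False
    return True


# Same name in both branches of an alternation: rejected by 're'.
duplicate_ok = compiles_in_re(r"(?P<found>this)|(?P<found>that)")
# Distinct names: accepted.
distinct_ok = compiles_in_re(r"(?P<a>this)|(?P<b>that)")
```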
The regex module relaxes that a little by allowing them multiple times if
they're mutually exclusive, but I'm not sure whether they should be allowed in
the version 0 ('compatible') behaviour.
Perhaps only in version 1 ('enhanced') behaviour?
I'll need to think about it and see whether it would have any adverse
side-effects.
For the record, Perl allows it.
Original comment by re...@mrabarnett.plus.com
on 24 Jan 2013 at 2:36
OK, thanks, for my needs V1 would be fine.
In case you consider adding it to V0, note that although this change is not
strictly compatible with 're', it does NOT break any existing code, because it
only relaxes the constraints on valid patterns: any pattern valid in 're' would
still be valid in 'regex' and behave *exactly* the same, with no change in
results; only some additional patterns would now be accepted.
Original comment by mwojn...@gmail.com
on 24 Jan 2013 at 11:35
It's true that it wouldn't break any existing code, so there'd be no harm in
having it work in V0 too.
Original comment by re...@mrabarnett.plus.com
on 24 Jan 2013 at 2:05
Duplicate group names are allowed in regex 0.1.20130124.
Original comment by re...@mrabarnett.plus.com
on 24 Jan 2013 at 8:31
There is a minor issue when the same group name is nested: the inner group
overrides the value matched by the outer group, and the result contains 2
copies of the same inner value. For example:
>>> match = regex.match(r'(?<x>a(?<x>b))', "ab")
>>> match.capturesdict()
{'x': ['b', 'b']}
Original comment by mwojn...@gmail.com
on 26 Jan 2013 at 12:09
Fixed in regex 0.1.20130126.
Original comment by re...@mrabarnett.plus.com
on 26 Jan 2013 at 11:39
Original issue reported on code.google.com by
mwojn...@gmail.com
on 23 Jan 2013 at 6:41