Jdharden / google-refine

Automatically exported from code.google.com/p/google-refine

jython re expression doesn't work #42

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I'm trying to transform "BWV 1 — Wie schön leuchtet der Morgenstern, BWV 1" in a cell to "Wie schön leuchtet der Morgenstern" using the following Jython function:

import re
v = cell["value"]
g = re.search(r"""— (.*),\s*BWV""",v)
return g.group(1)

which, alas, returns null.

However, in Jython 2.5.1, the following code works:

# -*- coding: utf-8 -*-

import re
#v = cell["value"]
v = "BWV 1 — Wie schön leuchtet der Morgenstern, BWV 1"
g = re.search(r"""— (.*),\s*BWV""",v)
print g.group(1)

I'm using Gridworks (GW) version 1.0.1-r732.

Original issue reported on code.google.com by raymond....@gmail.com on 16 May 2010 at 9:16

GoogleCodeExporter commented 9 years ago
I've managed to narrow it down to the special hyphen character (the em dash) in the regex, but I'm not yet sure why that causes the code to fail. The actual exception thrown is this:

Traceback (most recent call last):
  File "<string>", line 5, in ___temp___
  File "/home/vishal/Workspace/metaweb/freebase-gridworks-read-only/lib/jython/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/home/vishal/Workspace/metaweb/freebase-gridworks-read-only/lib/jython/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

I'm still investigating.

Original comment by rawl...@gmail.com on 17 May 2010 at 3:05

GoogleCodeExporter commented 9 years ago
Raymond, I found that this works:

import re
g = re.search(ur"\u2014 (.*),\s*BWV", value)
return g.group(1)

Could you check? Note the Unicode escape as well as the ur prefix. Unfortunately, I don't know a quick way to encode Unicode characters within Jython code without writing a parser for it, so for now you'd have to do the encoding yourself. I did this by first evaluating the GEL expression "—".unicode(), which gives 8212, and then converting that from decimal to hex (2014).

Original comment by dfhu...@gmail.com on 17 May 2010 at 4:51
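
For anyone following the workaround above, here is a minimal sketch of the two steps it describes: turning the decimal code point 8212 into a \u escape, then using that escape in the pattern. It is written as standalone Python rather than as a Gridworks expression, so it uses a plain variable in place of the cell value, and the helper name extract_title is hypothetical. It also guards against a failed match, which the snippets in this thread do not.

```python
import re

# Code point reported by the GEL unicode() call in the comment above.
EM_DASH_CODE_POINT = 8212

# Decimal to hex: the four-digit escape to type into a ur"..." pattern.
escape = "\\u%04x" % EM_DASH_CODE_POINT
print(escape)  # \u2014


def extract_title(value):
    """Hypothetical helper: return the text between the dash and ', BWV'."""
    # u"\u2014" is the special dash; the doubled backslash keeps \s for the regex.
    match = re.search(u"\u2014 (.*),\\s*BWV", value)
    # Return None instead of raising AttributeError when nothing matches.
    if match is None:
        return None
    return match.group(1)


# Sample cell value from the original report, written with escapes only.
sample = u"BWV 1 \u2014 Wie sch\u00f6n leuchtet der Morgenstern, BWV 1"
title = extract_title(sample)
```

Inside a Gridworks Jython expression the body of the helper would presumably be used directly with value and a final return, as in the snippet above; that adaptation is an assumption and has not been checked against this Gridworks version.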