arshaw / scrapemark

Super-convenient web scraping in Python

ValueError in _substitute_entity() substituting '#x201C' like strings #11

Open arshaw opened 13 years ago

arshaw commented 13 years ago

Reported by uglydog....@gmail.com, Oct 29, 2010

What steps will reproduce the problem?

  1. In _substitute_entity(), m.group(0) == '#x201C'.
  2. unichr(int(ent)), where ent == 'x201C', raises ValueError.
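
The failure in step 2 can be reproduced without scrapemark. This is a minimal sketch; the helper name decimal_parse_fails is hypothetical and is not part of the library.

```python
def decimal_parse_fails(ent):
    """Return True if int() cannot read `ent` as a base-10 number."""
    try:
        int(ent)
    except ValueError:
        return True
    return False


# 'x201C' is what remains of '#x201C' once the leading '#' is dropped.
print(decimal_parse_fails('x201C'))   # hex form: int() rejects it
print(decimal_parse_fails('8220'))    # decimal form of the same character
print(int('201C', 16) == 0x201C)      # parsing the digits as base 16 works
```
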

What is the expected output? What do you see instead?

unichr() needs the integer 0x201C, but int('x201C') raises ValueError before unichr() is ever called.

What version of the product are you using? On what operating system?

scrapemark-0.9-py2.5.egg, Python 2.6.4, Ubuntu 9.10 x64

Please provide any additional information below.

Adding this function:

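def my_int(s):
    # Decimal reference, e.g. '8220'.
    try:
        return int(s)
    except ValueError:
        pass
    # Bare hexadecimal digits, e.g. '201C'.
    try:
        return int(s, 16)
    except ValueError:
        pass
    # 'x'-prefixed hexadecimal, e.g. 'x201C' becomes int('0x201C', 16).
    if len(s) > 0 and s[0].lower() == 'x':
        try:
            return int('0' + s, 16)
        except ValueError:
            pass
    # Nothing parsed: fall back to 0.
    return 0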

and replacing:
  unichr(int(ent)) with unichr(my_int(ent))

seems to fix the problem.
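
An alternative to trying several parses in turn is to let the regular expression decide whether the reference is decimal or hexadecimal. The sketch below is not scrapemark's code: the names _NUMERIC_ENTITY_RE and substitute_numeric_entities are hypothetical, and it handles only numeric references, leaving named entities such as &amp; untouched. The unichr shim is an assumption so the same code runs on Python 2 and Python 3.

```python
import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a Unicode character

# Group 1 captures hex digits ('&#x201C;'), group 2 decimal digits ('&#8220;').
_NUMERIC_ENTITY_RE = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def substitute_numeric_entities(text):
    """Replace numeric character references in `text` with their characters."""
    def repl(m):
        hex_digits, dec_digits = m.group(1), m.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
        try:
            return unichr(code)
        except (ValueError, OverflowError):
            # Code point out of range: leave the reference as written.
            return m.group(0)
    return _NUMERIC_ENTITY_RE.sub(repl, text)


print(repr(substitute_numeric_entities(u'&#x201C;quoted&#8221;')))
```

Unlike my_int(), this leaves an unparseable or out-of-range reference in place instead of substituting the character with code 0.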

quink commented 13 years ago

Probably fixed in #9.