dragnet-org / dragnet

Just the facts -- web page content extraction
MIT License

malformed encoding leads to `BlockifyError` #92

Open bdewilde opened 5 years ago

bdewilde commented 5 years ago

I'm occasionally getting `BlockifyError`s caused by malformed encoding values set here. Here's the tail of the traceback:

Traceback (most recent call last):
    File "dragnet/blocks.pyx", line 846, in dragnet.blocks.Blockifier.blockify
    File "src/lxml/parser.pxi", line 1689, in lxml.etree.HTMLParser.__init__ 
    File "src/lxml/parser.pxi", line 823, in lxml.etree._BaseParser.__init__
    LookupError: unknown encoding: 'b'UTF-8,''

Looks like there's a trailing comma on "UTF-8", plus the bytes value has been incorrectly converted into a unicode string, possibly by calling `str(b"UTF-8")` instead of `b"UTF-8".decode("utf-8")`.
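For anyone trying to reproduce the string in the traceback, here is a minimal sketch of how it could arise, plus one possible way to guard against it. The input value `b"UTF-8,"` is an assumption about what the page declared, and `clean_encoding` is a hypothetical helper written for illustration; it is not part of dragnet or lxml.

```python
import codecs

# Assumed raw value, as it might be scraped from a page's charset declaration.
raw = b"UTF-8,"

# str() on a bytes object returns its repr, which matches the name in the traceback.
print(str(raw))             # b'UTF-8,'
# Decoding returns the actual text, still carrying the stray comma.
print(raw.decode("ascii"))  # UTF-8,


def clean_encoding(value, default="utf-8"):
    """Hypothetical helper: normalize a declared encoding name.

    Decodes bytes, trims whitespace and stray punctuation, and falls back
    to `default` when Python has no codec registered under that name.
    """
    if isinstance(value, bytes):
        value = value.decode("ascii", errors="ignore")
    candidate = value.strip().strip(",;\"' ")
    try:
        return codecs.lookup(candidate).name
    except LookupError:
        return default


print(clean_encoding(raw))  # utf-8
```

Falling back to a default is only one option; skipping the page, or letting the parser detect the encoding itself, may be a better fit depending on how much messy input is expected.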

I wasn't able to track down a relevant bug in blocks.pyx, so maybe this is just messy web data and 🤷‍♂ . Just posting in case somebody knows what's up!

pakelley commented 5 years ago

Huh, that's pretty odd. Do you have an example page you can share that causes this? At a glance, I don't see anything that would cause it, but I'd be curious to poke around and see what's up with that.