freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License
379 stars · 111 forks

Consider switching to html5parser for all parsing #97

Open mlissner opened 9 years ago

mlissner commented 9 years ago

lxml has an html5parser that can handle some of the inanities that bad HTML pages present.

For example, this page:

http://media.ca11.uscourts.gov/opinions/unpub/logname.php?begin=9720&num=485&numBegin=1

Has stray less-than signs in some of the docket numbers, which produces a badly broken HTML tree. I was able to work around this in ca11_u in a way that fixes the problem and even preserves the API for the scrapers.

The hard part of this is that the html5parser's fromstring function returns _Element objects while the html.fromstring function returns HtmlElements. I was able to get around this in ca11_u with something like:

from lxml.html import fromstring, tostring
from lxml.html import html5parser

# Parse the malformed markup with the lenient HTML5 parser (returns _Element)...
e = html5parser.fromstring(text)
# ...then serialize and re-parse to get back an HtmlElement with the
# familiar lxml.html API.
html_element = fromstring(tostring(e))

Though that involves an extra serialization and an extra parse, which is wasteful and obscures what is actually going on.
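To make the round-trip concrete, here is a self-contained sketch of the same workaround. The input string is a made-up example of the kind of malformed markup described above (a stray `<` inside a docket number); it is not the actual ca11_u page. Requires lxml and html5lib installed.

```python
from lxml.html import fromstring, tostring
from lxml.html import html5parser

# Hypothetical malformed input: a stray "<" inside a docket number.
# HTML5 tokenization treats "<" followed by a non-letter as literal text,
# so html5parser recovers gracefully where lxml's HTML parser mangles the tree.
text = "<html><body><p>Docket No. 12-345 < 67</p></body></html>"

# html5parser tolerates the stray "<" but returns a plain _Element
e = html5parser.fromstring(text)

# Serialize and re-parse to get an HtmlElement with the usual lxml.html API
html_element = fromstring(tostring(e))
print(type(html_element).__name__)
print(html_element.text_content())
```

The cost is exactly what the comment above describes: one extra serialization plus one extra parse per page.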

mlissner commented 9 years ago

See also: http://stackoverflow.com/questions/33134590/convert-lxml-element-to-htmlelement

arderyp commented 8 years ago

Would this be as simple as changing the _make_html_tree() method in AbstractSite, or would doing so require auditing the scrapers for resulting breakages? In other words, are the interfaces/functions/methods of lxml.html.fromstring(text) objects the same as those of lxml.html.html5parser.fromstring(text) objects?

mlissner commented 8 years ago

I think the interface is mostly the same, but they generate different trees, so the parsing and XPaths would be different. I think it would mostly work to just swap them one day, but I'm about 90% sure some things would quietly break. I think it's something we'd want to do incrementally.
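A sketch of one concrete way the trees differ (the markup here is an invented minimal example, not from a real scraper): html5parser puts elements in the XHTML namespace by default, so a bare //table XPath that works against an lxml.html tree matches nothing against an html5parser tree unless you qualify it with the namespace. Requires lxml and html5lib installed.

```python
import lxml.html
from lxml.html import html5parser

text = "<html><body><table><tr><td>one</td></tr></table></body></html>"

# Classic lxml.html tree: plain tag names, bare XPaths work.
classic = lxml.html.fromstring(text)
print(len(classic.xpath('//table')))

# html5parser tree: tags live in the XHTML namespace by default,
# e.g. {http://www.w3.org/1999/xhtml}table, so the bare XPath is empty.
h5 = html5parser.fromstring(text)
print(len(h5.xpath('//table')))

# The same query succeeds once the namespace is declared.
ns = {'h': 'http://www.w3.org/1999/xhtml'}
print(len(h5.xpath('//h:table', namespaces=ns)))
```

This is the kind of quiet breakage mentioned above: the swapped-in parser raises no error, the XPaths just silently return nothing.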