Open mlissner opened 9 years ago
Would this be as simple as changing the _make_html_tree() method in AbstractSite, or would doing so require auditing the scrapers for resulting breakages? In other words, are the interfaces/functions/methods of lxml.html.fromstring(text) objects the same as those of lxml.html.html5parser.fromstring(text) objects?
I think the interface is mostly the same, but they generate different trees, so the parsing and XPaths would be different. I think it would mostly work to just swap them one day, but I'm about 90% sure some things would quietly break. I think it's something we'd want to do incrementally.
lxml has an html5parser that can handle some of the inanities that bad HTML pages present. For example, this page:
http://media.ca11.uscourts.gov/opinions/unpub/logname.php?begin=9720&num=485&numBegin=1
Has random less than signs in some of the docket numbers, which results in a terrible HTML tree. I was able to solve that in ca11_u, which fixes the problem and even preserves the API for the scrapers.
The hard part of this is that the html5parser's fromstring function returns _Element objects while the html.fromstring function returns HtmlElements. I was able to get around this in ca11_u by re-parsing the html5parser output, though that involves an extra parse and an extra serialization, all of which sucks and obscures.