freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Invalid XML characters break docket parsers #348

Open cgdeboer-toptal opened 4 years ago

cgdeboer-toptal commented 4 years ago

Summary

When a page on PACER (or elsewhere) contains characters outside the set of valid XML characters, lxml's html5 parser will fail.

This is not hypothetical: I was scraping a docket at the Ohio Northern Bankruptcy Court (ohnb), and docketreport.parse() failed because of invalid XML characters in the response.

Tasks

Questions

mlissner commented 4 years ago

Nice find. We've seen this before in other areas, so it's not surprising to see it here too. I did some performance testing on this a while back:

https://stackoverflow.com/a/25920392/64911

The code that's in CL to handle this is:

import re


def filter_invalid_XML_chars(input):
    """XML allows:

       Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    This strips out everything else.

    See: http://stackoverflow.com/a/25920392/64911
    """
    if isinstance(input, str):
        # Only do str, unicode, etc.
        return re.sub(
            "[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD"
            "\U00010000-\U0010FFFF]+",
            "",
            input,
        )
    else:
        return input

I'd definitely welcome a PR for this.
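
As a rough illustration of how a filter like the one above might be applied before parsing, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml/html5lib, so the exception type differs from what juriscraper would raise; the payload string and the lowercase function name are hypothetical, not taken from CL or juriscraper.

```python
import re
import xml.etree.ElementTree as ET

# Matches runs of characters outside the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def filter_invalid_xml_chars(text):
    """Strip characters that XML 1.0 does not allow; pass non-str through."""
    if isinstance(text, str):
        return _INVALID_XML_CHARS.sub("", text)
    return text


# Hypothetical payload: a docket cell containing the control character \x03.
payload = "<td>Order granting\x03 motion</td>"

try:
    ET.fromstring(payload)
    parsed_raw = True
except ET.ParseError:
    # The stdlib parser (used here only as a stand-in) rejects the character.
    parsed_raw = False

# After filtering, the same payload parses cleanly.
cleaned = filter_invalid_xml_chars(payload)
element = ET.fromstring(cleaned)
print(parsed_raw, repr(element.text))
```

Whether silently dropping the characters is the right behavior for dockets is a separate question; this only shows the mechanics.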

cgdeboer-toptal commented 4 years ago

> https://stackoverflow.com/a/25920392/64911

I did some reading on this earlier today before reading your post, and stumbled upon the same SO post... I should have looked more closely at the author. Will work on this over the weekend.

cgdeboer-toptal commented 4 years ago

@mlissner thanks for the link. I've been doing a little digging on this and haven't found a solution that works quite yet. I've got a sample text file with the bad payload from the docket. I'm examining the way html5lib parses characters.

I'm still working through this.

Side note: I see the code in CL, but I'm not seeing where it is used anywhere in that repo.

mlissner commented 4 years ago

Weird, yeah, looks like it's not used anymore. I suppose we could delete it since it's easy to find again on StackOverflow.

Do you need help with your progress? Sounds like you're just checking in, but if you're frustrated maybe somebody can take a look.

johnhawkinson commented 4 years ago

Not just to be contrarian, but I have long been convinced the StackOverflow post does not offer the right solution. Is there a test case available?

cgdeboer-toptal commented 4 years ago

That's sort of what I'm finding @johnhawkinson. I'll post a PR with the failing test case.

cgdeboer-toptal commented 4 years ago

The traceback on this goes back to a character parsed by html5lib, where it attempts to insert a disallowed character into the tree.

{'type': 1, 'data': '\x03'}

PR: https://github.com/freelawproject/juriscraper/pull/349
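
For anyone building a test case from a saved payload, a small helper along these lines could locate the offending characters. This is only a sketch: the function name and the sample string are hypothetical, and it checks against the XML 1.0 Char production rather than whatever html5lib or lxml enforce internally.

```python
import re

# Any single character outside the XML 1.0 Char production.
INVALID_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, repr) pairs for each character XML 1.0 disallows."""
    return [(m.start(), repr(m.group())) for m in INVALID_XML_CHAR.finditer(text)]


# Hypothetical snippet standing in for the saved docket payload.
sample = "Docket text\x03 with a stray control character\x00."
hits = find_invalid_xml_chars(sample)
print(hits)
```

Reporting offsets rather than stripping makes it easier to see which part of the docket the bad bytes came from.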

mlissner commented 4 years ago

> Not just to be contrarian, but I have long been convinced the StackOverflow post does not offer the right solution. Is there a test case available?

Can you elaborate?