freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Issues parsing 2nd/9th Circuit cases #2421

Open andrewbaker00 opened 1 year ago

andrewbaker00 commented 1 year ago

Hello, over the past few days I've run into an issue querying for 9th Circuit cases (the 2nd Circuit and 9th BAP are affected as well; other circuits are working just fine). It looks like a new redirect page is not being handled correctly.

Running this:

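    from juriscraper.pacer import AppellateDocketReport
    from juriscraper.pacer.http import PacerSession

    court_id = "9ca"
    docket_number = "22-650"
    session = PacerSession(username="<username>", password="<password>")
    session.login()
    report = AppellateDocketReport(court_id, session)

    report.query(docket_number)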

Results in this error:

Juriscraper will continue to run, and all logs will be sent to stderr.
2022-12-19 15:50:25,521 - INFO: Attempting PACER API login
2022-12-19 15:50:26,544 - INFO: New PACER session established.
2022-12-19 15:50:26,544 - INFO: Querying appellate docket report for docket number '22-650' with params {'servlet': 'CaseSummary.jsp', 'caseNum': '22-650', 'incDktEntries': 'Y', 'incPtyAty': 'Y', 'fullDocketReport': 'Y', 'actionType': 'Run+Docket+Report'}
2022-12-19 15:50:33,921 - INFO: Invalid/expired PACER session. Establishing new session.
2022-12-19 15:50:33,922 - INFO: Attempting PACER API login
2022-12-19 15:50:35,115 - INFO: New PACER session established.
Traceback (most recent call last):
  File "/Users/andrew/dev/docket/python/pacer/main.py", line 63, in <module>
    query_case(docket_number, court_id, outfile, get_docket_entries, get_parties, get_lower_court, date_start)
  File "/Users/andrew/dev/docket/python/pacer/main.py", line 33, in query_case
    data['metadata'] = report.metadata
  File "/usr/local/lib/python3.9/site-packages/juriscraper/pacer/appellate_docket.py", line 343, in metadata
    "case_name": self._get_case_name(),
  File "/usr/local/lib/python3.9/site-packages/juriscraper/pacer/appellate_docket.py", line 615, in _get_case_name
    case_name = self.tree.xpath(path)[0].text_content()
IndexError: list index out of range

If you inspect the response from PACER, it looks like this:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8" />
            </head>
    <body onload="document.forms[0].submit()">
        <noscript>
            <p>
                <strong>Note:</strong> Since your browser does not support JavaScript,
                you must press the Continue button once to proceed.
            </p>
        </noscript>

        <form action="https&#x3a;&#x2f;&#x2f;ca9-showdoc.azurewebsites.us&#x2f;Saml2&#x2f;Acs" method="post">
            <div>
                <input type="hidden" name="RelayState" value="vBGZJQKeTnZRiK67sW7q5XYU"/>                

                <input type="hidden" name="SAMLResponse" value="<REMOVED>"/>                
            </div>
            <noscript>
                <div>
                    <input type="submit" value="Continue"/>
                </div>
            </noscript>
        </form>
            </body>
</html>

I don't have much experience with juriscraper, so I'm wondering whether this happens anywhere else in PACER.

mlissner commented 1 year ago

I don't think this is anything we changed, since we haven't been working in this area of the code. That means it's likely something the court started doing.

From the HTML you posted, it looks like we need to add another step in our download to click the "continue" button, like it says.

That might not be too hard, but we don't use the query method in production, so I don't think writing a fix for this will be a priority for us, at least not immediately.

Do you want to take a stab at it? It shouldn't be that hard if you open the AppellateDocketReport object and look at the query method.
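
For anyone picking this up, here is a rough sketch of what the extra "continue" step might look like. It is not juriscraper code: `AutoSubmitFormParser` and `parse_saml_redirect` are hypothetical names, and it only uses the standard library to pull the form's action URL and hidden inputs (`RelayState`, `SAMLResponse`) out of a page like the one posted above. It assumes the interstitial always contains a single auto-submitting form with a `SAMLResponse` field, which has only been observed in the response shown in this thread.

```python
from html.parser import HTMLParser


class AutoSubmitFormParser(HTMLParser):
    """Collect the action URL and hidden inputs of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        attrs = dict(attrs)  # attribute values arrive already HTML-unescaped
        if tag == "form" and not self._in_form:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form:
            is_hidden = (attrs.get("type") or "").lower() == "hidden"
            if is_hidden and attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


def parse_saml_redirect(html_text):
    """Return (action_url, fields) for a SAML auto-submit page, else None."""
    parser = AutoSubmitFormParser()
    parser.feed(html_text)
    parser.close()
    if parser.action and "SAMLResponse" in parser.fields:
        return parser.action, parser.fields
    return None


# Hypothetical use, assuming a requests-style session object with .post():
#   redirect = parse_saml_redirect(response.text)
#   if redirect is not None:
#       action, fields = redirect
#       response = session.post(action, data=fields)
```

Returning `None` for ordinary pages means the check could run on every response without affecting the circuits that already work. Where exactly this would hook into the query method, and whether the court's endpoint needs extra cookies or headers, is not something this sketch tries to answer.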