freelawproject / recap

This repository is for filing issues on any RECAP-related effort.
https://free.law/recap/

Do not send "Select a Person" page to CL #362

Open sentry-io[bot] opened 8 months ago

sentry-io[bot] commented 8 months ago

I think this is a parse failure for a PACER docket. Can we take a look and see if a tweak makes sense?

Sentry Issue: COURTLISTENER-5DS

_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/courtlistener/cl/recap/tasks.py", line 898, in parse_case_query_page_text
    return report.data
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/juriscraper/pacer/case_query.py", line 308, in data
    data = self.metadata.copy()
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/juriscraper/pacer/case_query.py", line 138, in metadata
    [rows[0].find(".//font").text_content()]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text_content'
"""
AttributeError: 'NoneType' object has no attribute 'text_content'
(9 additional frame(s) were not displayed)
...
  File "cl/recap/views.py", line 65, in perform_create
    await asyncio.shield(recap_upload_task)
  File "cl/recap/tasks.py", line 133, in process_recap_upload
    docket = await process_case_query_page(pq.pk)
  File "cl/recap/tasks.py", line 925, in process_case_query_page
    data = await asyncio.get_running_loop().run_in_executor(

grossir commented 8 months ago

I don't think this is a Case Query page. Case Query pages have a table-like "header", and the document that raised the error does not. Two examples of Case Query pages from the test cases: [image]

This HTML page (s3) looks like a Case Query Advanced page, specifically a "Parties" page. juriscraper is not prepared to parse this yet. Apart from that, CL called the wrong parser (EDIT: I see that we do not support this kind of page yet):

def process_recap_case_query_result_page(self, pk):
    """Process case query result pages.

    For now, this is a stub until we can get the parser working properly in
    Juriscraper.
    """

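
For illustration, the `AttributeError` in the traceback comes from calling a method on the `None` that `find(".//font")` returns when the page has no such element. A minimal sketch of a guard for that case follows. It is not juriscraper's code: `extract_header_text` is a hypothetical helper, the sample rows are made up, and it uses the standard library's `xml.etree.ElementTree` on well-formed markup instead of lxml so that it runs on its own.

```python
import xml.etree.ElementTree as ET


def extract_header_text(row):
    """Return the text of the first <font> element in a row, or None.

    Hypothetical helper: returning None lets the caller treat the page as
    "not a Case Query page" instead of crashing with an AttributeError.
    """
    font = row.find(".//font")
    if font is None:
        # No header element found, so this is probably a different page type.
        return None
    return "".join(font.itertext()).strip()


# Made-up sample rows, only to exercise both branches.
case_query_row = ET.fromstring(
    "<tr><td><font>1:20-cv-00001 Doe v. Roe</font></td></tr>"
)
other_page_row = ET.fromstring("<tr><td>Select a Person</td></tr>")

print(extract_header_text(case_query_row))
print(extract_header_text(other_page_row))
```

A caller could check for `None` and skip or flag the upload rather than letting the worker process raise.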

mlissner commented 8 months ago

OK, so this is more of a RECAP extension bug. We shouldn't be sending the "Select a Person" page to CL in the first place. If it were a useful page, I'd say we should add support for it, but since it's not, yeah, we can just make sure the extension doesn't send it.

I'll refile this issue over in the recap repo.
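
A rough sketch of the kind of check described above, written in Python to match the other snippet in this thread even though the extension itself is not Python. Everything here is an assumption: `should_upload` is a hypothetical function, and identifying the page by the heading text "Select a Person" is a guess based only on this issue's title.

```python
# Headings of pages that are not worth uploading (assumed marker, lowercase).
SKIPPED_HEADINGS = {"select a person"}


def should_upload(page_heading: str) -> bool:
    """Return False for pages whose heading is on the skip list."""
    return page_heading.strip().lower() not in SKIPPED_HEADINGS


print(should_upload("Select a Person"))
print(should_upload("Some other page"))
```

Filtering before upload keeps the unsupported page out of CL's processing queue entirely, so the parser never sees it.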