freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

More extraction problems: ****Error: Unable to extract content due to unknown extension, extracting text from js: 9927669 -... #3695

Open sentry-io[bot] opened 9 months ago

sentry-io[bot] commented 9 months ago

Sentry Issue: COURTLISTENER-6GQ

****Error: Unable to extract content due to unknown extension, extracting text from js: 9927669 - Abbott v. City of Henderson****

flooie commented 9 months ago

Nevada

flooie commented 9 months ago

@mlissner, this isn't an issue in juriscraper per se. I think this is CL being allowed to redirect away because of reCAPTCHA, plus a failure to recognize duplicates mixed in as well.

https://www.courtlistener.com/opinion/9469503/abbott-v-city-of-henderson/

https://www.courtlistener.com/opinion/9469831/abbott-v-city-of-henderson/

If you look, the backup link to the court's site points directly to a PDF for the opinion whose extraction failed as JS, and both cases were added.

flooie commented 9 months ago

I opened a small commit to check the file type:

https://github.com/freelawproject/courtlistener/pull/3696

@grossir, could you add your thoughts?

grossir commented 9 months ago

@flooie I don't think the check should go in that actual part of the code:

The "js" extension is not taken directly from the response; it is inferred using doctor in L137-L141 of cl_scrape_opinions/make_objects. get_extension calls the extract_extension view in doctor.

    cf = ContentFile(content)
    extension = get_extension(content)
    file_name = trunc(item["case_names"].lower(), 75) + extension
    opinion.file_with_date = cluster.date_filed
    opinion.local_path.save(file_name, cf, save=False)

I think this would be the proper place to put the if clause:

    cf = ContentFile(content)
    extension = get_extension(content)
    if extension == ".js":
        logger.error(...)
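
A minimal, self-contained sketch of that guard follows. The helper name `is_expected_extension` and the `EXPECTED_EXTENSIONS` allow-list are assumptions made for illustration; they are not taken from CourtListener or doctor, and the real check would sit next to the `get_extension` call shown above.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: extensions a scraped opinion could legitimately have.
# The real list would need to come from the project itself.
EXPECTED_EXTENSIONS = {".pdf", ".html", ".doc", ".docx", ".wpd", ".txt"}


def is_expected_extension(extension: str, download_url: str) -> bool:
    """Return True if the inferred extension looks like a document type.

    Hypothetical helper: it logs an error rather than raising, so a
    scrape run can continue past one bad item.
    """
    if extension.lower() in EXPECTED_EXTENSIONS:
        return True
    logger.error(
        "Unexpected extension %r inferred for content from %s",
        extension,
        download_url,
    )
    return False
```

Checking against an allow-list rather than only `".js"` would also surface other odd extensions, at the cost of having to maintain the list.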

As to this specific error:

  1. The "js" file we have saved in S3 is actually plain HTML. It is the case page from which we get the actual PDF URL. The URL assigned to it leads to the actual PDF, though.

  2. The weird part is that I ran the S3 HTML we have saved against this doctor view and got an .html extension. I don't know why doctor is doing this, but an extra logger.error call will help with this and other issues.

  3. I noticed something different at the top of the page. It says: "24-02902: This document is currently unavailable. If you need a copy of this document, please contact Clerk's Office at (775)684-1600." We can trigger this error by deleting the deLinkID parameter or passing a wrong link ID. For example:

https://caseinfo.nvsupremecourt.us/document/view.do?csNameID=63623&csIID=63623&deLinkID=991002&onBaseDocumentNumber=24-02902

Notice that it is the link ID, and not any other part of the URL, that controls which document is downloaded:

https://caseinfo.nvsupremecourt.us/document/view.do?csNameID=63623&csIID=63623&deLinkID=932002&onBaseDocumentNumber=24-02902

https://caseinfo.nvsupremecourt.us/document/view.do?csNameID=63623&csIID=63623&deLinkID=931002&onBaseDocumentNumber=24-02902
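
To confirm that the two URLs above differ only in `deLinkID`, the query strings can be compared with the standard library. This is a small sketch; `differing_params` is a hypothetical helper name, not something from juriscraper or CourtListener.

```python
from urllib.parse import parse_qs, urlsplit


def differing_params(url_a: str, url_b: str) -> set[str]:
    """Return the names of query parameters whose values differ."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    keys = params_a.keys() | params_b.keys()
    return {key for key in keys if params_a.get(key) != params_b.get(key)}


base = "https://caseinfo.nvsupremecourt.us/document/view.do"
common = "csNameID=63623&csIID=63623"
url_a = f"{base}?{common}&deLinkID=932002&onBaseDocumentNumber=24-02902"
url_b = f"{base}?{common}&deLinkID=931002&onBaseDocumentNumber=24-02902"

print(differing_params(url_a, url_b))  # {'deLinkID'}
```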

Maybe an internal server error is causing this? Still, I don't get why doctor infers a .js extension.
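
One way to narrow this down independently of doctor is to look at the leading bytes of the downloaded content before saving it. The sketch below is an assumption-laden illustration, not how doctor infers extensions: `sniff_content_type` is a hypothetical helper that only distinguishes a PDF header from an HTML page such as the "currently unavailable" notice.

```python
def sniff_content_type(content: bytes) -> str:
    """Guess a coarse type from the leading bytes of a response body."""
    # Ignore leading whitespace, which some servers emit before the markup.
    head = content[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

If the scraper expects a PDF and this returns "html", the response is probably an error or landing page and could be logged instead of saved as an opinion.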