Open sentry-io[bot] opened 9 months ago
Nevada
@mlissner, this isn't an issue in juriscraper per se. I think this is CL being allowed to redirect away because of reCAPTCHA, with a failure to recognize duplicates mixed in as well.
https://www.courtlistener.com/opinion/9469503/abbott-v-city-of-henderson/ https://www.courtlistener.com/opinion/9469831/abbott-v-city-of-henderson/
If you look, the backup link to the court goes directly to a PDF for the one with the JS file extension, and both cases were added.
I opened a small commit to check the file type
https://github.com/freelawproject/courtlistener/pull/3696
@grossir, could you add your thoughts?
@flooie I don't think the check should go in that actual part of the code:
The "js" extension is not taken directly from the response, but inferred using doctor in L137-L141 of cl_scrape_opinions/make_objects. get_extension calls the extract_extension view in doctor.
cf = ContentFile(content)
extension = get_extension(content)
file_name = trunc(item["case_names"].lower(), 75) + extension
opinion.file_with_date = cluster.date_filed
opinion.local_path.save(file_name, cf, save=False)
I think this would be the proper place to put the if clause:
cf = ContentFile(content)
extension = get_extension(content)
if extension == ".js":
    logger.error(...)
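As a rough, self-contained sketch of that check (not the actual CourtListener code): `check_extension` and `UNEXPECTED_EXTENSIONS` are hypothetical names made up for illustration, and the log message is only a suggestion for what could go in place of the elided arguments.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical: extensions a court opinion should never end up with.
UNEXPECTED_EXTENSIONS = {".js"}


def check_extension(extension: str, url: str) -> bool:
    """Return True if the inferred extension looks acceptable.

    Logs an error (so it surfaces in Sentry or similar) and returns
    False when the extension is one we never expect for an opinion.
    """
    if extension.lower() in UNEXPECTED_EXTENSIONS:
        logger.error(
            "Unexpected extension %r inferred for content from %s",
            extension,
            url,
        )
        return False
    return True
```

The caller could then decide whether to skip saving the file or save it anyway and rely on the log entry for follow-up.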
As to this specific error:
The "js" file we have saved in S3 is actually plain HTML. It is the case page from which we get the actual PDF URL. The URL assigned to it leads to the actual PDF, though.
The weird part is that I ran the S3 HTML we have saved against this doctor view and got an .html extension. I don't know why doctor is doing this, but an extra logger.error call will help with this and other issues.
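One way to cross-check an inferred extension is to sniff the first bytes of the content independently. The sketch below is an assumption-laden illustration, not how doctor infers extensions: `sniff_kind` is a hypothetical helper that only distinguishes PDF from HTML.

```python
def sniff_kind(content: bytes) -> str:
    """Guess whether raw content is a PDF, HTML, or something else.

    Only looks at the first kilobyte, ignoring leading whitespace
    and letter case.
    """
    head = content[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        # PDF files begin with the "%PDF-" magic marker.
        return "pdf"
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"
    return "unknown"
```

If the sniffed kind is "html" while the inferred extension is ".js", that mismatch would be worth including in the logged error.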
I noticed something different at the top of the page. It says: 24-02902: This document is currently unavailable. If you need a copy of this document, please contact Clerk's Office at (775)684-1600.
Now, we can trigger this error by deleting the linkID or putting in a wrong link ID. For example:
Notice that it is the link ID, and not any other part of the URL, that controls which document is downloaded.
Maybe an internal server error is causing this? Still, I don't get why doctor infers a .js extension.
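Since the court returns an HTML page carrying the "currently unavailable" notice instead of the PDF, the scraper could look for that notice before saving anything. This is only a sketch: `is_unavailable_page` is a hypothetical helper, and it assumes the notice text quoted above appears verbatim in the response body.

```python
# Notice text as quoted from the court's page above.
UNAVAILABLE_MARKER = "This document is currently unavailable"


def is_unavailable_page(content: bytes) -> bool:
    """Return True if the response body carries the court's notice."""
    # Decode leniently: the body may be HTML in an unknown encoding.
    text = content.decode("utf-8", errors="replace")
    return UNAVAILABLE_MARKER.lower() in text.lower()
```

A match could be logged and the item skipped, so that the placeholder page is not stored as an opinion.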
Sentry Issue: COURTLISTENER-6GQ