freelawproject / recap

This repository is for filing issues on any RECAP-related effort.
https://free.law/recap/

Docket sheet texts not updating on CourtListener #297

Closed StanleyBolten closed 3 years ago

StanleyBolten commented 3 years ago

I have noticed after I go to the docket sheet on PACER for this: https://www.courtlistener.com/docket/4304407/united-states-v-hill/?page=2

There are two issues I need to report for CourtListener and RECAP tool.

First of all, it says "Date of Last Known Filing: May 10, 2021".

However, I used both the History area and the HTML docket sheet to update the docket, and neither updated this.

Then the text is not showing up for the last two docket entries.

Shows: 282

Aug 17, 2021

Main Document

USCA Order Re Petition for Rehearing

283

Aug 17, 2021

Main Document

USCA Order Re Petition for Rehearing

It won't show the docket text even after repeatedly trying to access the docket sheet for this case.

It used to be that when I went to the docket sheet, it showed the full text for each document (the text that tells what the document is about), and now it is not showing up for the two most recent documents. I can see it on PACER but not on CourtListener. I thought I'd report this problem. I sent an email to Mike Lissner and haven't heard back from him, so I am reporting this issue here.

flooie commented 3 years ago

@StanleyBolten - can you check again?

flooie commented 3 years ago

Definitely seems off for date of last known filing though. I do see the files on 8/17. Unless I'm misunderstanding your comment.

johnhawkinson commented 3 years ago

Bill: The issue is that the first two items at https://www.courtlistener.com/docket/4304407/united-states-v-hill/?filed_after=&filed_before=&entry_gte=&entry_lte=&order_by=desc are:

283 Aug 17, 2021    Main Document  USCA Order Re Petition for Rehearing
282 Aug 17, 2021    Main Document  USCA Order Re Petition for Rehearing

Which are the short summary descriptions that appear to be from the RSS feed. Running the actual docket report (presumably), there is full docket text analogous to what we see for 269:

269 Nov 17, 2020    USCA Order denying the Petition for Rehearing and Rehearing En Banc re: 203 Notice of Appeal Without Fee Payment. USCA Case #19-4758. (Daniel, J) (Entered: 11/17/2020)
Main Document   USCA Order Re Petition for Rehearing

So the docket parser is presumably not parsing it right. It'd be helpful to have the HTML to check, of course.

mlissner commented 3 years ago

Well, I went and attempted to RECAP the item by hitting the "View on PACER" button on the CL link, and it seems like it got RECAPed OK.

Looking in the API logs:

https://www.courtlistener.com/api/rest/v3/recap/?pacer_case_id=64541&order_by==date_created&upload_type=1

There are a few failed uploads for this:

https://www.courtlistener.com/api/rest/v3/recap/ https://www.courtlistener.com/api/rest/v3/recap/5272246/

Those both report status of 4, which means, "Item is currently being processed." Usually when that status gets stuck, that means the item crashed while processing.
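As a rough illustration of the "stuck in processing" check described above, here is a minimal sketch that flags upload records still at status 4 long after they were created. The record shape (`id`, `status`, `date_created`), the sample values, and the 30-minute threshold are assumptions made for the example, not CourtListener's actual schema or tooling.

```python
from datetime import datetime, timedelta, timezone

# Status value quoted in this thread: "Item is currently being processed."
PROCESSING_IN_PROGRESS = 4


def find_stuck_uploads(records, now, max_age=timedelta(minutes=30)):
    """Return records still marked as processing long after creation.

    `max_age` is an arbitrary, hypothetical threshold.
    """
    stuck = []
    for record in records:
        if record["status"] != PROCESSING_IN_PROGRESS:
            continue
        created = datetime.fromisoformat(record["date_created"])
        if now - created > max_age:
            stuck.append(record)
    return stuck


# Hypothetical records; ids and timestamps are made up for the example.
sample = [
    {"id": 1, "status": 4, "date_created": "2021-08-17T09:00:00-07:00"},
    {"id": 2, "status": 2, "date_created": "2021-08-17T09:00:00-07:00"},
    {"id": 3, "status": 4, "date_created": "2021-08-23T14:04:02-07:00"},
]
now = datetime(2021, 8, 23, 21, 10, tzinfo=timezone.utc)
stuck_ids = [r["id"] for r in find_stuck_uploads(sample, now)]
print(stuck_ids)
```

Record 3 is also at status 4 but is only a few minutes old at `now`, so it is treated as still processing rather than stuck.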

Taking a step further into this issue, I pulled the associated HTML from these dockets, attached. It crashed the docket parser, but I haven't had time to figure out why yet. If anybody else wants to take a look, I won't stop you (but chime in so we don't both do it).

Here's the stack trace:

In [7]: from juriscraper.pacer import DocketReport

In [8]: report = DocketReport('ncmd')

In [9]: with open(pq.filepath_local.path, 'r') as f:
   ...:     text = f.read()
   ...: 
   ...: 

In [10]: report._parse_text(text)

In [11]: report.data
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-11-f54b48b7fd72> in <module>
----> 1 report.data

/usr/local/lib/python3.8/site-packages/juriscraper/pacer/docket_report.py in data(self)
     66             return {}
     67 
---> 68         data = self.metadata.copy()
     69         data["parties"] = self.parties
     70         data["docket_entries"] = self.docket_entries

/usr/local/lib/python3.8/site-packages/juriscraper/pacer/docket_report.py in metadata(self)
    414             return self._metadata
    415 
--> 416         self._set_metadata_values()
    417         data = {
    418             "court_id": self.court_id,

/usr/local/lib/python3.8/site-packages/juriscraper/pacer/docket_report.py in _set_metadata_values(self)
   1119     def _set_metadata_values(self):
   1120         # The first ancestor table of the table cell containing "date filed"
-> 1121         table = self.tree.xpath(
   1122             # Match any td containing Date [fF]iled
   1123             '//td[.//text()[contains(translate(., "f", "F"), "Date Filed:")]]'

IndexError: list index out of range

5266769.txt

johnhawkinson commented 3 years ago

By the time this makes it to 1121, tostring(self.tree.xpath) is merely b'<div><body></body></div>', so it's not surprising that it cannot find the upper metadata table.
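To make the failure mode concrete: indexing the first match of a lookup that returned nothing raises exactly the `IndexError` in the trace. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and `find_date_filed_cell` is a hypothetical helper, not juriscraper code. It also lowercases the whole cell text, which is broader than the `translate(., "f", "F")` trick in the original XPath.

```python
import xml.etree.ElementTree as ET


def find_date_filed_cell(tree):
    """Return the first <td> mentioning 'Date Filed:' (case-insensitive), or None."""
    for cell in tree.iter("td"):
        text = "".join(cell.itertext())
        if "date filed:" in text.lower():
            return cell
    return None


# The tree as described above, after cleaning has emptied it.
emptied = ET.fromstring("<div><body></body></div>")
# A made-up intact fragment with a placeholder date.
intact = ET.fromstring(
    "<div><body><table><tr><td>Date Filed: 01/01/2000</td></tr></table></body></div>"
)

# Unguarded indexing of an empty match list reproduces the error in the trace.
matches = list(emptied.iter("td"))
try:
    matches[0]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__

print(find_date_filed_cell(emptied), error_name)
print(find_date_filed_cell(intact).text)
```

Returning `None` (or raising a descriptive error) for an empty tree would at least distinguish "the cleaner removed everything" from "this docket has no metadata table".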

This seems to be a consequence of strip_bad_html_tags_insecure()'s excessively large hammer. Changing scripts=True to scripts=False causes the Cleaner to no longer destroy the tree, but then the docket report parser still fails, apparently because the way you're invoking lxml causes it to get lost in scripts.

I don't understand why this should be the case since invoking lxml naively works just fine:

>>> import lxml
>>> f=open("/var/tmp/d.html")
>>> t=f.read()
>>> import lxml.html
>>> tree=lxml.html.fromstring(t)
>>> len(tree)
2
>>> tree
<Element html at 0x101159720>
>>> list(tree)
[<Element head at 0x10116b860>, <Element body at 0x1017b5e50>]
>>> list(tree.body)
[<Element iframe at 0x1017b5f90>, <Element div at 0x100eff950>, <Element script at 0x101822040>, <Element div at 0x101822090>]
>>> len(tree.findall('td'))
0
>>> len(tree.findall('//td'))
>>> len(tree.findall('.//td'))
48
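The three `findall` calls in the transcript differ because `findall` takes ElementPath expressions, not full XPath. The sketch below shows the same three path forms using the standard library's `xml.etree.ElementTree` on a made-up document; it assumes lxml's `findall` follows the same ElementPath rules, which may explain why the `'//td'` call in the transcript shows no count.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up, well-formed document standing in for the docket HTML.
doc = ET.fromstring(
    "<html><head/><body><div><table>"
    "<tr><td>a</td><td>b</td></tr>"
    "</table></div></body></html>"
)

direct = doc.findall("td")     # only direct children of <html>
nested = doc.findall(".//td")  # all descendants of <html>

# An absolute path is rejected when called on an element.
try:
    doc.findall("//td")
    absolute_error = None
except SyntaxError as exc:
    absolute_error = str(exc)

print(len(direct), len(nested), absolute_error is not None)
```

For a full XPath query such as `//td`, lxml's `tree.xpath(...)` is the method the parser in the stack trace uses.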

I don't have the energy to fight with this tonight, but all this hair around the HTML parser, and the attempts to "clean up" things before and after invoking it, are breaking it badly. But I've been saying that for years…

It might be easier to whack on the input HTML until it gets better and see what the difference is.

StanleyBolten commented 3 years ago

Thank you for fixing the issues. Now the docket text is showing up. I am using a PACER account where I don't want to rack up too many charges by going back to the docket sheet five or six times, trying to fix an issue I couldn't fix because I don't work at CourtListener.

I am grateful that the issue had gotten fixed.

282

Aug 17, 2021

USCA ORDER denying the petition for rehearing and rehearing en banc. No judge requested a poll under Fed. R. App. P. 35 on the petition for rehearing en banc as to BRIAN DAVID HILL re: 270 Notice of Appeal Without Fee Payment. USCA Case #20-7737. (Daniel, J) (Entered: 08/17/2021)

Main Document

USCA Order Re Petition for Rehearing

283

Aug 17, 2021

USCA Order denying the petition for rehearing and rehearing en banc. No judge requested a poll under Fed. R. App. P. 35 on the petition for rehearing en banc as to BRIAN DAVID HILL re: 226, 238 Notice of Appeals Without Fee Payment. (Civil Action 1:17CV1036) USCA Case Nos. 19-7755(L) and 20-6034. (Daniel, J) (Entered: 08/17/2021)

Main Document

USCA Order Re Petition for Rehearing

That was pasted from CourtListener. So it now reflects the docket sheet that had already been accessed multiple times. Thank you for fixing this.

One last little blip, which isn't that important (it isn't life or death): the only issue left is the wrong last filing date.

Date of Last Known Filing: May 10, 2021

Other than that, the docket text issue has been resolved.

Thank You!!!!!!

StanleyBolten commented 3 years ago

[Screenshot: FireShot Capture 1845 - United States v HILL, 1_13-cr-00435 – CourtListener]

Here is the proof. It was fixed and actually shows the docket text of the last two documents. So now it is corrected.

StanleyBolten commented 3 years ago

It was very important that the issue got fixed. I am grateful to the RECAP CourtListener team. I have been writing an article about what the Fourth Circuit has done, and linking to the docket sheet is the most important thing I have right now. CourtListener is the most important viral source and allows average people to access PACER docket sheets and documents (technically, a copy of it all) without the fees. So now people have a better chance to investigate the federal court files and to debate and discuss this stuff. It couldn't have happened without the hard work of your team.

Thank you all and God bless you for your work.

StanleyBolten commented 3 years ago

There is another docket issue.

https://www.courtlistener.com/docket/60273358/united-states-v-shroyer/

I did both the History/Documents and Docket Sheet HTML and it is not adding the docket text to that either.

In addition to that, I RECAP'ed Document 6, and it said multiple times "PDF Uploaded to the RECAP Archive", yet Document 6 still says to buy the document and is not being RECAP'ed. I thought I'd report this issue as well.

So I get a notification that it was RECAP'ed, but it will not add Document 6, while other documents are uploading there with no problems. The docket text is again not being added. This is from PACER in the Western District of Texas CM/ECF.

johnhawkinson commented 3 years ago

@StanleyBolten: I think we may have misled you. This problem is not fixed.

> Thank you for fixing the issues. Now the docket text is showing up.

To the extent that it was repaired, it was done as a one-off. We've made some progress identifying the problem but it has not been fixed. I'd imagine that whatever situation causes this problem will continue to be an issue until we repair it.

That said, it would be helpful for you to offer more specificity in the bug report:

> I did both the History/Documents and Docket Sheet HTML and it is not adding the docket text to that either.

That is, which docket entries should have different text? That said, in this case it appears none of them have the full text, so there's not much ambiguity.

> In addition to that, I RECAP'ed the Document 6 and it said multiple times PDF Uploaded to the RECAP Archive and yet Document 6 is still saying to buy the Document and is not being RECAP'ed. I thought I'd report on this issue as well.

Document 6 is there now. Perhaps the server was slow to process it. It appears from the logs it took 3 minutes, which is certainly longer than usual:

            "date_created": "2021-08-23T14:04:02.405160-07:00",
            "date_modified": "2021-08-23T14:07:03.601884-07:00",
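
The "3 minutes" figure can be checked directly from the two timestamps in the log. A small sketch using only the standard library:

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
created = datetime.fromisoformat("2021-08-23T14:04:02.405160-07:00")
modified = datetime.fromisoformat("2021-08-23T14:07:03.601884-07:00")

elapsed = modified - created
whole_minutes = int(elapsed.total_seconds() // 60)
print(elapsed, whole_minutes)
```

The gap comes to just over three minutes (about 181 seconds), consistent with the estimate above.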

mlissner commented 3 years ago

@StanleyBolten, I took a closer look at the HTML from this upload and it's riddled with JS from other extensions you have installed. Can you try to do this again without those extensions? I'll email you the ones that seem to be involved so they're not posted publicly. I think the reason you're having issues is because those extensions inject JS that messes up our parsers.

We have code that blocks uploads like these because those extensions often will monkey with the page in bad ways (like there's one that changes every instance of "Trump" to something else), but it seems like we don't need to do that here since these extensions simply break our parsers outright.
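For illustration only, here is one way a check for extension-injected scripts might look. This is a hypothetical sketch, not the blocking code referred to above: it only looks for `<script>` tags whose `src` uses a browser-extension URL scheme, and it would miss inline JS that an extension injects. The sample page and the extension id in it are made up.

```python
from html.parser import HTMLParser

# URL schemes used by Chromium-based browsers and Firefox for extension resources.
EXTENSION_SCHEMES = ("chrome-extension://", "moz-extension://")


class ExtensionScriptFinder(HTMLParser):
    """Collect the src of every <script> tag loaded from a browser extension."""

    def __init__(self):
        super().__init__()
        self.suspect_sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src") or ""
        if src.startswith(EXTENSION_SCHEMES):
            self.suspect_sources.append(src)


def find_extension_scripts(html):
    """Return the list of extension-hosted script URLs found in `html`."""
    finder = ExtensionScriptFinder()
    finder.feed(html)
    finder.close()
    return finder.suspect_sources


# Made-up page: one injected script, one ordinary script.
page = (
    "<html><body>"
    '<script src="chrome-extension://abcdef/inject.js"></script>'
    '<script src="/cgi-bin/page.js"></script>'
    "<table><tr><td>Date Filed:</td></tr></table>"
    "</body></html>"
)
found = find_extension_scripts(page)
print(found)
```

A non-empty result could be used either to reject the upload or to strip the offending tags before parsing; which is appropriate depends on how much the extension has altered the rest of the page.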

mlissner commented 3 years ago

Closing for now, since I think we've found the underlying cause, but if disabling the extensions doesn't fix this, let's reopen.