freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

cl_find_citations changed the content of citations #1796

Open flooie opened 3 years ago

flooie commented 3 years ago

It's unclear how widespread this problem is, but a user identified that cl_find_citations, the method that generates the html_with_citations content, converted

(47 O.S. 1991, § 11-903) to (47 Ohio St. 1991, § 11-903)

The first is a citation to the Oklahoma Statutes; the latter is an Ohio court citation. At some point this fixed itself, but we need to re-run cl_find_citations on the incorrect files to fix the html_with_citations.

This leaves us with a couple of questions: how did this occur, and how many other cases are affected?

flooie commented 3 years ago

This issue seems to be rather widespread.

I'm still investigating the origins of this, but I've found dozens and dozens of examples in the Oklahoma jurisdictions at this point. Confusing the matter, the HTML with Citations changes the text of certain citations without generating hyperlinks, which I had thought would be an indicator to look for.

But check this out. The first is a screen capture from the original PDF; the second is from the CL page displaying the HTML with Citations. Yikes.

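[image: screen capture from the original PDF]

[image: the same passage on the CL page displaying the HTML with Citations]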

I'm still sort of grappling with how to find these errors outside of Oklahoma. But I think we may need to rerun html with citations across all Oklahoma jurisdictions.
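One rough way to hunt for these outside of Oklahoma, sketched with only the Python standard library: strip the tags from the HTML with Citations and diff the remaining words against the plain text. Any non-equal span is a place where the annotated version changed the wording, whether or not a hyperlink was generated. The function names `visible_text` and `changed_spans` are made up for this sketch, and it assumes both versions of the opinion text are already in hand as strings; it is not how CL does anything today.

```python
import difflib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def changed_spans(plain_text, html_with_citations):
    """Return (original, rewritten) pairs where the words of the annotated
    HTML differ from the words of the plain text."""
    before = plain_text.split()
    after = visible_text(html_with_citations).split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (" ".join(before[i1:i2]), " ".join(after[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical usage with the citation from this issue:
plain = "See (47 O.S. 1991, § 11-903) for details."
html = "<p>See (47 Ohio St. 1991, § 11-903) for details.</p>"
print(changed_spans(plain, html))
```

Comparing word by word keeps the output short enough to eyeball. It would also flag harmless rewrites such as "A. 2d" becoming "A.2d", so the results would still need a manual pass or a filter.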

mlissner commented 3 years ago

The more you've investigated this, the more my memory has come into focus on what's going on here. I think in an earlier version of the citation finder, the thought was that it should clean up bad citations, making them better. So if it says A. 2d, the citator would fix it to A.2d.

I sort of thought that that code never landed, because of issues like this, but increasingly I'm remembering that it did, so I think this is a manifestation of that. The fix, like you say, is to re-run the danged thing. It'd be nice to do that in a focused way, but I bet we could run it across everything over the period of a week and put this behind us.
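To make the failure mode concrete, here is a minimal sketch of that kind of cleanup. It is not the code that was in the citation finder; the `VARIATIONS` table and `normalize_reporters` are invented for illustration, and the "O.S." entry is an assumption chosen to reproduce the rewrite reported in this issue.

```python
import re

# Hypothetical variation table: maps a spelling found in the text to the
# "canonical" reporter abbreviation. Not taken from reporters-db.
VARIATIONS = {
    "A. 2d": "A.2d",      # the intended kind of fix
    "O.S.": "Ohio St.",   # assumed entry; wrong when O.S. means Oklahoma Statutes
}


def normalize_reporters(text, variations=VARIATIONS):
    """Replace every known variation in `text` with its canonical form."""
    # Try longer variations first so a short key cannot pre-empt a longer one.
    keys = sorted(variations, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: variations[m.group(0)], text)


print(normalize_reporters("123 A. 2d 456"))
print(normalize_reporters("(47 O.S. 1991, § 11-903)"))
```

The lookup has no way to tell that the second string is a statute rather than a case reporter, so an ambiguous abbreviation gets rewritten in the stored text, which is why rewriting in place is risky.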

mlissner commented 2 years ago

We should verify if this is an ongoing problem and if it's fixed by re-running the citation code. If so, let's re-run the citation finder.

flooie commented 2 years ago

Well, the good news is that cl_find_citations doesn't appear to mess around with the Oklahoma Statutes citations anymore. But eyecite still identifies Oklahoma Statutes citations as Ohio State citations.

But... I think we need to delete the html-with-citations for all the Oklahoma opinions and reprocess them all.

mlissner commented 2 years ago

Thanks for checking. If you want to give that a go via a Python terminal, go for it. Otherwise, we'll get to it eventually. No need to delete the content though, it'll just get updated when the citation finder is run. We could do that now or just let it happen naturally when we're ready to do so. I imagine there have been a LOT of changes to eyecite and the reporters-db that warrant a re-run of the citation finder across everything.

We could do that now, or it might be wise to wait until more fixes land in reporters-db.

flooie commented 2 years ago

That makes sense to me @mlissner