Avoiding overlapping citation extents in find_citations.py

slbayer commented 4 years ago

As I described in issues 1338 and 1344, I'm using a portion of the CourtListener code to write a standalone citation finder. This component is an NLP mention finder, so the extent of the component turns out to be important. I know that this isn't the intention of the citation finder in CourtListener, but in some cases the goals may overlap. I'd be surprised if this issue describes one of them, but @mlissner invited me to submit the issue, so here goes.

The general strategy in find_citations.py is to search through the list of tokens in the document and look for "anchors" for a citation: a reporter for the full and short citations, "Id." and "Ibid." and "supra" for other cases, and the section sign (§) for non-opinion citations. Once an anchor is found, functions dedicated to the individual full, short and supra types are called to "build out" the citation to the left and right to capture the relevant information.
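As a rough illustration of the anchor-and-build-out idea, here is a toy sketch. It is not the actual find_citations.py code: the three-entry reporter set, the ToyCitation class and the function name are all made up for the example, and the "build out" step only looks one token to each side.

```python
from dataclasses import dataclass

# Hypothetical, tiny reporter list; the real code draws on a full reporter database.
REPORTERS = {"U.S.", "S.Ct.", "L.Ed.2d"}


@dataclass
class ToyCitation:
    volume: str
    reporter: str
    page: str
    anchor_index: int  # position of the reporter token in the token list


def find_anchored_citations(text):
    """Scan whitespace-split tokens for reporter anchors, then take the
    token to the left as the volume and the token to the right as the page."""
    tokens = text.split()
    citations = []
    for i, token in enumerate(tokens):
        # Skip non-anchors and anchors with no neighbor on one side.
        if token not in REPORTERS or i == 0 or i + 1 >= len(tokens):
            continue
        volume = tokens[i - 1].strip(",")
        page = tokens[i + 1].strip(",")
        citations.append(ToyCitation(volume, token, page, i))
    return citations
```

Run on a string such as "120 S.Ct. 2097,147 L.Ed.2d 105", the single token "2097,147" is claimed as the page of one citation and the volume of the next, which is the kind of shared extent this issue is about.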

This is a pretty clever strategy, and I haven't changed it in my version of the code. However, because of the way it's written, it's possible to end up with a citation which overlaps in token extent with the citations to its left and right. For my purposes, this is pretty disastrous; for yours, probably not so much, although there may be cases where it eventually leads to the wrong peripheral information.

The solution I've come up with requires a major refactor of the code, because after each citation is found, I need to see whether it overlaps with its predecessor, and in some circumstances it will result in my rebuilding the citations with start and/or end token limits which can't be exceeded, or simply discarding the citation entirely.
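The overlap check itself can be small. The sketch below assumes each citation records an inclusive (start, end) pair of token indices and that citations arrive in document order; the Extent type and the function names are hypothetical and not part of find_citations.py.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    start: int  # index of the first token covered
    end: int    # index of the last token covered (inclusive)


def overlaps(a: Extent, b: Extent) -> bool:
    """Two inclusive ranges overlap when each starts no later than the other ends."""
    return a.start <= b.end and b.start <= a.end


def find_overlapping_neighbors(extents: List[Extent]) -> List[int]:
    """Return every index i whose extent overlaps the extent at i - 1.

    Assumes the extents are listed in document order.
    """
    return [
        i for i in range(1, len(extents))
        if overlaps(extents[i - 1], extents[i])
    ]
```

Any index this returns marks a citation that would need to be rebuilt with tighter limits or discarded.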

A particularly perverse example is:

Reeves v. Sanderson Plumbing Prods., 530 U.S. 133, 148, 120 S.Ct. 2097,147 L.Ed.2d 105 (2000).

Note the lack of a space in 2097,147 (this may be an artifact of extraction with Apache Tika, or it may be in the original, I haven't checked). The impact of this is that there are three separate reporters, U.S., S.Ct. and L.Ed.2d, all of which grab this entire text sequence.

In my solution, I'm keeping track of the start and end tokens for each citation, so I know when I hit this problem. One thing I've done is introduce an additional notion of the minimal start and end token, so that reanalyses have the option of ignoring some peripheral information. So, e.g., in the case above, I can reanalyze the citation anchored on the U.S. reporter to exclude everything after 133, (obviously, the cutoff should be after 148, but that would require a more sophisticated page parser, which I haven't tackled yet). The citation anchored on the S.Ct. reporter then starts at 148. Unfortunately, because of that missing space in 2097,147, there's no way to create two citations out of the remainder of the string, and one of them ends up being dropped.
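One way to read the minimal-extent idea is sketched below. This is a guess at the shape of the approach, not the author's code: the Span class, its field names and the resolution rule (trim the predecessor back to its core, start the next citation right after it, drop a citation whose core collides with its predecessor's core) are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Span:
    start: int      # first token of the full extent (inclusive)
    end: int        # last token of the full extent (inclusive)
    min_start: int  # first token of the core "volume reporter page"
    min_end: int    # last token of the core


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Walk spans in document order and remove overlaps between neighbors."""
    kept: List[Span] = []
    for span in spans:
        if kept and span.start <= kept[-1].end:
            prev = kept[-1]
            if span.min_start <= prev.min_end:
                # Even the cores share a token, so this citation is dropped.
                continue
            # Trim the predecessor to its core and start this one just after.
            kept[-1] = replace(prev, end=prev.min_end)
            span = replace(span, start=prev.min_end + 1)
        kept.append(span)
    return kept
```

With the Reeves string split on whitespace (15 tokens, indexed 0 to 14), the three citations each claim the full extent 0 to 14, with cores at tokens 5 to 7, 9 to 11 and 11 to 13. Under the rule above, the U.S. citation is trimmed to end at "133,", the S.Ct. citation starts at "148,", and the L.Ed.2d citation is dropped because the "2097,147" token belongs to both cores.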

It should be obvious from this description that find_citations.py would need to be doing a lot more, and a lot more different, work, and it's not clear, as I said, that it matters for your purposes. I report this for the sake of completeness.

mlissner commented 4 years ago

Hm, so this is the crux of it, right?

In [10]: s = 'Reeves v. Sanderson Plumbing Prods., 530 U.S. 133, 148, 120 S.Ct. 2097,147 L.Ed.2d 105 (2000). '
In [11]: cites = get_citations(s)
In [13]: for cite in cites:
    print('Vol.: %s' % cite.volume)
    print('Rep.: %s' % cite.reporter)
    print('Pg.: %s' % cite.page)
   ....:     
Vol.: 530
Rep.: U.S.
Pg.: 133
Vol.: 120
Rep.: S. Ct.
Pg.: 2097147  # <--- OOOF!
Vol.: 2097147
Rep.: L. Ed. 2d
Pg.: 105

mlissner commented 4 years ago

If that's the extent of it, I guess we need to think again about whether numbers with commas in them are numbers at all.

mlissner commented 4 years ago

Copying @mattdahl, in hopes he's interested in this too.

slbayer commented 4 years ago

One of the problems with pattern-matching approaches, as opposed to something more statistical, is that it's pretty much all or nothing when you find an edge case. I don't know whether you'll ever encounter a situation where a comma in a number is still a number for you, but clearly, sometimes it is and sometimes it isn't, and it's hard to account for typos, etc. Another example of this is that there are some reporters which have a variant which is a person's name, and when this person is the plaintiff, and the name is preceded by a number (e.g., a page number, footnote number, etc.), " v" will be recognized as a citation:

37 Taylor v. Bd. of Educ., 240 Fed. Appx. 717 (6th Cir. 2007)

You'll also catch the real reference, but you'll also hallucinate one.

Every approach has some level of error :-).

mlissner commented 4 years ago

Yeah, corner cases always abound. In your latest example, the page is being caught as v because of roman numerals. Not sure what we can do about that one, because the roman numerals can be uppercase or lowercase.

In [2]: s = '37 Taylor v. Bd. of Educ., 240 Fed. Appx. 717 (6th Cir. 2007)'

In [3]: cites = get_citations(s)
DEBUG (0.004) SELECT "search_court"."citation_string", "search_court"."id" FROM "search_court" ORDER BY "search_court"."position" ASC; args=()

In [4]: cites
Out[4]: 
[37 Taylor v Bd. of Educ., 240 Fed. Appx. 717 (ca6 2007),
 Taylor v. Bd. of Educ., 240 Fed. Appx. 717 (ca6 2007)]

In [5]: for cite in cites:                                                
    print('Vol.: %s' % cite.volume)
    print('Rep.: %s' % cite.reporter)
    print('Pg.: %s' % cite.page)
   ...:     
Vol.: 37
Rep.: Taylor
Pg.: v # <-- oops?
Vol.: 240
Rep.: Fed. Appx.
Pg.: 717

Keep the examples coming though, you're definitely finding areas of refinement.

brianwc commented 4 years ago

We could check my idea by searching through the Indigo Book, but I don't think there's ever a legitimate citation to a court reporter with a comma in the number. Thus, maybe the "fix" is for this code to ensure that every comma is always followed by a space before it tries to parse the citations.
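A minimal sketch of that pre-processing step, using only the standard library. The function name is made up, and the sketch assumes the rule really is "no legitimate page number contains a comma"; a genuine thousands separator such as 1,234 would be split as well.

```python
import re

# A comma immediately followed by a non-whitespace character.
_COMMA_BEFORE_NONSPACE = re.compile(r",(?=\S)")


def space_after_commas(text: str) -> str:
    """Insert a space after every comma that is not already followed by whitespace."""
    return _COMMA_BEFORE_NONSPACE.sub(", ", text)
```

Note that this changes character offsets, so any extents computed on the normalized text would need to be mapped back to the original document.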

Alternatively, CL could run a script that looks for this edge case through the whole corpus and just corrects it by adding a space after every comma within a citation (where there isn't already one), but then we'd need to also ensure the error doesn't creep back in as new documents are added.

mlissner commented 4 years ago

We should be able to just split on commas. I think that'd be a pretty easy fix. Worth checking indigo book or our own data for this pattern first though, I agree. And gotta figure out roman numerals maybe.

mlissner commented 3 years ago

@slbayer we have been pulling the citation finder out of CourtListener, and it now lives in this repository. It has undergone a lot of change lately to make it faster and more reliable.

I'm not sure if this bug still exists, but if you wanted to, you could add a few test cases here. If they still fail, you can set the expect_fail flag to True, and then check them in. (This is another new feature of eyecite.) With failing tests checked in, we'd be able to at least keep an eye on this.

jcushman commented 3 years ago

This initial edge case of Reeves v. Sanderson Plumbing Prods., 530 U.S. 133, 148, 120 S.Ct. 2097,147 L.Ed.2d 105 (2000). now appears to work correctly, I think because we don't assume that page numbers can contain commas by default:

In [1]: from eyecite import *

In [2]: get_citations("Reeves v. Sanderson Plumbing Prods., 530 U.S. 133, 148, 120 S.Ct. 2097,147 L.Ed.2d 105 (2000).")
Out[2]:
[FullCaseCitation(token=CitationToken(data='530 U.S. 133', start=37, end=49, volume='530', reporter='U.S.', page='133', exact_editions=(Edition(reporter=Reporter(short_name='U.S.', name='United States Supreme Court Reports', cite_type='federal', is_scotus=True), short_name='U.S.', start=datetime.datetime(1875, 1, 1, 0, 0), end=None),), variation_editions=(), short=False, extra_match_groups={}), index=5, reporter='U.S.', page='133', volume='530', canonical_reporter='U.S.', extra='148, 120 S.Ct. 2097 , 147 L.Ed.2d 105', defendant='Sanderson Plumbing Prods.,', plaintiff='Reeves', court='scotus', year=2000, reporter_found='U.S.', exact_editions=(Edition(reporter=Reporter(short_name='U.S.', name='United States Supreme Court Reports', cite_type='federal', is_scotus=True), short_name='U.S.', start=datetime.datetime(1875, 1, 1, 0, 0), end=None),), variation_editions=(), all_editions=(Edition(reporter=Reporter(short_name='U.S.', name='United States Supreme Court Reports', cite_type='federal', is_scotus=True), short_name='U.S.', start=datetime.datetime(1875, 1, 1, 0, 0), end=None),), edition_guess=Edition(reporter=Reporter(short_name='U.S.', name='United States Supreme Court Reports', cite_type='federal', is_scotus=True), short_name='U.S.', start=datetime.datetime(1875, 1, 1, 0, 0), end=None)),
 FullCaseCitation(token=CitationToken(data='120 S.Ct. 2097', start=56, end=70, volume='120', reporter='S.Ct.', page='2097', exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='S. Ct.', name="West's Supreme Court Reporter", cite_type='federal', is_scotus=True), short_name='S. Ct.', start=datetime.datetime(1882, 1, 1, 0, 0), end=None),), short=False, extra_match_groups={}), index=8, reporter='S. Ct.', page='2097', volume='120', canonical_reporter='S. Ct.', extra='147 L.Ed.2d 105', defendant='Sanderson Plumbing Prods., 530 U.S. 133 , 148,', plaintiff='Reeves', court='scotus', year=2000, reporter_found='S.Ct.', exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='S. Ct.', name="West's Supreme Court Reporter", cite_type='federal', is_scotus=True), short_name='S. Ct.', start=datetime.datetime(1882, 1, 1, 0, 0), end=None),), all_editions=(Edition(reporter=Reporter(short_name='S. Ct.', name="West's Supreme Court Reporter", cite_type='federal', is_scotus=True), short_name='S. Ct.', start=datetime.datetime(1882, 1, 1, 0, 0), end=None),), edition_guess=Edition(reporter=Reporter(short_name='S. Ct.', name="West's Supreme Court Reporter", cite_type='federal', is_scotus=True), short_name='S. Ct.', start=datetime.datetime(1882, 1, 1, 0, 0), end=None)),
 FullCaseCitation(token=CitationToken(data='147 L.Ed.2d 105', start=71, end=86, volume='147', reporter='L.Ed.2d', page='105', exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='L. Ed.', name="Lawyer's Edition", cite_type='federal', is_scotus=False), short_name='L. Ed. 2d', start=datetime.datetime(1956, 1, 1, 0, 0), end=None),), short=False, extra_match_groups={}), index=10, reporter='L. Ed. 2d', page='105', volume='147', canonical_reporter='L. Ed.', extra=None, defendant='Sanderson Plumbing Prods., 530 U.S. 133 , 148, 120 S.Ct. 2097 ,', plaintiff='Reeves', court=None, year=2000, reporter_found='L.Ed.2d', exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='L. Ed.', name="Lawyer's Edition", cite_type='federal', is_scotus=False), short_name='L. Ed. 2d', start=datetime.datetime(1956, 1, 1, 0, 0), end=None),), all_editions=(Edition(reporter=Reporter(short_name='L. Ed.', name="Lawyer's Edition", cite_type='federal', is_scotus=False), short_name='L. Ed. 2d', start=datetime.datetime(1956, 1, 1, 0, 0), end=None),), edition_guess=Edition(reporter=Reporter(short_name='L. Ed.', name="Lawyer's Edition", cite_type='federal', is_scotus=False), short_name='L. Ed. 2d', start=datetime.datetime(1956, 1, 1, 0, 0), end=None))]

It's possible for reporters to opt into pages-with-commas via the "regexes" field in reporters-db.

This one still reports "37 Taylor v" as a cite, though: 37 Taylor v. Bd. of Educ., 240 Fed. Appx. 717 (6th Cir. 2007)

If we had the data, we could apply a similar fix -- require reporters (or eventually even individual volumes) to opt into roman numeral page numbers instead of assuming any cite can have them. Maybe we could get the data for that statistically, once we've run eyecite on our collections -- just see which volumes have inbound roman numeral cites other than "v".
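The opt-in idea could look something like the sketch below. It does not use eyecite or reporters-db: the ROMAN_PAGE_REPORTERS set, its single placeholder entry and the is_valid_page function are hypothetical, and the roman numeral pattern is deliberately loose (it checks only the characters used, not that the numeral is well formed).

```python
import re

# Hypothetical opt-in list; in practice this data would have to come from
# the reporter database or from statistics over existing citations.
ROMAN_PAGE_REPORTERS = {"Example Rep."}

_ARABIC_PAGE = re.compile(r"\d+")
_ROMAN_PAGE = re.compile(r"[ivxlcdm]+", re.IGNORECASE)


def is_valid_page(reporter: str, page: str) -> bool:
    """Accept arabic pages for any reporter, roman pages only for opted-in ones."""
    if _ARABIC_PAGE.fullmatch(page):
        return True
    return reporter in ROMAN_PAGE_REPORTERS and _ROMAN_PAGE.fullmatch(page) is not None
```

Under this rule the "37 Taylor v" candidate is rejected because "Taylor" has not opted in, while "240 Fed. Appx. 717" is unaffected.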

mlissner commented 3 years ago

IIRC, the roman numeral thing was due to citations to the foreword, I think, or similar, so I could check our DB for which citations have that, but I'm not sure we'd want to only opt those in since in theory any volume with a foreword that uses roman numerals could be cited in the future. Hm.

freelawproject / eyecite

Avoiding overlapping citation extents in find_citations.py #18