freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

find_citations.py might overgenerate on addresses #1338

slbayer closed this issue 4 years ago

slbayer commented 4 years ago

Just recently, I've extracted the CourtListener citation finder into my own package in order to do citation finding outside CourtListener in arbitrary legal documents. I've discovered that the citation finder identifies patterns like "1211 SW 5th" as a citation. In all the cases in my corpus of documents where this pattern occurs (those which normalize to the S.W., N.W., N.E. and S.E. reporters), if the base citation ends in "th", it's actually an address and not a citation. These cases don't occur frequently, but they do occur.

To fix this, I inserted the following filter after the call to disambiguate_reporters() at line 633 in find_citations.py:

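    # Drop FullCitations whose canonical form looks like a street address,
    # e.g. "1211 SW 5th" normalizes to the S.W. reporter and ends in "th".
    address_reporters = ("N.E.", "N.W.", "S.E.", "S.W.")
    filtered = []
    for c in citations:
        if isinstance(c, FullCitation):
            pk = c.base_citation()
            if pk.endswith("th") and any(r in pk for r in address_reporters):
                continue  # address, not a citation
        # Non-full citations and non-address full citations are kept as-is.
        filtered.append(c)
    citations = filtered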

mlissner commented 4 years ago

Hey, thanks for filing this bug. It's gratifying to learn that the library is being used elsewhere.

A couple questions:

  1. Why doesn't this affect ones ending in nd or st as well?

  2. Why do this after disambiguation? Wouldn't it be better up around line 300, in extract_full_citation?

    If you do it there, you can do it closer to where you're doing your parsing and it would be less likely to catch something it shouldn't.

Can you say more about how you're using the code? We're always curious to know.

slbayer commented 4 years ago

Sorry for the delay - I seem not to have gotten a notification.

Believe it or not, it hadn't occurred to me to test nd and st - I was working from an error analysis a colleague provided. I've tested those suffixes in the same context, and yes, I get the same overgeneration.
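
A minimal sketch of the widened check, assuming the canonical base citation is a plain string such as "1211 S.W. 5th". The names `looks_like_address`, `ADDRESS_LIKE_REPORTERS`, and `ORDINAL_PAGE` are hypothetical and not part of find_citations.py:

```python
import re

# Reporter abbreviations that double as street directionals (from this thread).
ADDRESS_LIKE_REPORTERS = ("N.E.", "N.W.", "S.E.", "S.W.")

# A trailing ordinal such as "1st", "2nd", "3rd", or "5th".
ORDINAL_PAGE = re.compile(r"\d+(?:st|nd|rd|th)$", re.IGNORECASE)


def looks_like_address(base_citation: str) -> bool:
    """Return True if a canonical citation string is probably a street address."""
    has_directional = any(r in base_citation for r in ADDRESS_LIKE_REPORTERS)
    return has_directional and ORDINAL_PAGE.search(base_citation) is not None
```

With this, "1211 S.W. 5th" and "22 N.E. 2nd" would be rejected, while "410 S.W. 123" would still pass through as a citation.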

The reason to do it after disambiguation is that then you can execute the filter on the canonicalized citation, rather than deal with all the variants SW, S.W., etc.
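
To illustrate why the canonical form is easier to filter on, here is a rough sketch that collapses the spelling variants into one form. It is an assumption-laden stand-in: `CANONICAL` and `canonicalize_reporter` are hypothetical, and CourtListener's own disambiguation relies on its reporter data rather than a table like this.

```python
# Hypothetical lookup table covering only the four directional reporters.
CANONICAL = {"NE": "N.E.", "NW": "N.W.", "SE": "S.E.", "SW": "S.W."}


def canonicalize_reporter(raw: str) -> str:
    """Map variants like "SW", "S.W.", or "S. W." to a single canonical form."""
    # Strip periods and spaces so all variants share one lookup key.
    key = raw.replace(".", "").replace(" ", "").upper()
    # Unknown reporters are returned unchanged.
    return CANONICAL.get(key, raw)
```

Once every variant maps to "S.W.", a single substring test is enough; filtering before this step would need to enumerate each spelling.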

I don't disagree with your observation, though.

What are we doing? I probably can't say a lot, but we're doing some citation analysis in legal documents, not just court cases, but also documents filed by the parties.

I will also say that I, along with one of my colleagues, really wish that your inference tools, like the citation finder, were available separately from the CourtListener Django infrastructure. I can see how you're leveraging them together, but we had no need for the service infrastructure, and the extent to which I had to butcher the code to make it stand alone will make it somewhat difficult to track any changes in CL proper, sadly.

mlissner commented 4 years ago

> Believe it or not, it hadn't occurred to me to test nd and st

Fair enough; I forgot to mention 3rd too!

I tried writing this prior to disambiguation, but you're right, it was too much of a pain to identify the reporter consistently. You can see the final diff in: https://github.com/freelawproject/courtlistener/commit/616ae942802608c428e97f61461a764c083bf50f

We agree about pulling citation finding out of CL itself and into a library. We've wanted to do that for several years and will have to one day, but spending time on it doesn't move our mission forward much, unfortunately. Did you create a library when working on this? :)

slbayer commented 4 years ago

Yes; in fact, right now I'm refactoring it again because I need to encapsulate state to deal with another bug that I'll file as soon as I figure out what the right answer is.

Not sure I can offer back what I've done; but I'll check.