freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License

Link parallel cites #76

Open jcushman opened 3 years ago

jcushman commented 3 years ago

For a cite like "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)" we extract "1 U.S. 1" and "2 S. Ct. 2" as separate cites that both have the parenthetical "overruling ...". If you later report the parentheticals, they show up twice, and if you use a resolver that knows both cites are the same case, you double-count the weight of that citation. It would be good if we detected this and linked the two cites as parallel to each other, so that the weight and parenthetical are counted only once.

mlissner commented 3 years ago

We have a little (old, unfinished) code related to this here. It has a magic variable identifying the distance between two valid parallel citations, but I forget how carefully I analyzed that:

https://github.com/freelawproject/courtlistener/blob/3f038ff702050a193626bfba7372acf1fa167580/cl/citations/tasks.py#L19-L60

We also have a script that I never quite finished that was designed to find parallel citations (using the task above), build them into a weighted graph and then identify valid parallel citations according to the weight of the graph edges:

https://github.com/freelawproject/courtlistener/blob/2b6cae362b86e73201bb09be2c133653ca5b9d42/cl/citations/management/commands/cl_add_parallel_citations.py#L254-L270

Doing this upstream in eyecite makes sense to me.

mlissner commented 2 years ago

I had a lengthy comment above that was mostly off topic, but it looks like as we've added parentheticals to CourtListener, we've run headlong into this. (Happy to give beta access to anybody interested.)

I'm not sure what the fix is. Right now it's double counting the depth, like you say, @jcushman and it's adding the parentheticals to our DB multiple times as well. Not great!

I guess we could overhaul the API of eyecite to return citations that link these kinds of things together. For example, instead of returning a flat list of citations, we could return a list of lists of citations. Something like:

[
    {citations: [{reporter, page, volume}, {reporter, page, volume}], parenthetical: "Sky is blue"},
    {...},
]

That'd be a pretty big overhaul. Another approach would be to add a linkage attribute to subsequent citations, allowing our flat list to remain. Something like:

[
    {reporter, page, volume, parenthetical, parallel_to: null},
    {reporter, page, volume, parenthetical, parallel_to: 0},
]

If that's not clear, the parallel_to attribute would just point to the thing it's parallel to in the list of citations.
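A minimal sketch of how a caller might consume the flat list under this second approach, so that a parallel pair contributes one parenthetical and one unit of weight. The `Cite` dataclass and the `primary_cites` helper are hypothetical illustrations, not part of eyecite's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cite:
    # Hypothetical stand-in for a citation record; not eyecite's real class.
    volume: str
    reporter: str
    page: str
    parenthetical: Optional[str] = None
    parallel_to: Optional[int] = None  # index of the cite this one is parallel to


def primary_cites(cites: List[Cite]) -> List[Cite]:
    """Keep only cites that are not marked as parallel to an earlier one."""
    return [c for c in cites if c.parallel_to is None]


cites = [
    Cite("1", "U.S.", "1", "overruling ...", parallel_to=None),
    Cite("2", "S. Ct.", "2", "overruling ...", parallel_to=0),
]

# Only the first cite contributes a parenthetical and a unit of weight.
parentheticals = [c.parenthetical for c in primary_cites(cites)]
```

Note that a caller who does not know about `parallel_to` would still iterate over both entries and double count, which is the trade-off of keeping the list flat.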

Hm. The first approach feels more correct, but the second one sure seems simpler.

@mattdahl, maybe you have thoughts too?

jcushman commented 2 years ago

Quick answer with the brain I have available ... :)

Data representation

I like Mike's second approach fine. I think the first one centers an edge case too much: you shouldn't have to wade through that layer when most cites aren't parallel. A third option that I would strongly consider would be to only include the first cite in the output list, and attach the parallel cites to that:

[
  {reporter, page, volume, parenthetical, parallel_cites: [{reporter, page, volume}]}
]

This has the benefit that if you're a naive caller of the library who has no idea about parallel cites, you'll do a reasonable default thing of ignoring them instead of double counting anything. I'm not sure what data exactly belongs in parallel_cites but hopefully the answer would jump out during implementation.
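To make the third option concrete, here is a rough sketch using plain dataclasses. The class names and the choice of fields on `ParallelCite` are assumptions made for illustration; as noted above, what exactly belongs in `parallel_cites` is still open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParallelCite:
    # Hypothetical: only the fields that differ from the primary cite.
    volume: str
    reporter: str
    page: str
    pin_cite: Optional[str] = None


@dataclass
class Cite:
    # Hypothetical stand-in for a citation record; not eyecite's real class.
    volume: str
    reporter: str
    page: str
    parenthetical: Optional[str] = None
    parallel_cites: List[ParallelCite] = field(default_factory=list)


cites = [
    Cite(
        "1", "U.S.", "1",
        parenthetical="overruling ...",
        parallel_cites=[ParallelCite("2", "S. Ct.", "2")],
    ),
]

# A naive caller that ignores parallel_cites still counts the case once.
weight = len(cites)
parentheticals = [c.parenthetical for c in cites if c.parenthetical]
```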

(Note that besides doubled parentheticals, there's a couple of other weird things about the current situation, like the first cite has a pin_cite that contains the second cite, and the second cite has a title that contains the first cite. So throwing away the second cite entirely rather than just tagging the second one as special somehow is appealing.)

Implementation

OK, the underlying scenario we're trying to detect is this: if you have tokens like <CitationToken>, <optional pin cite>, <CitationToken>, <optional pin cite>, <CitationToken>..., the subsequent citation tokens are parallel cites and should be attached to the first one somehow instead of being processed from scratch. (At least we hope that's true; if someone writes "see cases such as 1 Foo 1, 2 Bar 2, and others", this'll misfire.)

So where do we detect that? A place I can think of to handle that is in add_post_citation: https://github.com/freelawproject/eyecite/blob/aaf0a203f58db2ec365b1ccb06ad9745bf931f4f/eyecite/helpers.py#L76

If we get to citation.metadata.pin_cite = clean_pin_cite(m["pin_cite"]) or None and the m["pin_cite"] was built from tokens that include CitationTokens, then that's our special case. Assuming we can detect it somehow at that point, we can then set citation.metadata.pin_cite to just the part up to the first CitationToken, and set some temporary data on citation to let us know downstream to consume the next CitationTokens as a parallel cite instead.

This answer is a little fiddly because we don't know what tokens m["pin_cite"] was built from. Not sure what's right here -- maybe re-tokenize m["pin_cite"] if that's reasonably fast. Or, before you even do the match_on_tokens(), scan out for (say) 10 tokens and if you find a CitationToken, see if the tokens up to the citation token are a pin cite, and if so that's the special case. I don't love any of that but one of those should work.
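The scan-ahead idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not eyecite's tokenizer: `CitationToken` here is a hypothetical stripped-down class, plain words are bare strings, and `PIN_CITE_WORD` is a deliberately narrow guess at what a pin cite looks like (commas, page numbers, page ranges).

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CitationToken:
    # Hypothetical simplified token; eyecite's real CitationToken carries more.
    text: str


@dataclass
class CiteGroup:
    primary: CitationToken
    parallel: List[CitationToken] = field(default_factory=list)


Token = Union[str, CitationToken]

# Assumed, narrow notion of a pin cite word: a comma, a page, or a page range.
PIN_CITE_WORD = re.compile(r"^(,|\d+(-\d+)?,?)$")


def group_parallel(tokens: List[Token], max_lookahead: int = 10) -> List[CiteGroup]:
    """Attach a CitationToken to the previous one when only pin-cite-like
    words (at most max_lookahead of them) separate the two."""
    groups: List[CiteGroup] = []
    gap: List[str] = []  # plain words seen since the last CitationToken
    for tok in tokens:
        if isinstance(tok, CitationToken):
            is_parallel = (
                bool(groups)
                and len(gap) <= max_lookahead
                and all(PIN_CITE_WORD.match(w) for w in gap)
            )
            if is_parallel:
                groups[-1].parallel.append(tok)
            else:
                groups.append(CiteGroup(primary=tok))
            gap = []
        else:
            gap.append(tok)
    return groups


tokens: List[Token] = [
    "Foo", "v.", "Bar,", CitationToken("12 Mass. 34"), ",", "35,",
    CitationToken("56 N.E.2d 78"), ",", "79", "(Mass.", "1999)",
    "See", CitationToken("1 U.S. 1"),
]
groups = group_parallel(tokens)
```

Here the N.E.2d cite is attached to the Mass. cite, while the later U.S. cite starts a new group because non-pin-cite words intervene. A list of unrelated cases separated only by commas would still be grouped together, which is the misfire case mentioned above.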

Since citations are annoying, some special cases to consider ...

mattdahl commented 2 years ago

My take: The identification of parallel citations cannot occur before resolution occurs. Otherwise, we run into the problem that Jack mentioned -- citations that are right next to each other but that refer to different cases. The only way to logically differentiate them is to wait until the citations are resolved, and then check whether they resolve to the same resource or not to determine whether they're actually parallel citations or not.

So I would argue that the return output of get_citations() should not change at all. get_citations() should be unopinionated on this point -- it should just return a list of every citation in an opinion without attempting to aggregate any possible duplicates.

If the user needs to deal with parallel citations, they should then run resolve_citations() as normal. Then I would propose that we implement a new function -- prune_parallel_citations() or something; this could also be configured as a step in the resolve_citations() function itself -- that goes through the list after resolutions have been made and deletes/aggregates the parallel citations in each resource group. To detect parallelness at this stage, we can just check whether each citation's span() falls within some (configurable) distance of another citation's span() in the same resource group. Because resolutions have already been made, we'll then know for sure that those citations are in fact parallel, and not just nearby different citations.

The downside to this approach is that it depends on the user having a competent backend for doing resolutions. This is not a problem for CL and CAP, but the default implementation of resolve_citations() will obviously not recognize citations from different reporters as resolving to the same resource (even if they're true parallel citations). However, I'm okay with this, because I think the user ultimately needs to grapple with the fact that the identification of parallel citations cannot conceptually happen before the citations are resolved, so they need to deal with making sure their resolutions are sane first.
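A rough sketch of what such a post-resolution step could look like. Everything here is an assumption for illustration: the `Cite` dataclass, the use of plain strings as resource keys, the `max_gap` default, and the return shape are all hypothetical, and only the name prune_parallel_citations() comes from the proposal above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cite:
    # Hypothetical simplified citation: matched text plus (start, end) offsets.
    text: str
    span: Tuple[int, int]


def prune_parallel_citations(
    resolutions: Dict[str, List[Cite]], max_gap: int = 20
) -> Dict[str, List[List[Cite]]]:
    """Within each resolved resource, cluster cites whose spans are at most
    max_gap characters apart; each cluster is one logical citation."""
    pruned: Dict[str, List[List[Cite]]] = {}
    for resource, cites in resolutions.items():
        clusters: List[List[Cite]] = []
        for cite in sorted(cites, key=lambda c: c.span[0]):
            if clusters and cite.span[0] - clusters[-1][-1].span[1] <= max_gap:
                clusters[-1].append(cite)  # close to the previous cite: parallel
            else:
                clusters.append([cite])  # a separate reference to the same case
        pruned[resource] = clusters
    return pruned


resolutions = {
    "foo-v-bar": [
        Cite("12 Mass. 34", (20, 31)),
        Cite("56 N.E.2d 78", (37, 49)),
        Cite("12 Mass. 34", (400, 411)),
    ],
}
pruned = prune_parallel_citations(resolutions)
```

In this example the first two cites collapse into one cluster, while the later repeat of "12 Mass. 34" stays a separate reference to the same case. As the comment above says, this only works if the resolver already maps both reporters to the same resource.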

Happy to discuss further!

mlissner commented 2 years ago

I think I agree with Matt in terms of when this should happen. That's sort of what we've got right now when searching for parentheticals. The way it works now is that if two citations have the same parenthetical, they're the same and we call that good enough, but we could do better, I'm sure.

jcushman commented 2 years ago

> My take: The identification of parallel citations cannot occur before resolution occurs.

I don't want this to be the answer, though that doesn't mean it isn't. :)

I don't like it because it makes our parsing process so different from the intuitive order for extracting metadata about a cite, which will lead to lots of messiness. Like if the source text is:

... blah blah blah. Foo v. Bar, 12 Mass. 34, 35, 56 N.E.2d 78, 79 (Mass. 1999) (holding blah). blah blah blah ...

A human would see a single citation with title="Foo v. Bar", cite="12 Mass. 34", parenthetical="holding blah", etc., and would just see "56 N.E.2d 78, 79" as a bit of extra metadata for the cite.

We currently come out with two complete cites with overlapping parses, leading to messy metadata:

[
  FullCaseCitation('12 Mass. 34', ... metadata=(parenthetical='holding blah', pin_cite='35', year='1999', plaintiff='Foo', defendant='Bar', extra='56 N.E.2d 78, 79', ...)),
  FullCaseCitation('56 N.E.2d 78', metadata=(parenthetical='holding blah', pin_cite='79', year='1999', plaintiff='Foo', defendant='Bar, 12 Mass. 34, 35', ...))
]

The second cite in particular is messy (defendant='Bar, 12 Mass. 34, 35') because it doesn't know that it isn't really a complete cite on its own, just a bit of metadata for a previous cite, so it can't hope to work out what's going on around it. If we rely on resolution for deleting these, we'll end up with these messy confused cites any time we don't know all of the cites for a case, which for CAP at least is common. FLP might be better on knowing parallel cites for each case, not sure how close it is to 100%.

Thinking about alternatives, there probably aren't that many legitimate reporter pairs that appear as parallel cites -- could we list them all and add them to reporters-db? If FLP does have pretty good parallel cite info, maybe one could dump all the allowed patterns (Mass. + N.E.2d, Mass. + Tyng etc.) and only treat pairs as parallel if they're on the list. I'd think 12 Mass. 34, 56 N.E.2d 78 would rarely be two separate cases, particularly because it'd be an odd mix of citation styles if it were.
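A minimal sketch of the allowlist idea, assuming the pairs are stored as unordered sets of reporter abbreviations. The pairs shown are just the ones mentioned in this thread; a real list would have to be derived from data such as a parallel-cite export, and `ALLOWED_PARALLEL_PAIRS` and `may_be_parallel` are hypothetical names, not anything in reporters-db.

```python
from typing import FrozenSet, Set

# Hypothetical allowlist; real contents would come from data, not be hand-written.
ALLOWED_PARALLEL_PAIRS: Set[FrozenSet[str]] = {
    frozenset({"Mass.", "N.E.2d"}),
    frozenset({"Mass.", "Tyng"}),
    frozenset({"U.S.", "S. Ct."}),
}


def may_be_parallel(reporter_a: str, reporter_b: str) -> bool:
    """True if the two reporters are a known parallel-cite combination.

    frozenset makes the lookup order-independent, and two cites from the
    same reporter collapse to a one-element set that never matches.
    """
    return frozenset({reporter_a, reporter_b}) in ALLOWED_PARALLEL_PAIRS
```

With this, `may_be_parallel("N.E.2d", "Mass.")` is true regardless of argument order, while a pair that is not on the list is treated as two separate cites.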

bbernicker commented 2 years ago

Just wanted to jump in to note that, however we decide to handle parallel case citations, we should probably mirror that treatment for parallel statutory (freelawproject/reporters-db#116 and Ind. R16.1.5) and treaty citations (freelawproject/reporters-db#48 and BB R21.4).

mattdahl commented 1 year ago

Reading this again a year later, I now think I agree with @jcushman. I think in practice my previous proposal is intractable because resolution is too inaccurate (especially for the user running locally without e.g. the FLP database).

I also like the idea of using a list of allowed parallel citation combinations as a detection heuristic. Relatedly, is there any way to create a mapping between reporter volumes and years using the FLP data? If we knew that two citations (1) are physically close to each other, (2) are from reporters that are known to be a valid combination, and (3) are published in volumes with nearby years, I think we could be pretty confident that the citations are indeed parallel citations and we could collapse them appropriately.
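The three checks could be combined roughly like this. It is a sketch under stated assumptions: `Cite`, `ALLOWED_PAIRS`, `VOLUME_YEARS`, and the thresholds are all hypothetical, and the table values are made-up placeholders that mirror the "12 Mass. 34, 56 N.E.2d 78 (Mass. 1999)" example from earlier in the thread rather than real reporter data.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass
class Cite:
    # Hypothetical simplified citation record.
    volume: int
    reporter: str
    span: Tuple[int, int]  # (start, end) character offsets in the source text


# Illustrative placeholder data only; real tables would come from an export.
ALLOWED_PAIRS: Set[FrozenSet[str]] = {frozenset({"Mass.", "N.E.2d"})}
VOLUME_YEARS: Dict[Tuple[str, int], int] = {
    ("Mass.", 12): 1999,
    ("N.E.2d", 56): 1999,
}


def looks_parallel(a: Cite, b: Cite, max_gap: int = 20, max_year_diff: int = 1) -> bool:
    """Apply the three checks: proximity, known reporter pair, nearby years."""
    close = 0 <= b.span[0] - a.span[1] <= max_gap
    known_pair = frozenset({a.reporter, b.reporter}) in ALLOWED_PAIRS
    year_a: Optional[int] = VOLUME_YEARS.get((a.reporter, a.volume))
    year_b: Optional[int] = VOLUME_YEARS.get((b.reporter, b.volume))
    # Assumption: a missing year is treated as "no evidence against".
    years_ok = year_a is None or year_b is None or abs(year_a - year_b) <= max_year_diff
    return close and known_pair and years_ok


first = Cite(12, "Mass.", (20, 31))
nearby = Cite(56, "N.E.2d", (37, 49))
far_away = Cite(56, "N.E.2d", (400, 412))
```

Under these assumptions, `looks_parallel(first, nearby)` passes all three checks and `looks_parallel(first, far_away)` fails on proximity. Treating an unknown year as acceptable is one possible design choice; requiring both years to be known would be stricter but would reject more true parallel cites.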

mlissner commented 1 year ago

> Relatedly, is there any way to create a mapping between reporter volumes and years using the FLP data?

Well, maybe. We have a lot of parallel cites in the DB, but not all of them. We could probably export that and you'd have something.

> are published in volumes with nearby years

We don't always have years for reporters, unfortunately, but we could probably get them for the most important ones.

> Reading this again a year later, I now think I agree with @jcushman.

Yeah, me too. I think it makes sense that the whole string is really a citation object, even if it references multiple books.

Let us know if you want to try to get a database export, Matt.