freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License

Getting full citation span #135

Open overmode opened 1 year ago

overmode commented 1 year ago

Hi, thank you for the great library !

Problem description

I am preparing a dataset in which I would like to mask some citations, e.g. by replacing them with "[CITATION]". I could not find a way to get the full span of a citation: only the normalized part is covered by the built-in span() function (see below).

import eyecite

citations = [
    'Commonwealth v. Gibson, 561 A.2d 1240 1242',
    'Commonwealth v. Bauer, 604 A.2d 1098 (Pa.Super. 1992)'
]

for citation in citations:
    print('\n', '='*20)
    extracted_citation = eyecite.get_citations(citation)[0]
    print(extracted_citation)

    start_idx = extracted_citation.span()[0]
    end_idx = extracted_citation.span()[1]

    before_cit = citation[:start_idx]
    cit_text = citation[start_idx:end_idx]
    after_cit = citation[end_idx:]
    print(f"{before_cit} [BEGIN] {cit_text} [END] {after_cit}")

output :

====================
FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
Commonwealth v. Gibson,  [BEGIN] 561 A.2d 1240 [END]  1242

 ====================
FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Commonwealth v. Bauer,  [BEGIN] 604 A.2d 1098 [END]  (Pa.Super. 1992)

As you can see, the span only partially covers the citation text. If possible, I would like to avoid using regex to recover the full span. Adding up the lengths of the citation's attributes (plaintiff, defendant, etc.) does not seem viable either, because in the second example the "Pa.Super." text is not captured in any attribute.

Desired behavior

It would be nice to have a 'full_span()' function such that, if I use it instead of span() in the above example, I get

====================
FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
 [BEGIN]Commonwealth v. Gibson, 561 A.2d 1240 1242[END]

 ====================
FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
[BEGIN]Commonwealth v. Bauer,  604 A.2d 1098 (Pa.Super. 1992)[END] 
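
As a rough illustration of the masking step described above, here is a minimal sketch. It assumes you already have (start, end) character offsets for each citation, for example from span() or from the proposed full_span(). The helper name mask_spans and the hard-coded offsets are made up for this example and are not part of eyecite.

```python
from typing import Iterable, Tuple


def mask_spans(
    text: str,
    spans: Iterable[Tuple[int, int]],
    token: str = "[CITATION]",
) -> str:
    """Replace each (start, end) character span in `text` with `token`.

    Hypothetical helper, not an eyecite function. Spans are assumed
    not to overlap.
    """
    # Work from right to left so earlier offsets stay valid
    # after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text


text = "Commonwealth v. Gibson, 561 A.2d 1240 1242"

# Offsets of "561 A.2d 1240", i.e. what span() covers in the example above.
print(mask_spans(text, [(24, 37)]))  # Commonwealth v. Gibson, [CITATION] 1242

# Offsets of the whole string, i.e. what a full span would cover.
print(mask_spans(text, [(0, len(text))]))  # [CITATION]
```

With only the partial span, the case name and pin cite leak into the masked text, which is why a full span matters for this use case.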

Specs

eyecite version : 2.4.0

flooie commented 1 year ago

Hey @overmode

Thanks for the write up. There is a method for FullCaseCitations called corrected_citation_full

It returns the full normalized string.

from eyecite import get_citations

citations = [
    'the asdf asdf the asdfa sd Commonwealth v. Gibson, 561 A.2d 1240 1242 asdf asdf asdf ',
    'Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)'
]
for text in citations:
    # Print the full normalized string for the first citation found
    print(get_citations(text)[0].corrected_citation_full())
When you run it, it provides the full citation including names, but I believe there is a bug in it when it uses dates and courts.

If you wanted to take a look at eyecite.models.FullCaseCitation.corrected_citation_full and fix the bug related to date and court, it would return something like

Commonwealth v. Gibson, 561 A.2d 1240
Commonwealth v. Bauer, 604 A.2d 1098 (pasuperct 1992)

for the example above.

mlissner commented 1 year ago

Is the idea, @overmode, to remove all citations to make it better training data?

mlissner commented 1 year ago

One other thing to know, @overmode, is that the way we identify the name of the case is very sloppy. It just uses heuristics around where it finds a v., if it finds one; otherwise it just grabs the average length of a case name, I think. It's hardcoded at around 30 tokens, IIRC.

overmode commented 1 year ago

Hey, thanks for the quick reply. @mlissner Indeed, the idea is to build a training set for some machine learning application.

I took note of your method. It's OK if the recall of citation extraction is not excellent, because I have many documents anyway, but I will need a way to tell whether the parsing went well, so that I at least get good precision.

@flooie I tried the eyecite.models.FullCaseCitation.corrected_citation_full method, and it does break on the second example:

====================
EXTRACTED : FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
CORRECTED_CITATION_FULL : Commonwealth v. Gibson, 561 A.2d 1240, 1242
CITATION SPAN : Commonwealth v. Gibson,  [BEGIN] 561 A.2d 1240 [END]  1242

 ====================
EXTRACTED : FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Error executing job with overrides: []
Traceback (most recent call last):
  File "check_samples.py", line 59, in main
    print('CORRECTED_CITATION_FULL :', extracted_citation.corrected_citation_full())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in corrected_citation_full
    publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in <genexpr>
    publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
TypeError: 'Metadata' object is not subscriptable

This is not exactly what I would like, though, because it is not the exact text that was matched (notice the added comma between the page numbers). Is there a better way to recover the exact matched text?

Also, is this the bug you pointed out? I'm open to making a PR in case there is no better workaround, so I would appreciate any insights you already have.

[UPDATE] I fixed the bug by replacing that line with publisher_date = " ".join(i for i in (m.court, m.year) if i). The extracted full citation for the second example then becomes: Commonwealth v. Bauer, 604 A.2d 1098 (1992

The parenthesis is not closed because in eyecite.models, line 362, we have

if publisher_date:
    parts.append(f" ({publisher_date}")

I assume that a parenthesis is missing at the end. Does it make sense for the Pa. Super. not to be included here ?
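
To make the two fixes above concrete, here is a minimal standalone sketch. The function format_court_year is a stand-in written for this example and is not eyecite's actual code; it only mirrors the two lines quoted from the traceback and from line 362.

```python
from typing import List, Optional


def format_court_year(court: Optional[str], year: Optional[str]) -> str:
    """Build the trailing " (court year)" part of a citation string.

    Stand-in for the lines discussed above, not eyecite's actual code.
    """
    # Join the values directly. Indexing the metadata object with them
    # (m[i]) is what raised the TypeError in the traceback.
    publisher_date = " ".join(i for i in (court, year) if i)
    # Close the parenthesis that the original f-string left open.
    return f" ({publisher_date})" if publisher_date else ""


parts: List[str] = ["Commonwealth v. Bauer, 604 A.2d 1098"]
parts.append(format_court_year(None, "1992"))
print("".join(parts))  # Commonwealth v. Bauer, 604 A.2d 1098 (1992)
```

The court is still missing from the output here because it was extracted as None, which is a separate problem from the formatting bug.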

mattdahl commented 1 year ago

Just chiming in here since I saw your PR (#136) and was surprised that this wasn't already possible! Thanks for implementing it!

Separate from your changes in the PR, I was also curious about the court issue. It seems that the Pa.Super. is not being extracted properly because the citation_string listed for the PA Superior Court is "Pa. Super. Ct." (line 46902 here: https://github.com/freelawproject/courts-db/blob/main/courts_db/data/courts.json). The problem is the space between the Pa. and the Super.. This also seems like something that should be fixed -- would it cause problems to just ignore whitespace when matching court abbreviations here: https://github.com/freelawproject/eyecite/blob/main/eyecite/helpers.py#L52? May be related to the changes proposed in #129

overmode commented 1 year ago

@mattdahl Thanks ! Maybe we should consider moving away from exact string matching and use simple regex instead ? For instance in r'\s*pa\s*super\s*', we would not be dependent on the spacing, and we could also make it robust to punctuation. I don't think it would hurt speed a lot

flooie commented 1 year ago

@overmode every time I see the words simple and regex I get nervous. I'm not sure I see how this relatively simple situation is resolved with regex.

overmode commented 1 year ago

I understand, regexes are powerful but scale badly. Well, the equivalent in plain Python here would be to remove all punctuation and spaces, and then look for 'pasuper'. I think the question was more whether this would fail in some corner cases, and you are much more knowledgeable than I am.
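
The "remove all punctuation and spaces" idea could be sketched roughly as follows. The function name normalize_court_string is hypothetical and this is not how eyecite currently compares court strings; it only shows that the spacing difference disappears after normalization.

```python
import re


def normalize_court_string(s: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


print(normalize_court_string("Pa.Super."))       # pasuper
print(normalize_court_string("Pa. Super."))      # pasuper
print(normalize_court_string("Pa. Super. Ct."))  # pasuperct
```

Whether the normalized text should then match the database string by equality or by prefix is a separate decision that this sketch does not settle.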

mlissner commented 1 year ago

For the court issue, the question is essentially, "What bad things will happen if we broaden how we match court strings against the text?"

Honestly, I don't think anybody knows. Right now we do two things. We:

If we went a step further and matched with regexes or by taking out whitespace, would we have false matches? I don't know, but I know how to check!

If we want to run this down, I think the trick is to look at the citation_string values for every court in the courts DB and see what happens if you strip out spaces in addition to stripping out punctuation. I think it might be fine, but what we'd want to watch out for are two courts with nearly identical citation strings that start to overlap because of this. If that analysis turns up no new collisions, I'd say yeah, let's add a third step to how we normalize and compare citation strings.
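
The analysis described above might look something like the sketch below. All names and the sample records are invented for illustration (they are not real courts-db entries); in practice the list would be loaded from courts.json. It groups courts by normalized citation string under both schemes and reports the groups that collide only once whitespace is removed.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


def strip_punct(s: str) -> str:
    """Current-style normalization (assumed): drop punctuation only."""
    return re.sub(r"[^\w\s]", "", s).lower().strip()


def strip_punct_and_space(s: str) -> str:
    """Proposed normalization: drop punctuation and whitespace."""
    return re.sub(r"\W", "", s).lower()


def colliding_groups(
    courts: List[Dict[str, str]], normalize: Callable[[str], str]
) -> Set[Tuple[str, ...]]:
    """Return the sets of court ids that share one normalized string."""
    groups = defaultdict(set)
    for court in courts:
        groups[normalize(court["citation_string"])].add(court["id"])
    return {tuple(sorted(ids)) for ids in groups.values() if len(ids) > 1}


# Made-up sample records, for illustration only.
sample_courts = [
    {"id": "x1", "citation_string": "N.Y. Cty. Ct."},
    {"id": "x2", "citation_string": "N.Y. Cty. Ct."},
    {"id": "x3", "citation_string": "Pa. Super. Ct."},
    {"id": "x4", "citation_string": "Pa.Super.Ct."},
]

before = colliding_groups(sample_courts, strip_punct)
after = colliding_groups(sample_courts, strip_punct_and_space)

print("existing collisions:", sorted(before))   # [('x1', 'x2')]
print("new collisions:", sorted(after - before))  # [('x3', 'x4')]
```

An empty "new collisions" result on the real data would support adding the whitespace-stripping step; a non-empty "existing collisions" result points at duplicates that need cleaning up regardless.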

mattdahl commented 1 year ago

Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae

It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

mlissner commented 1 year ago

However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

Yeah, that jumped out at me too. @flooie what's your take on that?

flooie commented 1 year ago

Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae

It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

[Image: Screenshot 2023-01-25 at 2 42 41 PM]

flooie commented 1 year ago

@mattdahl - we had imported a lot of courts that were low-level county and town courts, and in NY a few of the courts had been generated with the parent citation string.

For example, New York County Court has 50+ county courts, and they were generated with N.Y. Cty. Ct. as the citation string instead of NY Cty. Ct., Suffolk Cty. ... etc. I went through and fixed the 100 or so collisions.

mattdahl commented 1 year ago

Nice!! The only duplicate left is N.Y. Cty. Ct., Nassau Cty. -- is that intentional?

flooie commented 1 year ago

No, ha, that's just a duplicate court. I'll strip that in a second. I have a few things to add about courts and citation strings. I'll add them momentarily.