Open overmode opened 1 year ago
Hey @overmode
Thanks for the write up. There is a method for FullCaseCitations called corrected_citation_full. It returns the full normalized string.
from eyecite import get_citations

citations = [
    'the asdf asdf the asdfa sd Commonwealth v. Gibson, 561 A.2d 1240 1242 asdf asdf asdf ',
    'Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)',
]
for cite in citations:
    # Take the first citation found in each string and print its normalized form
    print(get_citations(cite)[0].corrected_citation_full())
When you run it, it provides the full citation including names, but I believe there is a bug in it when it uses dates and courts.
If you wanted to take a look at eyecite.models.FullCaseCitation.corrected_citation_full and fix the bug related to date and court, it would return something like
Commonwealth v. Gibson, 561 A.2d 1240
Commonwealth v. Bauer, 604 A.2d 1098 (pasuperct 1992)
for the example above.
Is the idea, @overmode, to remove all citations to make it better training data?
One other thing to know, @overmode, is that the way we identify the name of the case is very sloppy. It just uses heuristics around where it finds a v., if it finds one, and otherwise just grabs the average length of a case name, I think. It's hardcoded around 30 tokens, IIRC.
Hey, thanks for the quick reply. @mlissner Indeed, the idea is to build a training set for some machine learning application.
I took note of your method, it's ok if the recall of citation extraction is not excellent because I have many documents anyway, but I will need a way to tell whether the parsing went well to at least have a good precision.
@flooie I tried the eyecite.models.FullCaseCitation.corrected_citation_full method; it does break on the second example:
====================
EXTRACTED : FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
CORRECTED_CITATION_FULL : Commonwealth v. Gibson, 561 A.2d 1240, 1242
CITATION SPAN : Commonwealth v. Gibson, [BEGIN] 561 A.2d 1240 [END] 1242
====================
EXTRACTED : FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Error executing job with overrides: []
Traceback (most recent call last):
File "check_samples.py", line 59, in main
print('CORRECTED_CITATION_FULL :', extracted_citation.corrected_citation_full())
File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in corrected_citation_full
publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in <genexpr>
publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
TypeError: 'Metadata' object is not subscriptable
This is not exactly what I would like, though, because it is not the exact text that was matched (notice the added comma between the page numbers). Is there a better way to recover the original text?
Also, is this the bug you pointed out? I'm open to making a PR if there is no better workaround, so I would appreciate any insights you can share.
[UPDATE]
I fixed the bug by replacing the line with publisher_date = " ".join(i for i in (m.court, m.year) if i)
The extracted full citation for the second example becomes
Commonwealth v. Bauer, 604 A.2d 1098 (1992
The parenthesis is not closed because in eyecite.models, line 362, we have
if publisher_date:
    parts.append(f" ({publisher_date}")
I assume that a closing parenthesis is missing at the end. Does it make sense for the Pa. Super. not to be included here?
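To make the two fixes concrete, here is a minimal standalone sketch of the string assembly. It is not the library's code: the Metadata dataclass and format_full_citation below are hypothetical stand-ins that only mirror the fields and the two lines quoted in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    # Hypothetical stand-in for FullCaseCitation.Metadata, reduced to the
    # fields that show up in the output quoted above.
    plaintiff: Optional[str] = None
    defendant: Optional[str] = None
    pin_cite: Optional[str] = None
    court: Optional[str] = None
    year: Optional[str] = None


def format_full_citation(base: str, m: Metadata) -> str:
    """Assemble 'Plaintiff v. Defendant, base, pin (court year)'."""
    parts = []
    if m.plaintiff:
        parts.append(f"{m.plaintiff} v. ")
    if m.defendant:
        parts.append(f"{m.defendant}, ")
    parts.append(base)
    if m.pin_cite:
        parts.append(f", {m.pin_cite}")
    # Fix 1: join the values directly; m[i] raises TypeError because the
    # metadata object is not subscriptable.
    publisher_date = " ".join(i for i in (m.court, m.year) if i)
    if publisher_date:
        # Fix 2: close the parenthesis.
        parts.append(f" ({publisher_date})")
    return "".join(parts)


bauer = Metadata(plaintiff="Commonwealth", defendant="Bauer", year="1992")
print(format_full_citation("604 A.2d 1098", bauer))
```

With court=None the court is simply skipped, which is why only the year appears for the second example until the court lookup itself is fixed.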
Just chiming in here since I saw your PR (#136) and was surprised that this wasn't already possible! Thanks for implementing it!
Separate from your changes in the PR, I was also curious about the court issue. It seems that the Pa.Super. is not being extracted properly because the citation_string listed for the PA Superior Court is "Pa. Super. Ct." (line 46902 here: https://github.com/freelawproject/courts-db/blob/main/courts_db/data/courts.json). The problem is the space between the Pa. and the Super. This also seems like something that should be fixed. Would it cause problems to just ignore whitespace when matching court abbreviations here: https://github.com/freelawproject/eyecite/blob/main/eyecite/helpers.py#L52? May be related to the changes proposed in #129.
@mattdahl Thanks !
Maybe we should consider moving away from exact string matching and use a simple regex instead?
For instance, with r'\s*pa\s*super\s*', we would not be dependent on the spacing, and we could also make it robust to punctuation.
I don't think it would hurt speed a lot.
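As a rough illustration of that idea, a pattern along these lines tolerates both missing spaces and missing periods. The pattern and the helper name are hypothetical, cover this one court only, and are not part of eyecite.

```python
import re

# Hypothetical pattern for the Pennsylvania Superior Court only: optional
# periods, optional whitespace, and an optional trailing "Ct."
PA_SUPER = re.compile(r"\bpa\.?\s*super\.?(?:\s*ct\.?)?", re.IGNORECASE)


def looks_like_pa_super(paren_text: str) -> bool:
    """Return True if the parenthetical text mentions the court."""
    return PA_SUPER.search(paren_text) is not None


for sample in ("Pa. Super. 1992", "Pa.Super.", "Pa. Super. Ct.", "2d Cir. 1992"):
    print(sample, looks_like_pa_super(sample))
```

The catch is that one such pattern would be needed per court, which is where the maintenance cost comes from.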
@overmode every time I see the words simple and regex I get nervous. I'm not sure I see how this relatively simple situation is resolved with regex.
I understand, regexes are powerful but scale badly. Well, the equivalent in Python here would be to remove all punctuation and spaces, and then look for 'pasuper'. I think the question was more whether it would fail in some corner cases, and you are much more knowledgeable than I am.
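A minimal sketch of that regex-free variant, using only the standard library. The normalize helper is a hypothetical name, not an eyecite function.

```python
import string

# Translation table that deletes every ASCII punctuation and whitespace character.
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize(s: str) -> str:
    """Lowercase the string and remove punctuation and whitespace."""
    return s.translate(_DROP).lower()


# Both spellings collapse to the same key.
print(normalize("Pa. Super."))      # pasuper
print(normalize("Pa.Super."))       # pasuper
# The courts-db citation string still shares that prefix.
print(normalize("Pa. Super. Ct."))  # pasuperct
print("pasuper" in normalize("(Pa. Super. 1992)"))  # True
```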
For the court issue, the question is essentially, "What bad things will happen if we broaden how we match court strings against the text?"
Honestly, I don't think anybody knows. Right now we do two things. We: string_puc, and we startswith to strip terminal periods, which sometimes seem to interfere.
If we went a step further and matched with regexes or by taking out whitespace, would we have false matches? I don't know, but I know how to check!
If we want to run this down, I think the trick is to look at the citation_string values for every court in the courts DB and see what happens if you strip out spaces in addition to stripping out punctuation. I think it might be fine, but what we'd want to watch out for are two courts with nearly identical citation strings that overlap due to this. If that analysis turns up no collisions, I'd say yeah, let's add a third step to how we normalize and compare citation strings.
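The check described above could be sketched roughly like this. It assumes each court record is a dict with "id" and "citation_string" keys; the sample records and their ids are made up for illustration and are not taken from courts.json.

```python
import string
from collections import defaultdict


def find_collisions(courts, strip_whitespace):
    """Group court ids by normalized citation string; keep groups of 2+."""
    drop = string.punctuation + (string.whitespace if strip_whitespace else "")
    table = str.maketrans("", "", drop)
    groups = defaultdict(set)
    for court in courts:
        key = court["citation_string"].translate(table).lower()
        if key:  # skip courts with no citation string
            groups[key].add(court["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}


def new_collisions(courts):
    """Collisions that appear only once whitespace is stripped as well."""
    before = {frozenset(ids) for ids in find_collisions(courts, False).values()}
    after = find_collisions(courts, True)
    return {k: ids for k, ids in after.items() if frozenset(ids) not in before}


# Hypothetical sample data.
sample_courts = [
    {"id": "a", "citation_string": "Pa. Super. Ct."},
    {"id": "b", "citation_string": "Pa.Super.Ct."},
    {"id": "c", "citation_string": "N.Y. Cty. Ct."},
    {"id": "d", "citation_string": "N.Y. Cty. Ct."},
]
print(new_collisions(sample_courts))
```

Here "c" and "d" already collide with punctuation stripping alone, so only the "a"/"b" pair is reported as new. An empty result on the real data would be the signal that stripping whitespace is safe.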
Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae
It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.
However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.
Yeah, that jumped out at me too. @flooie what's your take on that?
@mattdahl - we had imported a lot of courts that were low-level county and town courts, and in NY a few of the courts had been generated with the parent citation string.
For example, New York County Court has like 50+ county courts, and they were generated with N.Y. Cty. Ct. as the citation string instead of NY Cty. Ct., Suffolk Cty. ... etc. I went through and fixed the 100 or so collisions.
Nice!! The only duplicate left is N.Y. Cty. Ct., Nassau Cty. Is that intentional?
No, ha, that's just a duplicate court. I'll strip that in a second. I have a few things to add about courts and citation strings. I'll add them momentarily.
Hi, thank you for the great library !
Problem description
I am preparing a dataset in which I would like to mask some citations, e.g. by replacing them with "[CITATION]". I could not find a way to get the full span of the citation. Indeed, only the normalized part is covered by the builtin span() function (see below)
output :
One can see that the span only partially covers the citation text. If possible, I would like to avoid using regex to recover the full span. Concatenating the lengths of the citation's attributes (plaintiff, defendant, etc.) does not seem to be a viable solution either, because the second example misses the "Pa. Super" text.
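To illustrate why the span boundaries matter for masking, here is a small sketch that does not call eyecite. mask_spans is a hypothetical helper, and the two spans are computed by hand with str.find in place of whatever the library would return.

```python
from typing import Iterable, Tuple


def mask_spans(text: str, spans: Iterable[Tuple[int, int]], mask: str = "[CITATION]") -> str:
    """Replace each (start, end) span with the mask token."""
    # Work from right to left so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


text = "See Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)."
core = "604 A.2d 1098"
wanted = "Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)"

# Span covering only the volume/reporter/page part of the citation.
core_span = (text.find(core), text.find(core) + len(core))
# Span covering the case name and the parenthetical as well.
wanted_span = (text.find(wanted), text.find(wanted) + len(wanted))

print(mask_spans(text, [core_span]))
print(mask_spans(text, [wanted_span]))
```

Masking with the shorter span leaves the party names and the court/year parenthetical in the text, which is exactly the leakage I want to avoid in the training data.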
Desired behavior
It would be nice to have a 'full_span()' function such that, if I use it instead of span() in the above example, I get
Specs
eyecite version : 2.4.0