freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License
118 stars 29 forks

Fixed parenthesis, added full_span function and test #136

Closed (overmode closed 1 year ago)

overmode commented 1 year ago

I created a new full_span() function that takes the whole citation into account (including plaintiff, defendant, and post_citation). We still have an issue when the plaintiff name is made up of several words, or when the extra attribute contains too much text, but this can be solved independently. Have a good weekend!
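To illustrate the difference between the regular span and the full span, here is a minimal sketch. It is not eyecite's implementation: `ToyCitation`, its fields, and the `locate` helper are hypothetical, chosen only to show the idea that the full span stretches from the start of the case name to the end of the post-citation text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]


@dataclass
class ToyCitation:
    """Hypothetical stand-in for a parsed citation (not eyecite's class)."""

    cite_span: Span                    # volume, reporter, page only
    pre_span: Optional[Span] = None    # "plaintiff v. defendant", if found
    post_span: Optional[Span] = None   # pin cite, year, etc., if found

    def span(self) -> Span:
        # Regular span: just the core citation.
        return self.cite_span

    def full_span(self) -> Span:
        # Full span: earliest start and latest end of all known parts.
        parts = [s for s in (self.pre_span, self.cite_span, self.post_span) if s]
        return min(s[0] for s in parts), max(s[1] for s in parts)


def locate(haystack: str, needle: str) -> Span:
    # Hypothetical helper: offsets of the first occurrence of needle.
    start = haystack.index(needle)
    return start, start + len(needle)


text = "See Bush v. Gore, 531 U.S. 98, 100 (2000) for details."
cite = ToyCitation(
    cite_span=locate(text, "531 U.S. 98"),
    pre_span=locate(text, "Bush v. Gore"),
    post_span=locate(text, ", 100 (2000)"),
)

regular = text[slice(*cite.span())]     # "531 U.S. 98"
full = text[slice(*cite.full_span())]   # "Bush v. Gore, 531 U.S. 98, 100 (2000)"
```

When no case name or post-citation text is found, the full span in this sketch falls back to the regular span.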

CLAassistant commented 1 year ago

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: mlissner
:x: overmode

mlissner commented 1 year ago

Also, don't forget to sign our license, please.

Thanks for this neat feature.

flooie commented 1 year ago

@mlissner That being said, I think this is actually a great addition, and one I have wanted before in practice.

overmode commented 1 year ago

About the license, should I personally sign it, or should the company I am working at do it?

overmode commented 1 year ago

FYI, here is a list of examples and their spans (full span in red, regular span in blue): span_examples.zip

There is also a rough analysis of the first 100 lines of the doc (random citations from 10 random docs from the search_opinion table in the courtlistener db).

mlissner commented 1 year ago

> About the license, should I personally sign it, or should the company I am working at do it?

I think you should sign it, but when you do, you're saying that if your employer owns the IP, you have the ability to assign it to us and are doing so.

mlissner commented 1 year ago

Sorry, I won't get to this today, but it's on my list for tomorrow. I got bogged down in a major architecture issue, but tomorrow should be better.

overmode commented 1 year ago

I signed the contributor license and sent it to legal@free.law. I also pushed a new commit because I realized that stopwords such as 'citing' were included in the span.
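As a rough illustration of the stopword fix, the sketch below moves the start of a candidate span forward past any leading stopwords. This is an assumption about the general approach, not the code from the commit: `STOP_WORDS` and `trim_leading_stop_words` are hypothetical names, and the word list is made up for the example.

```python
import re
from typing import Tuple

# Hypothetical stopword list; the real one may differ.
STOP_WORDS = {"citing", "quoting", "see", "also"}


def trim_leading_stop_words(text: str, start: int, end: int) -> Tuple[int, int]:
    """Advance `start` past any stopwords at the beginning of text[start:end]."""
    while True:
        # Match the first word plus any whitespace or commas that follow it.
        match = re.match(r"\s*([A-Za-z]+)\b[\s,]*", text[start:end])
        if not match or match.group(1).lower() not in STOP_WORDS:
            return start, end
        start += match.end()


text = "as noted, citing Bush v. Gore, 531 U.S. 98 (2000)"
naive_start = text.index("citing")  # a span that wrongly begins at the stopword
start, end = trim_leading_stop_words(text, naive_start, len(text))
trimmed = text[start:end]  # "Bush v. Gore, 531 U.S. 98 (2000)"
```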

mlissner commented 1 year ago

Cool. I cleaned up a couple things, but this looks good to me on the whole. I'm going to merge and hopefully that'll run our benchmark suite, which runs eyecite across a big collection of data (I forget how much). If that runs smoothly, I think we'll be in business. I do wish I knew how span was getting set though!

flooie commented 1 year ago

@overmode - I want to like your idea of using regexes, but I want to run a few things by you. In the Pa Super example, our citation string in Courts-db is Pa. Super. Ct. How would we handle the Ct. in the citation? I understand how we handle the whitespace, but we can't just allow a wildcard to handle the Ct.

mlissner commented 1 year ago

I think this is complex enough that it deserves its own issue (probably in the courts DB repo), but my take is that the citation_string field is currently a static string, and it needs to change to something else. Probably there aren't so many variations that we need regexes, and I'd suggest something like:

["Pa. Super. Ct.", "Pa.Super."]

And we could also eliminate whitespace and punctuation on the array members to good effect...probably.
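A minimal sketch of that suggestion, assuming the citation_string field becomes a list of variants and that both the variants and the candidate text are compared after stripping whitespace and punctuation. The `COURT_CITATION_STRINGS` mapping, the `pa_superior` key, and the `normalize` and `find_court` helpers are all hypothetical and do not reflect the actual courts DB schema.

```python
import string
from typing import Dict, List, Optional

# Hypothetical data: a placeholder court ID mapped to its accepted variants.
COURT_CITATION_STRINGS: Dict[str, List[str]] = {
    "pa_superior": ["Pa. Super. Ct.", "Pa.Super."],
}

# Translation table that deletes all ASCII whitespace and punctuation.
_STRIP = str.maketrans("", "", string.whitespace + string.punctuation)


def normalize(value: str) -> str:
    """Drop whitespace and punctuation, then lowercase."""
    return value.translate(_STRIP).lower()


def find_court(candidate: str) -> Optional[str]:
    """Return the court ID whose normalized variants include the candidate."""
    key = normalize(candidate)
    for court_id, variants in COURT_CITATION_STRINGS.items():
        if key in {normalize(v) for v in variants}:
            return court_id
    return None


match_short = find_court("Pa. Super.")     # matches the "Pa.Super." variant
match_long = find_court("Pa.Super.Ct.")    # matches the "Pa. Super. Ct." variant
no_match = find_court("Pa. Commw. Ct.")    # no variant matches
```

With this normalization, spacing and punctuation differences collapse to the same key, so the Ct. suffix is handled by listing both forms explicitly rather than by a wildcard.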

mlissner commented 1 year ago

@overmode, just a quick update to let you know that 2.5.1 is now released and should have this functionality baked in. Sorry for the delay. Our automated deploy failed and I didn't catch it.