Closed overmode closed 1 year ago
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.
:white_check_mark: mlissner
:x: overmode
Also, don't forget to sign our license, please.
Thanks for this neat feature.
@mlissner That being said, I think this is a great addition, and one I have wanted before in practice.
About the license, should I personally sign it, or should the company I am working at do it?
FYI, here is a list of examples and their spans (full span in red, regular span in blue): span_examples.zip
There is also a rough analysis of the first 100 lines of the doc (random citations from 10 random docs from the search_opinion table in the courtlistener db).
> About the license, should I personally sign it, or should the company I am working at do it?
I think you should sign it, but when you do, you're saying that if your employer owns the IP, you have the ability to assign it to us, and are doing so.
Sorry, I won't get to this today, but it's on my list for tomorrow. I got bogged down in a major architecture issue, but tomorrow should be better.
I signed the contributor license and sent it to legal@free.law. I also pushed a new commit because I realized that stopwords such as 'citing' were included in the span.
Cool. I cleaned up a couple of things, but this looks good to me on the whole. I'm going to merge, and hopefully that'll run our benchmark suite, which runs eyecite across a big collection of data (I forget how much). If that runs smoothly, I think we'll be in business. I do wish I knew how span was getting set though!
@overmode - I want to like your idea of using regexes, but I want to run a few things by you. In the Pa Super example, our citation string in Courts-db is "Pa. Super. Ct." How would we handle the "Ct." part of the citation? I understand how we handle the whitespace, but we can't just allow a wildcard to handle the "Ct."
I think this is complex enough that it deserves its own issue (probably in the courts DB repo), but my take is that the citation_string field is currently a static string, and it needs to change to something else. Probably there aren't so many variations that we need regexes, and I'd suggest something like:
["Pa. Super. Ct.", "Pa.Super."]
And we could also eliminate whitespace and punctuation on the array members to good effect...probably.
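To make that concrete, here is a rough sketch of matching against a list of variants after stripping whitespace and punctuation. The names `normalize` and `matches_court` and the list layout are assumptions for illustration only, not the current Courts-db schema or any existing eyecite function.

```python
import re
from typing import Iterable

# Hypothetical data: accepted citation strings for one court.
# Assumed layout for illustration, not the real Courts-db schema.
PA_SUPER_VARIANTS = ["Pa. Super. Ct.", "Pa.Super."]


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def matches_court(candidate: str, variants: Iterable[str]) -> bool:
    """Return True if the candidate equals any variant after normalization."""
    normalized_variants = {normalize(v) for v in variants}
    return normalize(candidate) in normalized_variants
```

With this approach, "Pa. Super." and "Pa Super Ct" would both match the list above, while a different court string such as "Pa. Cmwlth." would not. The trade-off is that distinct courts whose strings differ only in punctuation or spacing would collide, so this probably needs checking against the full data set.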
@overmode, just a quick update to let you know that 2.5.1 is now released and should have this functionality baked in. Sorry for the delay. Our automated deploy failed and I didn't catch it.
I created a new full_span() function that takes into account the whole citation (including plaintiff, defendant, and post_citation). We still have an issue when the plaintiff name is made up of several words, or when the extra attribute contains too much text, but this can be solved independently. Have a good weekend!
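For readers following along, the idea can be sketched roughly as below. This is a standalone illustration, not the code in the PR: the function name `compute_full_span` and its parameters `name_start` and `post_end` are hypothetical, and it assumes the offsets of the party names and the post-citation text have already been located.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]


def compute_full_span(
    core: Span,
    name_start: Optional[int] = None,
    post_end: Optional[int] = None,
) -> Span:
    """Widen the core citation span to cover party names and trailing text.

    core:       (start, end) of the volume/reporter/page part.
    name_start: offset where the plaintiff name begins, if it was found.
    post_end:   offset just past the post-citation text, if it was found.
    Falls back to the core span on either side when an offset is missing.
    """
    start, end = core
    if name_start is not None:
        start = min(start, name_start)
    if post_end is not None:
        end = max(end, post_end)
    return start, end
```

For the text "See Roe v. Wade, 410 U.S. 113 (1973).", the core span covers "410 U.S. 113", and widening it with the start of "Roe" and the end of "(1973)" gives "Roe v. Wade, 410 U.S. 113 (1973)". The hard part mentioned above, finding where a multi-word plaintiff name starts, is outside this sketch.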