SuffolkLITLab / FormFyxer

A tool for learning about and pre-processing forms
MIT License

Citations don't have useful information #81

Closed BryceStevenWilley closed 1 year ago

BryceStevenWilley commented 1 year ago

Many of the citations that I've seen don't carry any useful associated information. For example, here is a California form that cites the Code of Civil Procedure (in the bottom right corner). The relevant text is Code of Civil Procedure, § 706.124. Because that specific regex string isn't in reporters_db, eyecite only recognizes that the § character is a section marker of some sort, and returns UnknownCitation('§', metadata=CitationBase.Metadata(parenthetical=None)), which by default gives none of the relevant text surrounding the citation. That information is important if we want to suggest that people remove citations from forms.

My best suggestion for moving forward is to use the index attribute of the citation object to get its position in the original text, and grab at least 10 tokens before and after the symbol for context, which we can print when we print the citation. The difficulty is recreating the tokenization process (I think they include whitespace as separate tokens).
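One way to sidestep the tokenization question is to work from character offsets instead of token indices. The sketch below is only an illustration of that idea, not eyecite's API: `context_window` is a hypothetical helper, the sample text is made up, and it assumes we can get the start and end character offsets of the matched symbol (I believe eyecite citation objects expose these through `span()`, but that should be checked against the installed version). Words here are plain whitespace-delimited chunks, not eyecite tokens.

```python
def context_window(text: str, start: int, end: int, n_words: int = 10) -> str:
    """Return text[start:end] with up to n_words whitespace-delimited
    words of context on each side, joined by single spaces.

    Hypothetical helper: start/end are character offsets into text,
    which we assume the citation object can provide.
    """
    before = text[:start].split()
    after = text[end:].split()
    # Guard n_words == 0 explicitly, since before[-0:] would keep everything.
    before = before[-n_words:] if n_words > 0 else []
    after = after[:n_words] if n_words > 0 else []
    return " ".join(before + [text[start:end]] + after)


# Made-up sample text standing in for the text extracted from a form.
sample = (
    "Wage garnishment forms. Code of Civil Procedure, § 706.124. "
    "Judicial Council of California"
)
# Stand-in for the offsets of the UnknownCitation match.
start = sample.find("§")
end = start + 1
snippet = context_window(sample, start, end, n_words=3)
print(snippet)
```

With a window of 10 words, as suggested above, the printed snippet would include "Code of Civil Procedure" for this example, which is the part that is currently lost. Note that if the span starts or ends mid-word, joining with spaces inserts a space that is not in the original text.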

nonprofittechy commented 1 year ago

I sent an email about this a few months ago to the upstream team (but didn't open a corresponding issue): https://github.com/freelawproject/eyecite.