646e62 / case-brief

Generates a FIRAC-style case brief from a reported decision
GNU General Public License v3.0

spaCy sentencizer incorrectly breaks paragraphs, sentences #7

Closed 646e62 closed 1 year ago

646e62 commented 1 year ago

spaCy treats some abbreviations that are common in written decisions (e.g., "para." for "paragraph" and "Cst." for "Constable") as sentence breaks. This adds time to manual data cleaning and makes it harder to get input into the program. The data cleaning functions need to be updated to correct these problems so that text files can be loaded automatically and reliably.

646e62 commented 1 year ago

The solution we discussed this morning worked, but at the price of removing the word from the sentence entirely. Instead, I implemented a solution that removes only the period at the end of each abbreviated word. It seems to be working so far and should keep working, since these abbreviations rarely appear at the end of a sentence.
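
For reference, the period-stripping approach could look roughly like the sketch below. This is not the code in the repository: the function name `strip_abbreviation_periods` and the `ABBREVIATIONS` list are hypothetical, and the list holds only the two abbreviations named in this issue. It uses the standard `re` module and would run on the raw text before it is handed to spaCy.

```python
import re

# Hypothetical list: only the two abbreviations mentioned in this issue.
# A real cleaning step would likely need more entries.
ABBREVIATIONS = ["para", "Cst"]

# Match a listed abbreviation as a whole word immediately followed by a period.
# Longer entries are tried first so that a shorter entry cannot shadow them.
_ABBREV_RE = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")\."
)


def strip_abbreviation_periods(text: str) -> str:
    """Drop the trailing period from known abbreviations and keep the word."""
    return _ABBREV_RE.sub(r"\1", text)


if __name__ == "__main__":
    sample = "At para. 12, Cst. Smith testified."
    print(strip_abbreviation_periods(sample))
```

The trade-off is the one noted above: if an abbreviation ever does end a sentence, its period is removed too and that sentence boundary is lost.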