Closed by RyanMcCarl 5 years ago
Hi @RyanMcCarl - glad you like using this tool. Sorry, there is no parameter to modify the abbreviations. But if you fork the tool and add a line to the ABBREVIATIONS string (in segtok/segmenter.py) you should be able to add anything you need fairly quickly.
Oh, and naturally, PRs that make abbreviations a parametric option are welcome! :-)
Great, thanks!
Actually, quick follow-up: I just realized I am looking at the older segtok code. I don't see the abbreviations variable in the newer syntok code; the new code must handle abbreviations differently? I am happy to fork and work from this one as well, but wanted to be sure that is the right choice. Thanks again. @fnl
Well, that depends, @RyanMcCarl - segtok has been around for a while and is certainly production-ready and hardened by multiple people/orgs using it. syntok simplifies a lot of things, fixes cases that segtok cannot handle, and gets the design/order right (first tokenize, then segment).
That said, syntok also has abbreviation handling, and the abbreviations are found in syntok/_segmentation_states/State#abbreviations (in line 21 right now).
So yes, if you are willing to take the risk of being an early adopter, and like the new design, I definitely recommend choosing syntok over segtok from a design PoV. Your call....
Great, thank you. @fnl
Note that syntok has probably matured enough by now to be used more widely. You can also set your own abbreviation list, or expand the existing one (as shown next), quite easily with a little trick:
from syntok._segmentation_states import State
# your_list: your own abbreviations, as a list of strings
State.abbreviations = frozenset(your_list + list(State.abbreviations))
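One caveat with the trick above: it replaces a class attribute, so the change is global for the whole process. If that is a concern, something like the following sketch might help. It wraps the same assignment in a context manager that restores the original set afterwards. The helper name `extra_abbreviations`, the stand-in class `FakeState`, and the `LEGAL` list are all made up for illustration; the stand-in only exists so the snippet runs without syntok installed. Whether syntok expects entries without the trailing period is an assumption you should check against the list in the source.

```python
from contextlib import contextmanager


@contextmanager
def extra_abbreviations(state_cls, extra):
    """Temporarily add `extra` to state_cls.abbreviations; restore it on exit."""
    original = state_cls.abbreviations
    state_cls.abbreviations = frozenset(original) | frozenset(extra)
    try:
        yield state_cls.abbreviations
    finally:
        # Always put the original set back, even if the body raised.
        state_cls.abbreviations = original


class FakeState:
    """Stand-in for syntok's State class (hypothetical, for demonstration only)."""
    abbreviations = frozenset({"Dr", "Prof"})


# Illustrative legal abbreviations, written without the trailing period
# (assumption: that is the form the abbreviation set expects).
LEGAL = ["Cal", "Fed", "Sav", "Assn"]

with extra_abbreviations(FakeState, LEGAL) as merged:
    inside = merged               # built-in entries plus the legal ones
after = FakeState.abbreviations   # back to the original set
```

With syntok installed, you would pass `State` from `syntok._segmentation_states` instead of `FakeState` and run the segmenter inside the `with` block.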
This is a brilliant program. I have been looking for a way to reliably split sentences and find citations in legal writing, which is full of abbreviations and parentheses. This script could probably go a long way toward handling it if I added a list of common abbreviations or regular expressions designed to identify legal citations. Here's a typically convoluted citation sentence that you might find in a legal brief:
(See, e.g., Garrett v. Coast & Southern Fed. Sav. & Loan Assn., 9 Cal. 3d 731 (Cal. 1973) (overruling Finger v. McCaughey, 114 Cal. 64 (1896), and holding that a retroactive interest charge is a penalty).)
Short of interfering with the code directly in a fork, is there a way I could extend the list of abbreviations or regular expressions to create rules catching common abbreviations and patterns? @fnl
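For the citation-matching half of the question, a plain regular expression run separately from the segmenter might already get you some way. The sketch below is not part of segtok or syntok; the pattern and the `find_citations` helper are hypothetical and only cover the "volume reporter page (court year)" shape seen in the example above, not legal citation formats in general.

```python
import re

# Hypothetical pattern: volume, reporter abbreviation with an optional series
# (such as "3d"), first page, then a parenthetical holding an optional court
# abbreviation and a four-digit year.
CITATION = re.compile(
    r"\b(?P<volume>\d+) "
    r"(?P<reporter>[A-Z][A-Za-z]*\.(?: \d[a-z]{1,2})?) "
    r"(?P<page>\d+) "
    r"\((?:(?P<court>[A-Z][A-Za-z]*\.) )?(?P<year>\d{4})\)"
)


def find_citations(text):
    """Return one dict per match with volume, reporter, page, court, year."""
    return [m.groupdict() for m in CITATION.finditer(text)]


SAMPLE = (
    "(See, e.g., Garrett v. Coast & Southern Fed. Sav. & Loan Assn., "
    "9 Cal. 3d 731 (Cal. 1973) (overruling Finger v. McCaughey, "
    "114 Cal. 64 (1896), and holding that a retroactive interest charge "
    "is a penalty).)"
)

citations = find_citations(SAMPLE)
for c in citations:
    print(c["volume"], c["reporter"], c["page"], c["court"], c["year"])
```

On the sample sentence this finds two citations: `9 Cal. 3d 731` with court `Cal.` and year `1973`, and `114 Cal. 64` with no court and year `1896`. Case names, pin cites, and multi-word reporters are deliberately out of scope here.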