fnl / segtok

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
MIT License

Extensibility through custom regex or abbreviation lists? #18

RyanMcCarl closed this issue 5 years ago

RyanMcCarl commented 5 years ago

This is a brilliant program. I have been looking for a way to reliably split sentences and find citations in legal writing, which is full of abbreviations and parentheses. This script could probably go a long way toward handling it if I added a list of common abbreviations or regular expressions designed to identify legal citations. Here's a typically convoluted citation sentence that you might find in a legal brief:

(See, e.g., Garrett v. Coast & Southern Fed. Sav. & Loan Assn., 9 Cal. 3d 731 (Cal. 1973) (overruling Finger v. McCaughey, 114 Cal. 64 (1896), and holding that a retroactive interest charge is a penalty).)

Short of interfering with the code directly in a fork, is there a way I could extend the list of abbreviations or regular expressions to create rules catching common abbreviations and patterns? @fnl

fnl commented 5 years ago

Hi @RyanMcCarl - glad you like using this tool. Sorry, there is no parameter to modify the abbreviations. But if you fork the tool and add a line to the ABBREVIATIONS string (in segtok/segmenter.py) you should be able to add anything you need fairly quickly.

fnl commented 5 years ago

Oh, and naturally, PRs that make abbreviations a parametric option are welcome! :-)

RyanMcCarl commented 5 years ago

Great, thanks!

RyanMcCarl commented 5 years ago

Actually, quick follow-up: I just realized I am looking at the older segtok code. I don't see the abbreviations variable in the newer syntok code; does the new code handle abbreviations differently? I am happy to fork and work from segtok as well, but wanted to be sure that is the right choice. Thanks again. @fnl

fnl commented 5 years ago

Well, that depends, @RyanMcCarl - segtok has been around for a while and is certainly production-ready and hardened by multiple people/orgs using it. syntok simplifies a lot of things, fixes cases that segtok cannot handle, and gets the design/order right (first tokenize, then segment).

That said, syntok also has abbreviation handling, and the abbreviations are found in syntok/_segmentation_states/State#abbreviations (on line 21 right now).

So yes, if you are willing to take the risk of being an early adopter, and like the new design, I definitely recommend choosing syntok over segtok from a design PoV. Your call....

RyanMcCarl commented 5 years ago

Great, thank you. @fnl

fnl commented 5 years ago

Note that syntok has probably matured enough by now to be used more widely. You can also set your own abbreviation list or expand the existing one (as shown next) quite easily there with a little trick:

from syntok._segmentation_states import State

# your_list: a list of your own abbreviation strings
State.abbreviations = frozenset(your_list + list(State.abbreviations))
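
For anyone who wants to try the pattern before touching syntok itself, here is a minimal sketch of the same merge wrapped in a small helper. It assumes the class attribute is called `abbreviations`, as named earlier in this thread. `DemoState`, `extend_abbreviations`, and `LEGAL_ABBREVIATIONS` are hypothetical names made up for illustration, not part of syntok; `DemoState` only stands in for the real `State` class so the snippet runs without syntok installed.

```python
# Stand-in for syntok's State class so this sketch runs on its own.
# Hypothetical: the real class lives in syntok._segmentation_states.
class DemoState:
    abbreviations = frozenset({"Mr", "Dr", "approx"})


def extend_abbreviations(state_cls, extra):
    """Merge `extra` into `state_cls.abbreviations` and return the new set."""
    merged = frozenset(extra) | frozenset(state_cls.abbreviations)
    # Assign on the class so lookups through the class see the merged set.
    state_cls.abbreviations = merged
    return merged


# Abbreviations taken from the citation example earlier in the thread.
# Assumption: entries are stored without the trailing period; check the
# existing set in your syntok version before relying on this.
LEGAL_ABBREVIATIONS = ["Fed", "Sav", "Assn", "Cal"]

merged = extend_abbreviations(DemoState, LEGAL_ABBREVIATIONS)
print(sorted(merged))
```

With syntok installed, the same helper could presumably be called as `extend_abbreviations(State, LEGAL_ABBREVIATIONS)` after `from syntok._segmentation_states import State`, before any text is segmented. Since the module name starts with an underscore, it is a private module and may change between releases.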