djc / instant-segment

Fast English word segmentation in Rust
Apache License 2.0

Some suggestions for ease of use #36

Open indigoviolet opened 1 year ago

indigoviolet commented 1 year ago

I wrote a small wrapper while using this library recently, and there might be some bits worth considering for upstreaming into the library.

  1. Make the common case of using the supplied unigram/bigram data automatic, by using cached_path
  2. Provide the common configuration for a Segmenter
  3. Hide the Search object or explain caveats around it - I'm not sure why it is required
  4. Handle empty strings
  5. Perhaps handle non-lowercase ascii characters (by returning them as segments or skipping them?)
```python
from dataclasses import dataclass
from functools import lru_cache

import instant_segment
from cached_path import cached_path

BIGRAMS_DATA_PATH = (
    "https://github.com/InstantDomain/instant-segment/raw/main/data/en-bigrams.txt"
)
UNIGRAMS_DATA_PATH = (
    "https://github.com/InstantDomain/instant-segment/raw/main/data/en-unigrams.txt"
)


def unigrams():
    # Each line is "word\tcount"
    for ln in cached_path(UNIGRAMS_DATA_PATH).read_text().splitlines():
        parts = ln.split("\t", 1)
        yield (parts[0], float(parts[1].strip()))


def bigrams():
    # Each line is "word1 word2\tcount"
    for ln in cached_path(BIGRAMS_DATA_PATH).read_text().splitlines():
        word_split = ln.split(" ", 1)
        score_split = word_split[1].split("\t", 1)
        yield ((word_split[0], score_split[0]), float(score_split[1].strip()))


@dataclass
class Segmenter:
    def __post_init__(self):
        self.segmenter = instant_segment.Segmenter(unigrams(), bigrams())
        self.search = instant_segment.Search()

    def segment(self, word: str) -> list[str]:
        if not word:
            return []
        self.segmenter.segment(word, self.search)
        return list(self.search)

    @classmethod
    @lru_cache(maxsize=1)
    def instance(cls):
        return Segmenter()
```
djc commented 1 year ago
  1. I'm not sure we should bring in the dependency on cached_path, I suppose we could do a minimal version of that?
  2. Seems fairly trivial once you have 1?
  3. I guess it's mainly useful on the Rust side -- it allows you to reuse caches when running multiple segmentations in parallel.
  4. See #37.
  5. Not sure what you see as the use case? There doesn't seem to be a single obvious solution, so I think it makes sense for the caller to handle this part.
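
A minimal version of (1) could be little more than a hash-keyed download cache. A sketch, not what the library would necessarily ship — the cache directory and the lack of revalidation (no ETag/`If-Modified-Since` handling) are assumptions here:

```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "instant-segment"


def cache_key(url: str) -> str:
    # Stable filename derived from the URL, so repeated calls hit the cache
    return hashlib.sha256(url.encode()).hexdigest()


def cached_path(url: str) -> Path:
    """Download `url` once and return the local path on later calls."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / cache_key(url)
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```

This never invalidates: if the upstream data files change, the stale copy is served until the cache directory is cleared.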
indigoviolet commented 1 year ago

Sounds good, thanks for the quick fix on (4).

  1. Fine by me to not pull in the dependency, though I also don't see the downside
  2. 🤷🏾 All of these suggestions are trivial; they just help to go from "install library -> read some docs and write some helpers, maybe in a util file -> first use" to "install library -> use"
  3. Gotcha. Yeah, again, hiding it helps with that tiny bit of friction
  5. It's easy enough to handle; it just surprised me with an error in the middle of a long list. Then there's a bit of regex to identify the offending characters, pre-split on those, then segment each split and chain them together...
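
The pre-split-and-chain approach could look roughly like this. `segment_mixed` and its policy of passing non-lowercase runs through as their own segments are my own assumptions, not library API:

```python
import re

# Runs of lowercase ASCII alternate with runs of everything else
TOKEN_RE = re.compile(r"[a-z]+|[^a-z]+")


def segment_mixed(segment, text):
    """Segment lowercase runs with `segment`; pass other runs through as-is."""
    out = []
    for run in TOKEN_RE.findall(text):
        if "a" <= run[0] <= "z":
            out.extend(segment(run))
        else:
            out.append(run)
    return out
```

Here `segment` would be something like `Segmenter.instance().segment`; whether to keep, drop, or normalize the non-lowercase runs is exactly the policy question with no single obvious answer.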

Anyway, I'd be fine if you closed this issue as is, or linked to it from the docs, or took any or none of these suggestions.

djc commented 1 year ago

Would you be interested in submitting a PR to keep a Search handy inside the Python Segmenter so that you don't have to pass one in? Should be pretty straightforward...
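
Not the actual binding code, but the shape of that change might be something like the following sketch, where the class and parameter names are placeholders:

```python
class SegmenterWithSearch:
    """Sketch: wrap a segmenter and reuse one internal search buffer."""

    def __init__(self, inner, make_search):
        self._inner = inner           # e.g. an instant_segment.Segmenter
        self._search = make_search()  # e.g. instant_segment.Search

    def segment(self, word, search=None):
        # Callers who run segmentations in parallel can still pass their own
        # Search; everyone else gets the internal one for free.
        search = self._search if search is None else search
        self._inner.segment(word, search)
        return list(search)
```

The optional `search` argument keeps the parallel-reuse capability djc mentions while removing the extra object from the common single-threaded path.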