alexharri / beygla

Tiny (5kB gzipped) declension helper for Icelandic names.
MIT License

Add 'beygla/strict' module #15

Closed alexharri closed 7 months ago

alexharri commented 7 months ago

Closes #14

What

Adds a new strict version of beygla, which is accessed under beygla/strict:

import { applyCase } from "beygla/strict";

There are two main differences between beygla and beygla/strict:

The reason for the 3x size increase is that the beygla/strict version encodes all legal Icelandic names and bundles them in the library.

Because only known names are declined in beygla/strict, the declensions are guaranteed to be correct. The tradeoff, aside from the bundle size, is that correct declensions for non-Icelandic names are not applied.

How

Name encoding

The set of Icelandic names is encoded in a single large string. The string contains a trie-encoding that works like so:

Here's an example of names encoded using this method:

ás.t.vald.ur.<<<<r.<<in.<<eig.<<<<björn.

This encodes the following names:

Ás
Ást
Ástvald
Ástvaldur
Ástvar
Ástvin
Ástveig
Ástbjörn
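
Reading the rules off the example above, the encoding appears to work like this: a letter extends the current prefix, `.` marks the current prefix as a complete name, and `<` removes the last character of the prefix. The decoder below is a minimal sketch under that reading. The function name `decodeNames` is hypothetical, and this is not necessarily how beygla/strict itself reads the string.

```typescript
// Sketch of a decoder for the trie-style string shown above.
// Assumed rules (inferred from the example, not taken from beygla's source):
//   "."  -> the current prefix is a complete name
//   "<"  -> drop the last character of the current prefix
//   else -> append the character to the current prefix
function decodeNames(encoded: string): string[] {
  const names: string[] = [];
  let prefix = "";
  for (let i = 0; i < encoded.length; i++) {
    const ch = encoded[i];
    if (ch === ".") {
      names.push(prefix);
    } else if (ch === "<") {
      prefix = prefix.slice(0, -1);
    } else {
      prefix += ch;
    }
  }
  return names;
}

const decoded = decodeNames("ás.t.vald.ur.<<<<r.<<in.<<eig.<<<<björn.");
console.log(decoded);
```

The decoded names come out in lower case, matching the encoded string. Comparing them against the capitalized list above would need a case-insensitive check or an extra capitalization step.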

I tried various compression methods such as:

All of these methods reduced the size of the string. However, each of them made gzip compression less effective and resulted in a net size increase. For that reason we stick with the super-simple encoding.

Add setPredicate to beygla

To avoid polluting the interface of applyCase, beygla exposes a new undocumented setPredicate export that can be used to provide a predicate that determines whether or not a name is declined.

beygla/strict uses this by providing a predicate and re-exporting beygla:

import { setPredicate } from "./beygla";

function isIcelandicName(name: string): boolean {
  // ...
}
setPredicate(isIcelandicName);

export * from "./beygla";

This guarantees that the API for beygla and beygla/strict stays the same.
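
To illustrate the pattern, here is a self-contained sketch of how a module-level predicate can gate declension without changing the public function signature. Only `setPredicate` is a name from this PR. `declineIfAllowed`, `markDeclined` and `knownNames` are hypothetical stand-ins, and the real declension logic in beygla is not shown.

```typescript
type NamePredicate = (name: string) => boolean;

// By default every name is eligible, mirroring non-strict behaviour.
let predicate: NamePredicate = () => true;

// Replaces the module-level predicate. The strict entry point would call this once.
function setPredicate(fn: NamePredicate): void {
  predicate = fn;
}

// Hypothetical stand-in for the declension step: the name is passed to
// `decline` only if the predicate accepts it, otherwise returned unchanged.
function declineIfAllowed(name: string, decline: (name: string) => string): string {
  return predicate(name) ? decline(name) : name;
}

// Usage: a tiny made-up name list and a fake "declension" that appends "*".
const knownNames = new Set<string>(["Ástvaldur", "Ástbjörn"]);
setPredicate((name) => knownNames.has(name));

const markDeclined = (name: string): string => `${name}*`;
console.log(declineIfAllowed("Ástvaldur", markDeclined)); // known name, declined
console.log(declineIfAllowed("John", markDeclined)); // unknown name, unchanged
```

Because the predicate lives in module state rather than in a function parameter, callers of either entry point see the same exports and signatures.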

Drive-by

Handle multiple name categories for single entry in BÍN data

There are 3 entries in the BÍN data containing multiple word categories, one of which was added since beygla was last updated.

Instead of filtering them out, as is currently done, we now treat multiple categories for a single entry as valid.

oddsson commented 7 months ago

This solution looks awesome to me 🤩

Thanks so much for acting on this..