What
Adds a new strict version of beygla, which is accessed under beygla/strict:
import { applyCase } from "beygla/strict";
There are two main differences between beygla and beygla/strict:
beygla declines all names it finds a pattern for. beygla/strict only declines Icelandic names (as specified in icelandic-names.csv).
beygla is <5kB while beygla/strict is <15kB.
The reason for the 3x size increase is that the beygla/strict version encodes all legal Icelandic names and bundles them in the library.
Because only known names are declined in beygla/strict, the declensions are guaranteed to be correct. The tradeoff, aside from the bundle size, is that correct declensions for non-Icelandic names are not applied.
How
Name encoding
The set of Icelandic names is encoded in a single large string. The string contains a trie-encoding that works like so:
Initialize an empty stack of characters.
For each character in the string:
If . is encountered, the current stack represents an Icelandic name.
If < is encountered, pop the last character from the stack.
If any other character is encountered, append it to the stack.
Here's an example of names encoded using this method:
ás.t.vald.ur.<<<<r.<<in.<<eig.<<<<björn.
This encodes the following names:
Ás
Ást
Ástvald
Ástvaldur
Ástvar
Ástvin
Ástveig
Ástbjörn
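The decoding steps above can be sketched as follows. This is a minimal illustration, not the code shipped in beygla/strict; the helper name decodeNames is hypothetical. Note that the encoded string is lowercase, so the decoded names come out lowercase as well.

```typescript
// Hypothetical helper illustrating the stack-based decoding described above.
// It is not part of beygla's public API.
function decodeNames(encoded: string): string[] {
  const stack: string[] = [];
  const names: string[] = [];
  for (const ch of encoded) {
    if (ch === ".") {
      // The current stack spells out a complete name.
      names.push(stack.join(""));
    } else if (ch === "<") {
      // Step back one character in the trie.
      stack.pop();
    } else {
      stack.push(ch);
    }
  }
  return names;
}

const encoded = "ás.t.vald.ur.<<<<r.<<in.<<eig.<<<<björn.";
const decoded = decodeNames(encoded);
// A Set gives a cheap membership check once the names are decoded.
const nameSet = new Set(decoded);
```

With the example string, nameSet.has("ástvar") is true while nameSet.has("ástv") is false, since "ástv" is only a shared prefix and is never followed by a ".".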
I tried various compression methods such as:
Pack the bits of the 5/6 bit characters into bytes (Icelandic characters can be encoded using 6 bits, or 5 if you add a separate character to denote accented characters).
Compress long <<<< sequences into numbers e.g. <<<< becomes 4.
Use uppercase to denote the end of a string e.g. ás.t.vald.ur. becomes áSTvalDuR.
All of these methods reduced the size of the string. However, each of them made gzip compression less effective and resulted in a net size increase. For that reason we stick with the super-simple encoding.
Add setPredicate to beygla
To avoid polluting the interface of applyCase, beygla exposes a new undocumented setPredicate export that can be used to provide a predicate that determines whether or not a name is declined.
beygla/strict uses this by providing a predicate and re-exporting beygla:
import { setPredicate } from "./beygla";
function isIcelandicName(name: string): boolean {
// ...
}
setPredicate(isIcelandicName);
export * from "./beygla";
This guarantees that the API for beygla and beygla/strict stays the same.
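To make the mechanism concrete, here is a rough sketch of how a module-level predicate hook of this kind could work. Everything except the name setPredicate is an assumption made for illustration: declineByPattern and applyCaseToName are hypothetical stand-ins, and the actual internals of beygla may differ.

```typescript
type Predicate = (name: string) => boolean;

// Module-level hook. By default every name is eligible for declension.
let predicate: Predicate = () => true;

function setPredicate(fn: Predicate): void {
  predicate = fn;
}

// Hypothetical stand-in for the pattern-based declension logic.
// It just tags the name with the requested case so the flow is visible.
function declineByPattern(name: string, caseStr: string): string {
  return `${name}[${caseStr}]`;
}

// Hypothetical entry point: names rejected by the predicate are
// returned unchanged instead of being declined.
function applyCaseToName(caseStr: string, name: string): string {
  if (!predicate(name)) return name;
  return declineByPattern(name, caseStr);
}

const beforePredicate = applyCaseToName("ef", "Bob");
setPredicate((name) => name === "Anna");
const afterAccepted = applyCaseToName("ef", "Anna");
const afterRejected = applyCaseToName("ef", "Bob");
```

Because the predicate lives in module state rather than in the function's parameters, the signature callers see stays the same in both builds.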
Drive-by
Handle multiple name categories for single entry in BÍN data
There are 3 entries in the BÍN data containing multiple word categories, one of which was added since beygla was last updated.
Instead of filtering them out, as is currently done, we now treat multiple categories for a single entry as valid.
Closes #14