alexharri / beygla

Tiny (5kB gzipped) declension helper for Icelandic names.
MIT License

Option to not try to apply case to non-Icelandic names #14

Closed (oddsson closed 8 months ago)

oddsson commented 8 months ago

Hi! 👋

Would it be possible to add an option to applyCase to control whether or not beygla tries to apply case to foreign names?

Example

Current functionality: applyCase('þgf', "Carlos") => Carlosi

Wanted functionality: applyCase('þgf', "Carlos", {applyToForeignNames: false}) => Carlos

Maybe we could do a lookup in /data/icelandic-names.csv if applyToForeignNames is set to false 🤷

alexharri commented 8 months ago

Hey,

I'd love to hear more about the use case. Beygla declines Carlos as (nf Carlos, þf Carlos, þgf Carlosi, ef Carlosar) which seems right to me. Is that not correct? If so, why not?

I'm not implying that this feature is not valuable or that I don't want to support it, I'm just curious as to why disabling declension for foreign names is desirable.


Implementation-wise, the main challenge is determining whether a name is Icelandic or not. This requires encoding the set of Icelandic names and including it in the bundle.

Encoding the set of names will take up some kilobytes, so I would be hesitant to include it in the default beygla export (I imagine it would at least double or triple the size of the library). However, we could add a "strict" version of the module:

import { applyCase } from "beygla/strict";

Where the beygla/strict module contains the encoded set of names and does something like so:

import { applyCase as originalApplyCase } from "./beygla";

function isIcelandicName(name: string): boolean {
  // ...
}

export function applyCase(...) {
  // Some conditional behavior based on 'isIcelandicName'
}
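
For illustration only, the conditional behavior could look roughly like the sketch below. It is not the beygla/strict implementation: `declineStandIn`, `ICELANDIC_NAMES` and `applyCaseStrict` are hypothetical names, the name set is a two-entry placeholder, and the stand-in declension just appends a fixed suffix so the example runs without the library installed.

```typescript
type GrammaticalCase = "nf" | "þf" | "þgf" | "ef";

// Hypothetical stand-in for beygla's real applyCase. It appends a fixed
// suffix per case and is NOT linguistically correct in general.
function declineStandIn(caseStr: GrammaticalCase, name: string): string {
  const suffix: Record<GrammaticalCase, string> = {
    "nf": "",
    "þf": "",
    "þgf": "i",
    "ef": "ar",
  };
  return name + suffix[caseStr];
}

// Placeholder set; a real build would decode the full list of Icelandic names.
const ICELANDIC_NAMES = new Set<string>(["jóhann", "guðmundur"]);

function isIcelandicName(name: string): boolean {
  return ICELANDIC_NAMES.has(name.toLowerCase());
}

// Decline only names found in the set; return every other name unchanged.
function applyCaseStrict(caseStr: GrammaticalCase, name: string): string {
  return isIcelandicName(name) ? declineStandIn(caseStr, name) : name;
}
```

With this sketch, a name missing from the set (such as "Carlos") comes back untouched, while a listed name is passed on to the declension function.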

Anyway, I'll explore how significantly we can compress the list of Icelandic names. It seems like a fun problem to solve.

alexharri commented 8 months ago

Quick update: a naive trie encoding of the list of Icelandic names yields a size of ~10kB gzipped:

Created file 'names-ser.txt'
        Size:           46.02 kB
        Gzip size:      10.61 kB (23.05%)
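
The thread does not show the serialization format that produced these numbers. As a rough sketch of what a naive trie encoding can look like, the code below builds a character trie and writes it out as a nested string, where `!` marks the end of a name and parentheses wrap a node's children. The format and the function names (`buildTrie`, `serializeTrie`, `hasName`) are assumptions made for this example, not taken from beygla.

```typescript
interface TrieNode {
  children: Map<string, TrieNode>;
  isName: boolean; // true if a name ends at this node
}

function createNode(): TrieNode {
  return { children: new Map<string, TrieNode>(), isName: false };
}

// Insert every name character by character, sharing common prefixes.
function buildTrie(names: string[]): TrieNode {
  const root = createNode();
  for (const name of names) {
    let node = root;
    for (const ch of name) {
      let next = node.children.get(ch);
      if (!next) {
        next = createNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isName = true;
  }
  return root;
}

// Write each child as: character, "!" if a name ends there, then "(children)".
function serializeTrie(node: TrieNode): string {
  let out = "";
  node.children.forEach((child, ch) => {
    out += ch;
    if (child.isName) out += "!";
    if (child.children.size > 0) out += "(" + serializeTrie(child) + ")";
  });
  return out;
}

// Walk the trie; the name is present only if the walk ends on a name node.
function hasName(root: TrieNode, name: string): boolean {
  let node: TrieNode = root;
  for (const ch of name) {
    const next = node.children.get(ch);
    if (!next) return false;
    node = next;
  }
  return node.isName;
}
```

Shared prefixes are stored once, which is where the savings over a plain newline-separated list would come from before gzip is applied.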

alexharri commented 8 months ago

Hey @oddsson,

I've created a PR that implements beygla/strict (see #15). Would this implementation work for your use case?

PS: Feel free to review the PR if you've got the time!

oddsson commented 8 months ago

Hey @alexharri, sorry for the radio silence 🤐

Thanks so much for acting on this. You are absolutely correct: beygla declines Carlos correctly. However, we are using beygla in a public-sector project, and our users care a lot about grammatically correct Icelandic. They do not want foreign names declined, and since we have no way to determine the nationality behind a name, we decline everything. As a result, our users manually "correct" foreign names after the fact, and foreign names are really common in our system.

I'll try beygla/strict within our project today or tomorrow and report back 🤝 Thanks again.

oddsson commented 8 months ago

This would definitely work for our use case. If you are happy, I'd really like to see this merged so we can start using it 💯

alexharri commented 8 months ago

@oddsson beygla@1.4.0 has been released to npm. Let me know if you run into any issues!