freckle / bcp47

BCP-47 for Haskell
https://freckle.github.io/bcp47/
2 stars 1 forks source link

Stop using enumerative libraries for Language and Region #32

Closed pbrisbin closed 2 years ago

pbrisbin commented 2 years ago

This is related to (and may solve) #21, which describes limitations that we inherit by using iso639 (for what we call Language) and country (for what we call Region). I'm motivated to solve this (and in the way I'll describe) because country is also no longer in Stackage. Bringing it back into Stackage is high-effort because many of its transitive dependents are also no longer in Stackage -- therefore, I'm interested in cutting this dependency (and solving #21 along the way).


As I understand it, the primary use-case for bcp47 is the Trie type, which is a map-like structure for looking up some value by a language-region/script/etc key (BCP47 ~ (Language, [Region|...])). To achieve this functionality as stated, the type we use for the language and region components only needs to be a unique, case-insensitive input in such keys.

We are using the above libraries because they maintain enumerations of known codes. We use complete parsers to ensure the input strings are valid, then we map the parsed textual values into these enumerations to ensure they're known. This is a double-edged sword, as you can see with #21; we would need these libraries to be up to date and flexible enough to support the widest range of values possible.

I'm starting to think we just don't need that final step. As I understand it, the primary value in enforcing known-ness is to prevent use of unknown values through typo. However,

I think that our parsers are sufficient to ensure users enter a valid identifier, and we can use it opaquely without enforcing it's known. Therefore, my proposal is to stop doing it.

I have a branch locally that fully encapsulates the library-provided types behind our own (which I think we should do regardless of the outcome of this Issue),

-- BCP47.Internal.Language
newtype Language = Language
  { unLanguage :: ISO639_1
  }
  deriving stock (Show, Eq, Ord)

-- BCP47.Internal.Region
newtype Region = Region
  { unRegion :: Country
  }
  deriving stock (Show, Eq, Ord)
  deriving newtype (Bounded, Enum)

My proposal would be to then change it to,

 -- BCP47.Internal.Language
 newtype Language = Language
-  { unLanguage :: ISO639_1
+  { unLanguage :: CI.ByteString
   }
   deriving stock (Show, Eq, Ord)

 -- BCP47.Internal.Region
 newtype Region = Region
-  { unRegion :: Country
+  { unRegion :: CI.ByteString
   }
   deriving stock (Show, Eq, Ord)
-  deriving newtype (Bounded, Enum)

That is, just allow and use any case-insensitive textual value that passes our parsers.

@eborden WDYT?

pbrisbin commented 2 years ago

As I understand it, the primary value in enforcing known-ness is to prevent use of unknown values through typo. However,

OK, so it's more than this. Country can parse the 2-alpha, 3-alpha, and 3-digit values to the same concrete value. So, AF, AFG and 004 would all become Afghanistan and compare as the same thing. If we moved to an opaque Region text, these would be 3 distinct values.

I don't think Language has that issue, so we could still move to an opaque textual value there and (I think) solve some of #21. However, Country is the more annoying dependency :thinking:

eborden commented 2 years ago

Language also has this issue. ISO-639-3 added 3 letter language codes so you could represent English as en or eng and those should be equivalent.

pbrisbin commented 2 years ago

Ah yeah, I guess things won't be that easy.