Use bcp-47 tags for languages

kavitharaju commented 3 years ago

We now use 3 letter codes to uniquely identify languages. Switch to bcp-47 tags which would give more granularity.

Things to consider

the language codes are used in naming dynamic tables and suggestion model(trie) files. Presence of special characters(-,.) in the codes should not cause issues here
do we need to treat languages as having a hierarchical relation or should they be treated as independent. For example if we have a translation available for en-hi pair, should it be made available to a user doing en-hi-aa or en-us-hi?
Do we pre-load langauges in DB? Do we allow user to add new language? If yes, how do we validate them?

kavitharaju commented 3 years ago

I have gone through various resources that were shared for languages and language codes, and found:

IANA(Internet Assigned Numbers Authority) has a subtag registry that list all valid 2- or 3- letter language codes for 8213 languages and 209 scripts
unfoldingWord's translation database has 8,934 languages and they have additional information of country(even if it is not specified in code), is the language a Gateway language, and which is the gateway language for the language.
venea had a page 279 lanuages. Looking at some langauges and varients here(eg: Arabic) I see they may be listing same langauges listed in the translation database but in different ways. In the case of Arabic, the venea had the varients spoken in different countries as ar-COUNTRYCODE where as uW had different 3-letter codes for each. The ar-COUNTRYCODE system would allow us to identify in the heirarchy
An article with Guidelines on how to form language tags

Here are my thoughts on table schema and adding entries to it

CREATE TABLE public.languages (
    language_id SERIAL PRIMARY KEY,
    language_code text UNIQUE NOT NULL,
    language_name text NOT NULL,
    county text,
    is_gateway bool,
    gateway text,
    script text,
    script_direction text DEFAULT 'left-to-right',
    created_at timestamp with time zone DEFAULT NOW(),
    created_user int NULL,
    last_updated_at  timestamp with time zone DEFAULT NOW(),
    last_updated_user int NULL,

);

OR

CREATE TABLE public.languages (
    language_id SERIAL PRIMARY KEY,
    language_code text UNIQUE NOT NULL,
    language_name text NOT NULL,
    metadata jsonb/text,
    script_direction text DEFAULT 'left-to-right'
    created_at timestamp with time zone DEFAULT NOW(),
    created_user int NULL,
    last_updated_at  timestamp with time zone DEFAULT NOW(),
    last_updated_user int NULL,
);

In second option we could add info about script name, country name, alternate names, script name, gateway etc as available to the metadata. If this field is of text data type it would become useful in the partial matching requirement Revant was raising.

For loading data, I am thinking of taking a union of all 3 listed above and keeping it as CSV which can be loaded to DB via seed_DB creation script. The CSV creation would be a one time process, outside of the app.

For creating new languages, we can ask the user for individual info like primary language, script, region/country and form the code based on that. For this we could use the valid codes from iana registry and also will have to give user proper quidelines on how to select them based on the article linked above. This would also require us to create additional APIs for valid language codes, script codes and region codes.

Regarding the naming of tables and files with language codes when they have "-" in them, we had decided to simply remove hyphens. But I see that the codes can have varying number of letters, which could bring in problems. So I am thinking of following a pattern of <lowercase language code><Title case script code><UPPERCASE COUNTRY CODE><everything else lowercase> without hyphen, which I read is a commonly followed convention. My current assumption is that this would work but not sure about all the possibilities.

About editing the entries via PUT API, we can allow the same user who has created the entry or super admin to edit them.

NOTE: I haven't found a source for script direction for these languages anywhere

@joelthe1 What are your thoughts on these?

joelthe1 commented 3 years ago

I think this makes sense. I like the 2nd option better but we would need to have a few util functions that can reliably return the parts of the language code. Also, I wonder if it makes sense to have a column to connect related language codes (e.g. similar_to) that we can pre-seed (along with string matching) for finding connected languages.

For loading data, union sounds good. And I like the idea of having a defined API for language entry (though I feel it would be good to have one POST/PUT API with the parts of the language passed to it as JSON).

The naming convention is okay (though I am wondering of _ and | separators) but we would need to think of the case when we have created and named a table after a language code and then allow the user to edit the language code later.

I can help with finding more info on script direction. Besides, I see this as an initial iteration and we should be okay with updating fields a little later.

kavitharaju commented 3 years ago

Also, I wonder if it makes sense to have a column to connect related language codes (e.g. similar_to) that we can pre-seed (along with string matching) for finding connected languages.

How do we make/identify this connection? Some options coming to my mind, based on data available to us now, are:

Common Gateway language
Common county/region
Similar names(do not how effective it would be)

kavitharaju commented 3 years ago

About naming, I see that postgres would convert table name to lower case. So the naming convention I suggested might not work to keep all possibilities unique. But I also see that they have an option for quoted table names where special characters are also allowed.So may be we can use names like "en-US_ESV_bible". Need to check its compatibility with SQL Alchemy as we are creating and accessing the tables via it.

joelthe1 commented 3 years ago

I am wondering if we keep a lookup table for linking to the correct table name. I feel that is a little flaky since it is pretty difficult to ascertain all possible ways a table name could be formed by concatenating parts of the name of a project.

kavitharaju commented 3 years ago

I can help with finding more info on script direction.

I found that info is in translation database. Didn't notice as they use abbreviations

Bridgeconn / vachan-api

Use bcp-47 tags for languages #139