Bridgeconn / vachan-api

The backend server that hosts all the API logic for the Vachan Engine.

Use bcp-47 tags for languages #139

Closed: kavitharaju closed this issue 3 years ago

kavitharaju commented 3 years ago

We currently use 3-letter codes to uniquely identify languages. Switch to bcp-47 tags, which would give more granularity.

Things to consider

kavitharaju commented 3 years ago

I have gone through various resources that were shared for languages and language codes, and found:

Here are my thoughts on table schema and adding entries to it

CREATE TABLE public.languages (
    language_id SERIAL PRIMARY KEY,
    language_code text UNIQUE NOT NULL,
    language_name text NOT NULL,
    country text,
    is_gateway bool,
    gateway text,
    script text,
    script_direction text DEFAULT 'left-to-right',
    created_at timestamp with time zone DEFAULT NOW(),
    created_user int NULL,
    last_updated_at timestamp with time zone DEFAULT NOW(),
    last_updated_user int NULL
);

OR

CREATE TABLE public.languages (
    language_id SERIAL PRIMARY KEY,
    language_code text UNIQUE NOT NULL,
    language_name text NOT NULL,
    metadata jsonb,  -- or text; see the note below on partial matching
    script_direction text DEFAULT 'left-to-right',
    created_at timestamp with time zone DEFAULT NOW(),
    created_user int NULL,
    last_updated_at timestamp with time zone DEFAULT NOW(),
    last_updated_user int NULL
);

In the second option we could add info about script name, country name, alternate names, gateway etc. to the metadata field, as available. If this field is of text data type, it would also be useful for the partial-matching requirement Revant was raising.
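To make the partial-matching idea concrete, here is a minimal sketch (not actual project code; the connection string, function name and search term are placeholders) of how a text metadata column could be searched alongside code and name:

    # Sketch: partial matching over code, name and a text `metadata` column.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:password@localhost/vachan_db")  # placeholder DSN

    def search_languages(term):
        """Return languages whose code, name or metadata partially match `term`."""
        query = text(
            "SELECT language_code, language_name FROM public.languages "
            "WHERE language_code ILIKE :pat "
            "OR language_name ILIKE :pat "
            "OR metadata ILIKE :pat"
        )
        with engine.connect() as conn:
            return conn.execute(query, {"pat": f"%{term}%"}).fetchall()

If metadata ends up as jsonb rather than text, the last condition would need a cast, e.g. metadata::text ILIKE :pat.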

For loading data, I am thinking of taking a union of all 3 resources listed above and keeping it as a CSV, which can be loaded into the DB via the seed_DB creation script. The CSV creation would be a one-time process, outside of the app.
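A rough sketch of what that seed step could look like, assuming the CSV columns mirror the proposed schema (the file name, column names and connection string here are placeholders, not decided values):

    # Sketch of a seed step: read the one-time CSV and insert rows into public.languages.
    import csv
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:password@localhost/vachan_db")  # placeholder DSN

    def seed_languages(csv_path="languages.csv"):
        insert_stmt = text(
            "INSERT INTO public.languages (language_code, language_name, metadata) "
            "VALUES (:language_code, :language_name, :metadata) "
            "ON CONFLICT (language_code) DO NOTHING"
        )
        with open(csv_path, newline="", encoding="utf-8") as fh, engine.begin() as conn:
            for row in csv.DictReader(fh):
                conn.execute(insert_stmt, row)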

For creating new languages, we can ask the user for individual info like primary language, script and region/country, and form the code based on that. For this we could use the valid codes from the IANA registry, and we will also have to give the user proper guidelines on how to select them, based on the article linked above. This would also require us to create additional APIs for valid language codes, script codes and region codes.
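That flow could look something like the sketch below. The subtag sets here are tiny placeholders; in practice they would be loaded from the IANA language subtag registry, and the function name is only illustrative:

    # Sketch: form a bcp-47 tag from user-supplied parts, validating each part
    # against sets of valid subtags.
    VALID_LANGUAGES = {"en", "hi", "ml", "ta"}        # placeholder; load from IANA registry
    VALID_SCRIPTS = {"Latn", "Deva", "Mlym", "Taml"}  # placeholder
    VALID_REGIONS = {"US", "IN"}                      # placeholder

    def build_language_tag(language, script=None, region=None):
        if language.lower() not in VALID_LANGUAGES:
            raise ValueError(f"unknown language subtag: {language}")
        parts = [language.lower()]
        if script:
            if script.title() not in VALID_SCRIPTS:
                raise ValueError(f"unknown script subtag: {script}")
            parts.append(script.title())
        if region:
            if region.upper() not in VALID_REGIONS:
                raise ValueError(f"unknown region subtag: {region}")
            parts.append(region.upper())
        return "-".join(parts)

    # build_language_tag("hi", "Deva", "IN") -> "hi-Deva-IN"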

Regarding the naming of tables and files with language codes when they have "-" in them, we had decided to simply remove the hyphens. But I see that the codes can have a varying number of letters, which could bring in problems. So I am thinking of following a pattern of <lowercase language code><Title case script code><UPPERCASE COUNTRY CODE><everything else lowercase>, without hyphens, which I read is a commonly followed convention. My current assumption is that this would work, but I am not sure about all the possibilities.
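As a sketch, that convention could be a small helper like the one below (a hypothetical function, not existing code): lowercase the primary language, Title-case 4-letter script subtags, UPPERCASE 2-letter country/region subtags, lowercase everything else, and drop the hyphens:

    # Sketch: turn a bcp-47 tag into the proposed hyphen-free table/file name.
    def tag_to_table_name(tag):
        parts = tag.split("-")
        out = [parts[0].lower()]                     # primary language subtag
        for part in parts[1:]:
            if len(part) == 4 and part.isalpha():    # script subtags, e.g. Deva
                out.append(part.title())
            elif len(part) == 2 and part.isalpha():  # country/region subtags, e.g. IN
                out.append(part.upper())
            else:                                    # variants, extensions, digits
                out.append(part.lower())
        return "".join(out)

    # tag_to_table_name("hi-Deva-IN") -> "hiDevaIN"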

About editing the entries via the PUT API, we can allow the same user who created the entry, or a super admin, to edit them.

NOTE: I haven't found a source for script direction for these languages anywhere

@joelthe1 What are your thoughts on these?

joelthe1 commented 3 years ago

I think this makes sense. I like the 2nd option better, but we would need a few util functions that can reliably return the parts of the language code. Also, I wonder if it makes sense to have a column to connect related language codes (e.g. similar_to) that we can pre-seed (along with string matching) for finding connected languages.
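A simplified sketch of the kind of util that could return those parts (only the common language-script-region shape is handled here; a complete parser would follow RFC 5646 exactly, and the function name is just illustrative):

    # Simplified sketch of a util that returns the parts of a bcp-47 tag.
    def split_language_tag(tag):
        parts = tag.split("-")
        result = {"language": parts[0].lower(), "script": None, "region": None, "rest": []}
        for part in parts[1:]:
            if result["script"] is None and len(part) == 4 and part.isalpha():
                result["script"] = part.title()
            elif result["region"] is None and (
                (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
            ):
                result["region"] = part.upper()
            else:
                result["rest"].append(part.lower())
        return result

    # split_language_tag("hi-Deva-IN")
    # -> {"language": "hi", "script": "Deva", "region": "IN", "rest": []}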

For loading data, union sounds good. And I like the idea of having a defined API for language entry (though I feel it would be good to have one POST/PUT API with the parts of the language passed to it as JSON).
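For example, the request body for such an API could look like this (the field names and the use of Pydantic are assumptions for illustration, not a decided schema):

    # Sketch of a possible JSON request body for one POST/PUT language API.
    from typing import Optional
    from pydantic import BaseModel

    class LanguageEdit(BaseModel):
        language: str                  # primary language subtag, e.g. "hi"
        script: Optional[str] = None   # e.g. "Deva"
        region: Optional[str] = None   # e.g. "IN"
        language_name: str
        metadata: Optional[dict] = None

    # The server could then compose the bcp-47 tag from the parts it received.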

The naming convention is okay (though I am wondering about _ and | as separators), but we would need to think about the case where we have created and named a table after a language code and then allow the user to edit the language code later.

I can help with finding more info on script direction. Besides, I see this as an initial iteration and we should be okay with updating fields a little later.

kavitharaju commented 3 years ago

> Also, I wonder if it makes sense to have a column to connect related language codes (e.g. similar_to) that we can pre-seed (along with string matching) for finding connected languages.

How do we make/identify this connection? Some options coming to my mind, based on data available to us now, are:

kavitharaju commented 3 years ago

About naming, I see that Postgres converts table names to lower case. So the naming convention I suggested might not work to keep all possibilities unique. But I also see that they have an option for quoted table names where special characters are also allowed. So maybe we can use names like "en-US_ESV_bible". We need to check its compatibility with SQLAlchemy, as we are creating and accessing the tables via it.
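A quick compatibility check could look like the sketch below (placeholder columns and connection string). SQLAlchemy quotes identifiers that contain special characters or mixed case when it emits DDL, so a name with "-" and uppercase letters should survive as-is:

    # Sketch: create a table whose name keeps the "-" and mixed case, relying on
    # SQLAlchemy's automatic identifier quoting.
    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Text

    engine = create_engine("postgresql://user:password@localhost/vachan_db")  # placeholder DSN
    metadata_obj = MetaData()

    table = Table(
        "en-US_ESV_bible",
        metadata_obj,
        Column("verse_id", Integer, primary_key=True),
        Column("verse_text", Text),
    )
    metadata_obj.create_all(engine)  # emits CREATE TABLE "en-US_ESV_bible" (...)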

joelthe1 commented 3 years ago

I am wondering if we should keep a lookup table for linking to the correct table name. Deriving the name directly feels a little flaky, since it is pretty difficult to ascertain all the possible ways a table name could be formed by concatenating parts of the name of a project.

kavitharaju commented 3 years ago

> I can help with finding more info on script direction.

I found that this info is in the translation database. I didn't notice it earlier as they use abbreviations.