Diaoul / babelfish

BabelFish is a Python library to work with countries and languages
BSD 3-Clause "New" or "Revised" License

Add new converters which use country demonyms to convert languages #38

Open ratoaq2 opened 1 year ago

ratoaq2 commented 1 year ago

I'm looking for a way to better identify languages in media tracks (mainly audio and subtitle tracks). The default tags on media tracks are usually imprecise: an audio or subtitle track very rarely carries the correct IETF tag for pt-BR, es-MX, or other regional variants. 99% of the time they are just marked as pt or es, and it's very common to have two or more tracks with the same language code:

        {
            "codec": "SubRip/SRT",
            "id": 19,
            "properties": {
                "codec_id": "S_TEXT/UTF8",
                "codec_private_length": 0,
                "default_track": false,
                "enabled_track": true,
                "encoding": "UTF-8",
                "forced_track": false,
                "language": "por",
                "language_ietf": "pt",
                "number": 20,
                "text_subtitles": true,
                "track_name": "Português",
                "uid": 1602227994484803173
            },
            "type": "subtitles"
        },
        {
            "codec": "SubRip/SRT",
            "id": 20,
            "properties": {
                "codec_id": "S_TEXT/UTF8",
                "codec_private_length": 0,
                "default_track": false,
                "enabled_track": true,
                "encoding": "UTF-8",
                "forced_track": false,
                "language": "por",
                "language_ietf": "pt",
                "number": 21,
                "text_subtitles": true,
                "track_name": "Português (Brasil)",
                "uid": 17784914655403220205
            },
            "type": "subtitles"
        },

Solving this will most likely need an approach like guessit. While analysing a large dataset of audio and subtitle tracks, I found that part of them use the official language name in English combined with the country demonym:

Brazilian Portuguese
British English
American English
French Canadian

I know babelfish is a very concise library that does one thing and does it well. To solve this issue I'll need to create extensions (language and country converters) that are outside babelfish's scope.

But this little piece related to country demonyms seems like a nice feature to include in babelfish. Maybe something like this:

>>> import babelfish
>>> babelfish.Country.fromname('France')
<Country [FR]>
>>> babelfish.Country.fromdemonym('French')
<Country [FR]>

>>> import babelfish
>>> babelfish.Language.fromname('Portuguese')
<Language [pt]>
>>> babelfish.Language.fromname('Brazilian Portuguese')
<Language [pt-BR]>
>>> babelfish.Language.fromname('Swiss German')
<Language [de-CH]>

I believe babelfish could have at least the demonyms in English and use them to parse the language. I could try to contribute this part if you think it makes sense for it to be part of babelfish.
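To make the idea concrete, here is a minimal standalone sketch of demonym-based parsing. It does not use babelfish and is not its API: the function names `country_from_demonym` and `parse_language_name` and both lookup tables are hypothetical, hand-picked for illustration. A real converter would load a vetted demonym source.

```python
# Hypothetical, hand-picked tables for illustration only.
DEMONYM_TO_COUNTRY = {
    "american": "US",
    "brazilian": "BR",
    "british": "GB",
    "canadian": "CA",
    "french": "FR",
    "swiss": "CH",
}
LANGUAGE_NAME_TO_CODE = {
    "english": "en",
    "french": "fr",
    "german": "de",
    "portuguese": "pt",
}


def country_from_demonym(demonym):
    """Return the ISO 3166-1 alpha-2 code for an English demonym, or None."""
    return DEMONYM_TO_COUNTRY.get(demonym.strip().lower())


def parse_language_name(name):
    """Split a name such as 'Brazilian Portuguese' into ('pt', 'BR').

    Returns (language, None) for a plain language name and
    (None, None) when the name is not recognised.
    """
    words = name.strip().lower().split()
    if len(words) == 1:
        return LANGUAGE_NAME_TO_CODE.get(words[0]), None
    if len(words) == 2:
        first, second = words
        # Usual order: demonym, then language ("Brazilian Portuguese").
        if first in DEMONYM_TO_COUNTRY and second in LANGUAGE_NAME_TO_CODE:
            return LANGUAGE_NAME_TO_CODE[second], DEMONYM_TO_COUNTRY[first]
        # Reversed order: language, then demonym ("French Canadian").
        if first in LANGUAGE_NAME_TO_CODE and second in DEMONYM_TO_COUNTRY:
            return LANGUAGE_NAME_TO_CODE[first], DEMONYM_TO_COUNTRY[second]
    return None, None
```

Words like "French" are both a demonym and a language name, which is why the sketch checks both word orders instead of assuming the demonym always comes first.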

Some references:
https://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations
https://github.com/porimol/countryinfo#demonym
https://gist.github.com/consti/e2c7ddc64f0aa044a8b4fcd28dba0700
https://github.com/mledoze/countries/blob/master/countries.json

Diaoul commented 1 year ago

Thanks for the suggestion! I think it's OK for babelfish to provide such a converter 🤩

Would you be able to contribute it?

ratoaq2 commented 1 year ago

I worked a bit more on my issue and I'll really need to go with the guessit approach for it, since track names can have so many variations and the name should also be detected in the native language. So babelfish.Language.fromname('Brazilian Portuguese') would not be useful for me (at least for now).

But the country converter (babelfish.Country.fromdemonym('French')) still looks like a nice addition from my perspective. Yes, I can contribute it. The main issue I see is that there's no ISO standard behind demonyms, so it's a matter of deciding which source (or collection of sources) to use.
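One way to evaluate a collection of sources is to merge them and flag the demonyms they disagree on. The sketch below assumes each source can be reduced to a `{demonym: country code}` dictionary; the helper `merge_demonym_sources` and the two sample tables are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def merge_demonym_sources(sources):
    """Merge {demonym: country_code} tables.

    Returns (agreed, conflicts): demonyms that map to a single code across
    all sources, and demonyms that map to more than one code.
    """
    seen = defaultdict(set)
    for table in sources:
        for demonym, code in table.items():
            # Normalise case so "French"/"french" and "FR"/"fr" compare equal.
            seen[demonym.strip().lower()].add(code.upper())
    agreed = {d: next(iter(codes)) for d, codes in seen.items() if len(codes) == 1}
    conflicts = {d: sorted(codes) for d, codes in seen.items() if len(codes) > 1}
    return agreed, conflicts


# Hypothetical extracts from two sources, for illustration only.
source_a = {"French": "FR", "Dutch": "NL", "Congolese": "CD"}
source_b = {"french": "fr", "Congolese": "CG"}

agreed, conflicts = merge_demonym_sources([source_a, source_b])
```

Conflicting entries such as "Congolese", which is used for both Congo states, would then need a manual decision or could simply be left out of the converter.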

As for the language part, as you said, it might be confusing to mix it with fromname. babelfish also has languages with long names that contain spaces. I'm still thinking about whether and how this converter should work.

So I propose to leave the language part out for the time being; when I get a bit of time for this, I can create a PR with a country converter to be evaluated.

Diaoul commented 1 year ago

Works for me! Thanks for spending time on this! 🙏