Open ratoaq2 opened 1 year ago
Thanks for the suggestion! I think it's OK for babelfish to provide such a converter 🤩
The `fromdemonym` approach 👍
`fromname`, though, is based on the ISO standard and could create some confusion 🤔 Would you be able to contribute it?
I worked a bit more on my issue and I'll really need to go with the guessit approach for it, since it might have so many variations and should also detect the name in the native language. So `babelfish.Language.fromname('Brazilian Portuguese')` would not be useful for me (at least for now).
But the country converter (`babelfish.Country.fromdemonym('French')`) still looks like a nice addition from my perspective.
Yes, I can contribute it. The main issue that I see is that there's no ISO standard behind it, so it's a matter of checking which source (or collection of sources) to use.
Regarding the language part, as you said, it might be confusing to mix it with `fromname`. And babelfish also has languages with long names with spaces. I'm still thinking about if and how this converter should work.
So, I propose to leave the language out for the time being and when I get a bit of time on this, I can create a PR with a country converter to be evaluated.
Works for me! Thanks for spending time on this! 🙏
I'm looking for a way to better identify languages in media tracks (mainly audio and subtitle tracks). Usually the default tags from media tracks are not precise. Very rarely do you get an audio or subtitle track with the correct IETF tag for `pt-BR` or `es-MX` or other languages. 99% of the time they are just marked as `pt` or `es`, and it's very common to have 2 or more tracks with the same language code.
In order to solve this, most likely an approach like `guessit` is needed. While analysing a large dataset of audio and subtitle tracks, I found that part of them use the official language name in English with the country demonym.
I know babelfish is a very concise library that does one thing and does it well. And to solve this issue I'll need to create extensions (language and country converters) that are outside babelfish's scope.
But this little piece related to country demonyms seems like a nice feature to include in babelfish. Maybe something like this:
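A minimal sketch of what a demonym lookup could do, written as a standalone function rather than as babelfish code. The `DEMONYM_TO_ALPHA2` table and the `country_from_demonym` name are hypothetical and only for illustration; babelfish does not currently provide a `fromdemonym` converter, and a real one would load its data from a vetted source.

```python
# Hypothetical English demonym table, keyed by lowercase demonym.
# Values are ISO 3166-1 alpha-2 country codes.
DEMONYM_TO_ALPHA2 = {
    "brazilian": "BR",
    "french": "FR",
    "mexican": "MX",
    "portuguese": "PT",
    "spanish": "ES",
}


def country_from_demonym(demonym: str) -> str:
    """Return the alpha-2 country code for an English demonym.

    Raises ValueError when the demonym is not in the table.
    """
    key = demonym.strip().lower()
    try:
        return DEMONYM_TO_ALPHA2[key]
    except KeyError:
        raise ValueError(f"unknown demonym: {demonym!r}") from None


print(country_from_demonym("French"))  # FR
```

The resulting alpha-2 code is the kind of value that `babelfish.Country` is constructed from, so a converter along these lines would mostly be a matter of choosing and maintaining the data.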
I believe babelfish could have at least the demonyms in English and use them to parse the language. I could try to contribute this part if you think it makes sense for it to be part of babelfish.
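As a rough illustration of using demonyms to parse a track label, the sketch below turns a label such as "Brazilian Portuguese" into an IETF-style tag. Both lookup tables and the `guess_tag` function are hypothetical and deliberately tiny. The sketch checks for a full language name first, so a multi-word name like "Scottish Gaelic" is not mistaken for a demonym plus a language.

```python
from typing import Optional

# Hypothetical tables for illustration only; real data would be much larger.
DEMONYM_TO_ALPHA2 = {"brazilian": "BR", "mexican": "MX", "canadian": "CA"}
LANGUAGE_TO_CODE = {
    "portuguese": "pt",
    "spanish": "es",
    "french": "fr",
    "scottish gaelic": "gd",  # multi-word language name
}


def guess_tag(label: str) -> Optional[str]:
    """Guess an IETF-style tag from an English track label, or return None."""
    words = label.lower().split()
    text = " ".join(words)

    # 1. The whole label is a known language name (possibly multi-word).
    if text in LANGUAGE_TO_CODE:
        return LANGUAGE_TO_CODE[text]

    # 2. Otherwise treat the first word as a demonym and the rest as the language.
    if len(words) >= 2:
        country = DEMONYM_TO_ALPHA2.get(words[0])
        language = LANGUAGE_TO_CODE.get(" ".join(words[1:]))
        if country and language:
            return f"{language}-{country}"

    return None


print(guess_tag("Brazilian Portuguese"))  # pt-BR
print(guess_tag("Scottish Gaelic"))       # gd
```

Labels in the native language or with other variations would still fall through to `None` here, which is where a guessit-style approach would take over.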
Some references:
https://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations
https://github.com/porimol/countryinfo#demonym
https://gist.github.com/consti/e2c7ddc64f0aa044a8b4fcd28dba0700
https://github.com/mledoze/countries/blob/master/countries.json