facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Language identification - 'sh' code #964

Open einrogerst opened 4 years ago

einrogerst commented 4 years ago

The fastText language identification models support the language code 'sh' (https://fasttext.cc/docs/en/language-identification.html). However, this code is not listed in the ISO 639-2 code list (https://www.loc.gov/standards/iso639-2/php/code_list.php). It is unclear whether it refers to Shan (shn), Shona (sna), or some other language.

rspeer commented 4 years ago

I'm not a fasttext developer, but I came across this.

`sh` is the code for Serbo-Croatian. It's a deprecated ISO 639-1 code and is considered equivalent to Serbian (sr), Croatian (hr), or Bosnian (bs), three closely related but politically distinct languages that are mostly indistinguishable in text.
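
If you need current ISO codes downstream, one option is to remap the `sh` label after prediction. The snippet below is only a sketch: `normalize_label` and `SH_CANDIDATES` are hypothetical names, not part of fastText, and the choice of `sr` as the default fallback is an arbitrary assumption. It relies on fastText's default `__label__` prefix on predicted labels.

```python
# Hypothetical post-processing helper; not part of the fastText API.
LABEL_PREFIX = "__label__"  # fastText's default label prefix

# Languages that the deprecated code "sh" (Serbo-Croatian) may stand for.
SH_CANDIDATES = ("sr", "hr", "bs")


def normalize_label(label, sh_fallback="sr"):
    """Strip the label prefix and replace the deprecated 'sh' code.

    sh_fallback is an assumption: pick whichever of SH_CANDIDATES
    suits your data, since the text alone rarely distinguishes them.
    """
    if sh_fallback not in SH_CANDIDATES:
        raise ValueError(f"sh_fallback must be one of {SH_CANDIDATES}")
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return sh_fallback if code == "sh" else code


if __name__ == "__main__":
    for raw in ("__label__sh", "__label__en", "sh"):
        print(raw, "->", normalize_label(raw))
```

You would pass each label returned by the model's prediction call through `normalize_label`. Other codes are returned unchanged.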