Helsinki-NLP / OPUS-API

API for searching corpora from OPUS

Uppercase and lowercase duplicate languages #4

Open gramirez-prompsit opened 11 months ago

gramirez-prompsit commented 11 months ago

Hi! Some of the languages in the API appear twice, with casing as the only difference (e.g. en_ZA and en_za, es_CL and es_cl). When searching for language pairs, the API usually answers with "resources not found" for the lowercased version (e.g. en_ZA-fr is OK, but en_za-fr is not).

Could you please take a look? Thanks!

jorgtied commented 10 months ago

Could this be handled by the frontend (mapping all codes to some standard)? Otherwise, we are also working on improvements to OPUS-API to make searching for language IDs more flexible and consistent.
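
A minimal sketch of what such a frontend mapping might look like, assuming the "standard" form is a lowercase language code followed by an uppercase two-letter region (as in en_ZA). The function name `normalize_lang_code` is hypothetical and not part of OPUS-API; codes with a longer suffix are left untouched because the thread does not say how they should be handled.

```python
def normalize_lang_code(code: str) -> str:
    """Map codes such as 'en_za' or 'EN_ZA' to 'en_ZA' (assumed standard form)."""
    lang, sep, region = code.strip().partition("_")
    if not sep:
        # No region part, e.g. 'fr': just lowercase the language.
        return lang.lower()
    # Only uppercase two-letter suffixes; anything else is kept as given.
    if len(region) == 2:
        region = region.upper()
    return f"{lang.lower()}_{region}"


for raw in ["en_za", "es_CL", "fr"]:
    print(raw, "->", normalize_lang_code(raw))
```

Normalizing before building a language pair would mean a query is always sent with the en_ZA-style code that the issue reports as working.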

gramirez-prompsit commented 10 months ago

No problem! We will filter out these:

"zh_tw" "zh_hk" "zh_cn" "pt_br" "nn_no" "nb_no" "hi_in" "fr_ca" "es_mx" "es_cl" "es_ar" "en_za" "en_gb" "en_ca" "bn_in"
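
The filtering described above could be sketched roughly as follows. This is only an illustration: the names `LOWERCASE_DUPLICATES` and `drop_lowercase_duplicates` are hypothetical, and the input is assumed to be a plain list of language codes as returned by the API.

```python
# The lowercased duplicates listed in the comment above.
LOWERCASE_DUPLICATES = frozenset({
    "zh_tw", "zh_hk", "zh_cn", "pt_br", "nn_no",
    "nb_no", "hi_in", "fr_ca", "es_mx", "es_cl",
    "es_ar", "en_za", "en_gb", "en_ca", "bn_in",
})


def drop_lowercase_duplicates(languages):
    """Return the codes in their original order, minus the known duplicates."""
    return [code for code in languages if code not in LOWERCASE_DUPLICATES]


print(drop_lowercase_duplicates(["en_ZA", "en_za", "fr", "es_cl"]))
```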