Helsinki-NLP / OPUS

The Open Parallel Corpus

Unable to search or download #13

Open sinaahmadi opened 6 months ago

sinaahmadi commented 6 months ago

Hi,

The new website is sleek! However, it seems to have some glitches when it comes to searching or downloading. I have noticed this particularly for languages whose codes contain the script name, such as "Central Kurdish" or "Kurdish (Arabic)".

When trying to download NLLB for that language (here: https://opus.nlpl.eu/NLLB/en&ku-Arab/v1/NLLB), searching doesn't return anything. If I first search NLLB for a pair that works, such as Tamil-English (ta-eng), I can then search for the other language code, but the download links still point to the previous pair. Ultimately, I get this error at https://opus.nlpl.eu/sample/ku-Arab&/NLLB&v1/sample: "We're sorry, no samples for Kurdish (Arabic) (ku-Arab) - in the [NLLB](https://opus.nlpl.eu/NLLB/ku-Arab&/v1/NLLB) dataset, version v1 were found."

Thanks for your help.

jorgtied commented 4 months ago

We are looking into this. It seems to be a problem with the OPUS-API: the language pair does not show up for some reason. The issue might be related to the way the pair is specified in the metadata (it says ku_Arab-en instead of en-ku_Arab; in OPUS the language pair is typically specified by alphabetically sorted language IDs).
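To illustrate the sorting convention mentioned above, here is a minimal Python sketch. The helper name `canonical_pair` and its `sep` parameter are hypothetical and not part of the OPUS-API; the sketch only assumes that a pair is formed by joining the two language IDs in alphabetical order.

```python
def canonical_pair(lang_a: str, lang_b: str, sep: str = "-") -> str:
    """Join two language IDs in alphabetical order.

    Hypothetical helper for illustration only; not an OPUS-API function.
    """
    # sorted() compares the strings by code point, so "en" comes before "ku_Arab"
    first, second = sorted([lang_a, lang_b])
    return sep.join([first, second])


# The metadata order from this issue is normalized to the expected one.
print(canonical_pair("ku_Arab", "en"))  # en-ku_Arab
```

A lookup that normalizes both the requested pair and the stored pair this way would treat ku_Arab-en and en-ku_Arab as the same entry, which may be what is going wrong here if only one side is sorted.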

In the meantime, you could download the data from the links on the legacy NLLB OPUS site: https://opus.nlpl.eu/legacy/NLLB.php

sinaahmadi commented 4 months ago

Thanks. I have also contacted you many times about adding a few parallel corpora for Kurdish. Would you be able to add this one to OPUS, please? https://github.com/KurdishBLARK/InterdialectCorpus/tree/master

Thanks.