apertium / apertium-eng-spa

Apertium translation pair for English and Spanish
GNU General Public License v2.0

Find out if WikDict dictionaries can be used to improve vocabulary #4

Open karlb opened 6 years ago

karlb commented 6 years ago

I'm the developer of http://www.wikdict.com and I'm considering using the generated dictionaries to improve Apertium. If this works out for one language pair, I'll be able to provide the same for many additional language pairs. The data originally comes from Wiktionary and is licensed under CC-BY-SA 3.0. The same process might be usable for the dictionaries from http://www.freedict.org, but those are less homogeneous, so I'll leave that for later.

I've made a quick first attempt at converting entries and would like some feedback on the current state.

Example:

<e><p><l>house<s n="n"></l><r>casa</r></p></e>
<e><p><l>house<s n="v"></l><r>alojar</r></p></e>
<e><p><l>house<s n="v"></l><r>envolver</r></p></e>
<e><p><l>house<s n="v"></l><r>almacenar</r></p></e>
<e><p><l>house<s n="v"></l><r>albergar</r></p></e>
<e><p><l>house<s n="v"></l><r>hospedar</r></p></e>
<e><p><l>house<s n="v"></l><r>encajar</r></p></e>

Full data at: http://download.wikdict.com/apertium/

Things to note:

My main question is: how close is this to being usable for Apertium, and what are the minimum to-dos before it can be used at all? It's obvious to me that this is not ready yet, but I would like a realistic idea of whether I can get it into a usable state at all before taking on more complicated steps.

xavivars commented 5 years ago

There are some constraints: for this to work easily, the same word can appear in only one entry of the bilingual dictionary.

As an example, extracted from your file, I can see the following:

<e><p><l>zinc<s n="n"></l><r>cinc</r></p></e>
<e><p><l>zinc<s n="n"></l><r>zinc</r></p></e>

On top of the missing POS in Spanish, as you already mention, this should be changed to either

<e><p><l>zinc<s n="n"></l><r>cinc</r></p></e>
<e r="RL"><p><l>zinc<s n="n"></l><r>zinc</r></p></e>

(restricting the second entry from Spanish to English only) or just

<e><p><l>zinc<s n="n"></l><r>cinc</r></p></e>

removing the second entry completely.
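
The first option can be sketched in Python. This is only a minimal illustration, not part of Apertium's tooling: `restrict_duplicates` is a hypothetical helper, and it assumes one entry per line in exactly the form shown above. It keeps the first entry for each left side unrestricted and marks every later one with `r="RL"`; which translation deserves to be the default remains a linguistic decision that the script cannot make.

```python
import re

# Captures the left side of an entry, e.g. zinc<s n="n">.
# Hypothetical helper for illustration; not an Apertium tool.
LEFT_RE = re.compile(r"<l>(.*?)</l>")


def restrict_duplicates(lines):
    """Leave the first entry per left side as-is; mark later ones r="RL"."""
    seen = set()
    out = []
    for line in lines:
        match = LEFT_RE.search(line)
        if match is None or not line.startswith("<e>"):
            out.append(line)  # not a plain, unrestricted entry: leave untouched
            continue
        left = match.group(1)
        if left in seen:
            # Same left side seen before: restrict this entry to right-to-left.
            line = '<e r="RL">' + line[len("<e>"):]
        else:
            seen.add(left)
        out.append(line)
    return out


entries = [
    '<e><p><l>zinc<s n="n"></l><r>cinc</r></p></e>',
    '<e><p><l>zinc<s n="n"></l><r>zinc</r></p></e>',
]
result = restrict_duplicates(entries)
for line in result:
    print(line)
```

Because the order of the input decides which entry stays unrestricted, the converted file would need to be sorted by preference before running something like this.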

Then, we would need to make sure the words exist in both monolingual dictionaries.
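
A rough way to check this, again only as a sketch: assume the lemmas of both monolingual dictionaries have already been collected into Python sets (how to extract them is not shown here), and report every bilingual entry whose left or right lemma is missing. The function name and the sample sets below are made up for illustration, and multiword lemmas are not handled.

```python
import re

# Captures the left and right sides of one bilingual entry line.
SIDE_RE = re.compile(r"<l>(.*?)</l><r>(.*?)</r>")


def lemma(side):
    """Return the text before the first <s ...> tag, i.e. the bare lemma."""
    return side.split("<s", 1)[0]


def missing_from_monodix(lines, left_lemmas, right_lemmas):
    """List (left, right) lemma pairs where either lemma is unknown."""
    missing = []
    for line in lines:
        match = SIDE_RE.search(line)
        if match is None:
            continue  # skip lines that are not simple entries
        left, right = lemma(match.group(1)), lemma(match.group(2))
        if left not in left_lemmas or right not in right_lemmas:
            missing.append((left, right))
    return missing


# Assumed, hand-written lemma sets standing in for the real monolingual dictionaries.
eng_lemmas = {"house", "zinc"}
spa_lemmas = {"casa", "cinc"}

entries = [
    '<e><p><l>house<s n="n"></l><r>casa</r></p></e>',
    '<e><p><l>zinc<s n="n"></l><r>cinc</r></p></e>',
    '<e><p><l>zinc<s n="n"></l><r>zinc</r></p></e>',
]
report = missing_from_monodix(entries, eng_lemmas, spa_lemmas)
print(report)
```

Entries that show up in the report would either have to be dropped or have their lemmas added to the corresponding monolingual dictionary first.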

One last thing is language differences: in Spanish, nouns have a gender, and Apertium's bilingual dictionaries need to be tweaked to transfer gender appropriately across languages.

In summary: your approach would probably work between similar languages (Romance languages: French, Italian, Spanish, Catalan, Portuguese...), which all share more or less the same characteristics, but it would be a bit harder to make it work for other language pairs.

karlb commented 5 years ago

Thanks for the feedback! Can anyone judge how good or bad the license situation is?

ftyers commented 5 years ago

@karlb there is no problem with the licence situation

@xavivars it's better to just leave the entries without direction restrictions, and deal with the defaults in the .lrx files (imho).