Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Import Icelandic-Czech sentences from hvalur.org #2632

Open Yorwba opened 3 years ago

Yorwba commented 3 years ago

Last week, chejnik from the Icelandic-Czech dictionary hvalur.org asked in the XMPP chat about using Tatoeba's exported data. I explained that there currently aren't many directly-linked sentence pairs in those two languages, but that the indirectly-linked pairs might be usable with caveats. Then chejnik made the counter-offer of adding hvalur.org's existing example sentences to Tatoeba.

The original data is available in CSV format from hvalur.org's download page, but I noticed that some sentences use slashes or parentheses to indicate alternative translations, so I cleaned those up (reducing the line count from 2494 to 2473) and converted the file to our usual TSV format while I was at it. I uploaded the result as a GitHub gist.
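For anyone who wants to reproduce a similar cleanup, here is a rough sketch in Python. It is not the script that produced the gist. It assumes the CSV has exactly two columns (Icelandic, Czech), that "cleaning up" means dropping any pair in which either sentence contains a slash or parenthesis, and that the target TSV is simply the two sentences separated by a tab. The function names and the sample sentences are made up for illustration; the real export may be laid out differently, and some pairs may be better rewritten by hand than dropped.

```python
import csv
import io

# Characters assumed to mark alternative translations.
ALTERNATIVE_MARKERS = ("/", "(", ")")


def has_alternatives(sentence: str) -> bool:
    """Return True if the sentence contains a slash or a parenthesis."""
    return any(marker in sentence for marker in ALTERNATIVE_MARKERS)


def csv_to_tsv(csv_text: str) -> str:
    """Convert two-column CSV text to TSV, dropping pairs with alternatives.

    Assumption: every row is (Icelandic sentence, Czech sentence).
    Rows with a different number of columns are skipped as malformed.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # malformed row
        icelandic, czech = (field.strip() for field in row)
        if has_alternatives(icelandic) or has_alternatives(czech):
            continue  # ambiguous pair, leave it out
        if "\t" in icelandic or "\t" in czech:
            continue  # a tab inside a sentence would break the TSV
        lines.append(f"{icelandic}\t{czech}")
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Invented sample rows, not taken from the hvalur.org data.
    sample = (
        "Ég er hér.,Jsem tady.\n"
        '"Góðan dag/daginn.",Dobrý den.\n'
        "Hann (hún) sefur.,On (ona) spí.\n"
    )
    print(csv_to_tsv(sample), end="")
```

Dropping whole pairs is the most conservative choice, since expanding "a/b" into two sentences automatically is easy to get wrong when the alternatives are more than one word long.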

Related: #2256

ckjpn commented 3 years ago

This is something I could easily import for them, assigning their username as the owner, if the "mass import" function ever gets migrated.

So once the following issue is resolved, this one can be taken care of.

https://github.com/Tatoeba/tatoeba2/issues/1762