Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Saami, Kven dictionaries with example sentences #587

Closed unhammer closed 5 years ago

unhammer commented 9 years ago

http://giellatekno.uit.no/words/dicts/dict-stardict.eng.html has some Saami and Kven dictionaries with example sentences that might be mass-added to Tatoeba. The SVN URL is https://victorio.uit.no/langtech/trunk/words/dicts (subdirs smenob, smanob and fkvnob), and the license is http://creativecommons.org/licenses/by/3.0/no/deed.en

It's probably easier to get the examples from the XML in SVN than from the StarDict files.

jiru commented 9 years ago

So if I understand correctly, the example sentences are in <x> and <xt> pairs. I don’t understand these languages, but I can see a few problems.

# missing initial capital
grep -r '<xt\?>[^[:upper:]]' dicts/smenob/src
grep -r '<xt\?>[^[:upper:]]' dicts/smanob/src
grep -r '<xt\?>[^[:upper:]]' dicts/fkvnob/src/

# annotations
grep -r '<xt\?>' dicts/smenob/src | grep '('
grep -r '<xt\?>' dicts/smanob/src | grep '('

So the sentences would probably need some manual review first.
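
For the review step, one option is to parse the XML into pairs instead of grepping single lines. The following is only a minimal sketch: it assumes each example is an `<x>` element directly followed by its `<xt>` translation under the same parent. The function name `iter_example_pairs` and the wrapper elements in the sample (`<r>`, `<e>`, `<xg>`) are made up for illustration and may not match the real dictionary schema.

```python
import xml.etree.ElementTree as ET


def iter_example_pairs(xml_text):
    """Yield (source, translation) tuples from adjacent <x>/<xt> siblings."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        children = list(parent)
        # Pair each <x> with the <xt> that directly follows it (assumed layout).
        for current, following in zip(children, children[1:]):
            if current.tag == "x" and following.tag == "xt":
                yield (
                    "".join(current.itertext()).strip(),
                    "".join(following.itertext()).strip(),
                )


# Hypothetical sample document; the wrapper element names are assumptions.
SAMPLE = """
<r>
  <e>
    <xg>
      <x>dodjalit earáide ovddasvástádusa</x>
      <xt>skyve ansvaret over på andre</xt>
    </xg>
  </e>
</r>
""".strip()

pairs = list(iter_example_pairs(SAMPLE))
print(pairs)
```

Keeping each `<x>` together with its `<xt>` means a reviewer always sees both halves of an example, which a line-based grep does not guarantee.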

unhammer commented 9 years ago

gillux writes:

> # missing initial capital
> grep -r '<xt\?>[^[:upper:]]' dicts/smenob/src
> grep -r '<xt\?>[^[:upper:]]' dicts/smanob/src
> grep -r '<xt\?>[^[:upper:]]' dicts/fkvnob/src/

As an example,

dicts/smenob/src/V_smenob.xml:               <x>dodjalit earáide ovddasvástádusa</x>
dicts/smenob/src/V_smenob.xml:               <xt>skyve ansvaret over på andre</xt>

means "push the responsibility onto someone else", so if you want complete sentences, you'd have to grep -v those, yeah.

> # annotations
> grep -r '<xt\?>' dicts/smenob/src | grep '('
> grep -r '<xt\?>' dicts/smanob/src | grep '('

dicts/smenob/src/Adv_smenob.xml:               <xt>Hvor (på kroppen) er du blitt operert?</xt>

means "Where (on the body) have you been operated on?"

> So the sentences would probably need some manual review first.

As always, I hope? :)
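
The greps in this thread test one line at a time, so filtering with grep -v could drop one half of a pair and keep the other. A pair-level check avoids that. This is a rough sketch only: `review_flags` and `pair_needs_review` are hypothetical helpers, and the two rules simply mirror the greps (first character not uppercase, an opening parenthesis present). They are not a full test for "complete sentence".

```python
def review_flags(text):
    """Return the reasons a single sentence should get manual review."""
    flags = []
    # Same idea as grep '<xt\?>[^[:upper:]]': first character is not uppercase.
    if not text[:1].isupper():
        flags.append("no initial capital")
    # Same idea as grep '(': an inline annotation may be present.
    if "(" in text:
        flags.append("contains parenthesis")
    return flags


def pair_needs_review(source, translation):
    """True if either side trips a check, so both halves are held back together."""
    return bool(review_flags(source) or review_flags(translation))


print(review_flags("skyve ansvaret over på andre"))
# ['no initial capital']
print(review_flags("Hvor (på kroppen) er du blitt operert?"))
# ['contains parenthesis']
print(pair_needs_review("dodjalit earáide ovddasvástádusa",
                        "skyve ansvaret over på andre"))
# True
```

Flagged pairs would go to a human reviewer rather than being discarded automatically, since a parenthesis or a lowercase start does not always mean the example is unusable.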

trang commented 5 years ago

We can take care of implementing the features people need to mass-import sentences, but we cannot take care of extracting and curating sentences from a dictionary or any other linguistic source.

Closing this now.