divvun / CorpusTools

Tools to manage and convert GiellaLT corpus files
https://giellalt.github.io/CorpusTools/
GNU General Public License v3.0

Kven parallel corpus is full of colons #2

Open th0masbk opened 1 year ago

th0masbk commented 1 year ago

The Kven-Norwegian parallel corpus isn't usable because colons have been inserted between the words in the Norwegian sentences. As a result, "extended search" with multiple words does not work from Norwegian to Kven. Some Kven sentences have the same problem, but not all.

De : som : : ikke : : stiller : med : : | bil | : i : : kortesjen : oppfordres til : : å : : delta : langs : : ruta , : : med : : flagg : og : : hilsener . :\n :\n -- | -- | --

Niitä jokka ei ole myötä piili korteesi ssa pyyethään olemhaan myötä ruutan pitkin fla kk u i n ja tervheisten kans . :\n :\n
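To check how widespread this is, something like the following Python sketch could be run over exported sentences. It assumes sentences are whitespace-tokenized as in the Norwegian example above and that a stray colon always appears as a token of its own. The function names are hypothetical and not part of CorpusTools. It also drops legitimate stand-alone colons, so it is meant as a diagnostic, not as a fix.

```python
def stray_colon_ratio(sentence: str) -> float:
    """Return the share of whitespace-separated tokens that are a bare ':'."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token == ":") / len(tokens)


def strip_stray_colons(sentence: str) -> str:
    """Drop every token that is exactly ':' (assumption: these are artifacts)."""
    return " ".join(token for token in sentence.split() if token != ":")


sample = "De : som : : ikke : : stiller : med : :"
print(stray_colon_ratio(sample))
print(strip_stray_colons(sample))
```

A high ratio for most Norwegian sentences and a low one for most Kven sentences would match what I see in the search results.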

albbas commented 1 year ago

This appears to be related to korp_parallel.py

albbas commented 11 months ago

Example of a search that shows this phenomenon: https://gtweb.uit.no/f_korp/?mode=parallel#?lang=nb&stats_reduce=word&parallel_corpora=fkv&cqp_nob=%5B%5D&corpus=nob2fkv_admin_20210319-nob,nob2fkv_bible_20210319-nob,nob2fkv_facta_20210319-nob,nob2fkv_ficti_20210319-nob,nob2fkv_news_20210319-nob&page=0&cqp_fkv=%5Bword%20%3D%20%22jokka%22%5D&search=cqp%7C%5Bword%20%3D%20%22jokka%22%5D
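For readers who don't want to open the link, the search parameters can be read out of the URL fragment. This is a small sketch using only Python's standard `urllib.parse`; the helper name `korp_fragment_params` is made up for illustration, and it assumes the Korp front end keeps its state as a query string after `#?`, as this URL does.

```python
from urllib.parse import parse_qs, urlsplit

# The search URL from this comment, split across lines for readability.
URL = (
    "https://gtweb.uit.no/f_korp/?mode=parallel#?lang=nb&stats_reduce=word"
    "&parallel_corpora=fkv&cqp_nob=%5B%5D"
    "&corpus=nob2fkv_admin_20210319-nob,nob2fkv_bible_20210319-nob,"
    "nob2fkv_facta_20210319-nob,nob2fkv_ficti_20210319-nob,"
    "nob2fkv_news_20210319-nob"
    "&page=0&cqp_fkv=%5Bword%20%3D%20%22jokka%22%5D"
    "&search=cqp%7C%5Bword%20%3D%20%22jokka%22%5D"
)


def korp_fragment_params(url: str) -> dict:
    """Parse the '#?key=value&...' fragment of a Korp URL into a dict of lists."""
    fragment = urlsplit(url).fragment  # everything after '#'
    return parse_qs(fragment.lstrip("?"))  # percent-decoding is done by parse_qs


params = korp_fragment_params(URL)
print(params["cqp_fkv"][0])            # the CQP query on the Kven side
print(params["corpus"][0].split(","))  # the corpora that were searched
```

Decoded, the Kven-side query is `[word = "jokka"]`, searched across the five nob2fkv corpora listed in the `corpus` parameter.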