Conllu format for multilingual corpora

matyaskopp commented 1 year ago

Currently, a separate file per language is created for every component file. There is no required scheme for the ids of sentences and paragraphs, so without the TEI.ana file it is not possible to merge two conllu files in different languages into the order used in the corresponding TEI.ana component file.

Current conllu comment lines: https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0-uk.conllu#L1-L4

These can probably be extended with the order of the utterance/paragraph/sentence, to give a navigable, sortable, and uniform way of encoding the position of each sentence:

# newdoc id = ParlaMint-UA_2014-12-02-m0.u1 
# newdoc_ord = 1
# newpar id = ParlaMint-UA_2014-12-02-m0.u1.p1 
# newpar_ord = 1.1
# sent_id = ParlaMint-UA_2014-12-02-m0.u1.p1.s1 
# sent_ord = 1.1.1
# text = Доброго дня, вельмишановні народні депутати!
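
With such order comments, two language-specific files could be merged back into document order without consulting the TEI.ana file. The following is only a minimal sketch of that idea: it assumes the proposed `# sent_ord = u.p.s` comment is present in every sentence block, and the function names (`read_sentences`, `sent_ord`, `merge_by_order`) are hypothetical, not part of any ParlaMint script. The order is compared as a tuple of integers so that 10.1.1 sorts after 2.1.1.

```python
import re

# Proposed (not yet existing) comment, e.g. "# sent_ord = 1.1.1"
SENT_ORD_RE = re.compile(r"^#\s*sent_ord\s*=\s*(\d+(?:\.\d+)*)\s*$")


def read_sentences(text):
    """Split CoNLL-U text into sentence blocks, each a list of lines."""
    return [
        block.splitlines()
        for block in re.split(r"\n\s*\n", text.strip())
        if block.strip()
    ]


def sent_ord(block):
    """Return the sentence order as a tuple of ints for numeric sorting."""
    for line in block:
        match = SENT_ORD_RE.match(line)
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    raise ValueError("sentence block has no sent_ord comment")


def merge_by_order(*texts):
    """Merge sentence blocks from several CoNLL-U texts into document order."""
    blocks = [block for text in texts for block in read_sentences(text)]
    return sorted(blocks, key=sent_ord)
```

Calling `merge_by_order(uk_text, ru_text)` on the contents of the two language files would return the sentence blocks of both in one ordered list.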

Examples of different id designs:
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-DK/ParlaMint-DK_2014-10-07-20141-M1.conllu#L1-L4
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-EE/ParlaMint-EE_2015-01-12.conllu#L1-L4
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-FR/ParlaMint-FR_2018-01-16-O1111.conllu#L1-L4

Examples of unsortable ids: https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-SE/ParlaMint-SE_2016-11-16-prot-201617--29.conllu#L1-L4

TomazErjavec commented 1 year ago

I think this has all been resolved; the final word is in d6216a4. Bilingual corpora now get 3 CoNLL-U files per .ana.xml file:

one complete file, with both languages and metadata % lang = xx
one file for the first language
one file for the second language
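
As an illustration of how the two per-language files relate to the complete one, here is a rough sketch that splits a joint file by its language comment. It is not the code from d6216a4. It assumes that the language comment appears once per sentence block, and because the exact leading character is not certain from this thread, it accepts both `% lang = xx` and `# lang = xx`. The name `split_by_language` and the fallback code `und` are made up for the example.

```python
import re
from collections import defaultdict

# Assumption: one language comment per sentence block, "%" or "#" prefixed.
LANG_RE = re.compile(r"^[#%]\s*lang\s*=\s*(\S+)")


def split_by_language(joint_text, default_lang="und"):
    """Group the sentence blocks of a joint CoNLL-U text by language code."""
    per_lang = defaultdict(list)
    for block in re.split(r"\n\s*\n", joint_text.strip()):
        if not block.strip():
            continue
        lang = default_lang  # used when the block carries no lang comment
        for line in block.splitlines():
            match = LANG_RE.match(line)
            if match:
                lang = match.group(1)
                break
        per_lang[lang].append(block)
    # CoNLL-U sentences are terminated by a blank line.
    return {lang: "\n\n".join(blocks) + "\n\n" for lang, blocks in per_lang.items()}
```

Blocks without any language comment end up under the fallback key, which makes unmarked sentences easy to spot.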

I just ran make conllu and the result is in 5fd0a60.

matyaskopp commented 1 year ago

Nice, I have one note:

one complete file, with both languages and metadata % lang = xx

% lang = xx is added only if both languages are present in the file, so if a conllu file contains only one language, it is not marked (most new UA files).

I think one extra comment does not bother the reader much, so % lang = xx can be present in the joint conllu file in all corpora. Do you agree?

TomazErjavec commented 1 year ago

Absolutely, esp. as there is now a bug: if the joint conllu file has no, e.g., Russian, it also has no % lang. I tend to overcomplicate...

TomazErjavec commented 1 year ago

Done now.

clarin-eric / ParlaMint

Conllu format for multilingual corpora #653