clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Conllu format for multilingual corpora #653

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Currently, separate per-language files are created for every component file. There is no required scheme for constructing sentence and paragraph ids, so without the TEI.ana file it is not possible to merge two CoNLL-U files for different languages into the order used in the corresponding TEI.ana component file.

Current CoNLL-U comment lines: https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0-uk.conllu#L1-L4

These could probably be extended with the order of each utterance/paragraph/sentence, which would give a navigable, sortable, and unified way of identifying sentences:

# newdoc id = ParlaMint-UA_2014-12-02-m0.u1 
# newdoc_ord = 1
# newpar id = ParlaMint-UA_2014-12-02-m0.u1.p1 
# newpar_ord = 1.1
# sent_id = ParlaMint-UA_2014-12-02-m0.u1.p1.s1 
# sent_ord = 1.1.1
# text = Доброго дня, вельмишановні народні депутати! 
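To illustrate why an explicit order key would help, here is a minimal Python sketch of merging language-separated CoNLL-U files without consulting the TEI.ana file. It assumes the `# sent_ord = u.p.s` comment proposed above is present on every sentence (this is not part of the current files), and the function names are hypothetical. The key is compared as a tuple of integers, so utterance 10 sorts after utterance 2, which plain string comparison of ids would not guarantee.

```python
def read_sentences(conllu_text: str) -> list[str]:
    """Split CoNLL-U text into sentence blocks (blocks are separated by blank lines)."""
    return [block.strip("\n") for block in conllu_text.split("\n\n") if block.strip()]


def ord_key(block: str) -> tuple[int, ...]:
    """Return the numeric sort key taken from the proposed '# sent_ord = u.p.s' comment."""
    for line in block.splitlines():
        if line.startswith("# sent_ord"):
            value = line.split("=", 1)[1].strip()
            return tuple(int(part) for part in value.split("."))
    raise ValueError("sentence block has no sent_ord comment")


def merge_by_order(*texts: str) -> str:
    """Merge sentence blocks from several CoNLL-U texts, ordered by sent_ord."""
    blocks = [block for text in texts for block in read_sentences(text)]
    blocks.sort(key=ord_key)
    # Every CoNLL-U sentence, including the last one, ends with a blank line.
    return "\n\n".join(blocks) + "\n\n"
```

Comments such as `# newdoc id` and `# newpar id` simply travel with the sentence block they are attached to in this sketch; a real merge might want to deduplicate them.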

Examples of different id designs:
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-DK/ParlaMint-DK_2014-10-07-20141-M1.conllu#L1-L4
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-EE/ParlaMint-EE_2015-01-12.conllu#L1-L4
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-FR/ParlaMint-FR_2018-01-16-O1111.conllu#L1-L4

Examples of unsortable ids: https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-SE/ParlaMint-SE_2016-11-16-prot-201617--29.conllu#L1-L4

TomazErjavec commented 1 year ago

I think this has all been resolved; the final word is in d6216a4. Bilingual corpora now get 3 CoNLL-U files per .ana.xml file:

I just ran make conllu, and the result is in 5fd0a60.

matyaskopp commented 1 year ago

Nice, I have one note:

  • one complete file, with both languages and metadata % lang = xx

This adds % lang = xx only if both languages are present in the file, so when a CoNLL-U file contains only one language, that language is not marked (this affects most new UA files).

I think one extra comment does not bother the reader much, so % lang = xx could be present in the joint CoNLL-U file in all corpora. Do you agree?
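A rough Python sketch of the difference between the two behaviours, for illustration only: the function name and the `always` flag are hypothetical, and it does not reflect how the actual conversion script is written or where exactly it places the comment.

```python
from typing import Optional


def lang_comments(sentence_langs: list[str], always: bool = True) -> list[Optional[str]]:
    """Return the '% lang = xx' comment for each sentence, or None when it is omitted.

    always=False mimics the behaviour described above: the comment is written
    only when more than one language occurs in the file.
    always=True is the suggested behaviour: the comment is always written.
    """
    mixed = len(set(sentence_langs)) > 1
    return [
        f"% lang = {lang}" if (always or mixed) else None
        for lang in sentence_langs
    ]
```

With `always=False`, a file containing only Ukrainian sentences gets no language marking at all; with `always=True`, every sentence is marked regardless of how many languages the file contains.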

TomazErjavec commented 1 year ago

Absolutely, especially as there is currently a bug: if the joint CoNLL-U file contains no Russian, for example, it also has no % lang. I tend to overcomplicate...

TomazErjavec commented 1 year ago

Done now.