Status: Closed (closed by matyaskopp 1 year ago)
This has all been resolved, I think. The final word is in d6216a4: bilingual corpora now get 3 CoNLL-U files per .ana.xml file.
I just ran make conllu, and the result is in 5fd0a60.
Nice, I have one note:
- one complete file, with both languages and metadata % lang = xx
This adds % lang = xx only if both languages are present in the file, so if there is only one language in a CoNLL-U file, it is not marked (most new UA files).
I think one comment does not bother the reader too much, so % lang = xx can be present in the joint CoNLL-U file in all corpora.
Do you agree?
Absolutely, esp. as there is now a bug: if the joint CoNLL-U file has no, e.g., Russian, it also has no % lang. I tend to overcomplicate...
Done now.
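For readers following along, the difference between the old and the agreed behaviour can be sketched roughly as below. This is only an illustration, not the code from the commits mentioned above: the function name, the `always` flag, and the assumption that the marker is emitted once per sentence are all hypothetical.

```python
def lang_comment_lines(sentence_langs, always=True):
    """Return the '% lang = xx' line for each sentence, or None if none is emitted.

    sentence_langs: language codes, one per sentence, in file order.
    always=False mimics the old behaviour (mark only when the joint file
    contains more than one language); always=True marks every joint file.
    Per-sentence granularity is an assumption made for this sketch.
    """
    is_bilingual = len(set(sentence_langs)) > 1
    emit = always or is_bilingual
    return [f"% lang = {lang}" if emit else None for lang in sentence_langs]


# A joint file that happens to contain only Ukrainian sentences:
old_style = lang_comment_lines(["uk", "uk"], always=False)
new_style = lang_comment_lines(["uk", "uk"], always=True)
```

With `always=False`, a single-language joint file gets no marker at all, which is the case described in the thread; with `always=True`, every joint file is marked in the same way.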
Currently, language-separated files are created for every component file. There is no required way to construct sentence and paragraph ids, so without the TEI.ana file it is not possible to merge two different-language CoNLL-U files in the order given in the corresponding TEI.ana component file.
Current CoNLL-U notes: https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0-uk.conllu#L1-L4
These could probably be extended with the order of the utterance/paragraph/sentence, to give a navigable, sortable, unified way of encoding sentences.
Examples of different id designs:
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-DK/ParlaMint-DK_2014-10-07-20141-M1.conllu#L1-L4
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-EE/ParlaMint-EE_2015-01-12.conllu#L1-L4
https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-FR/ParlaMint-FR_2018-01-16-O1111.conllu#L1-L4
Examples of unsortable ids: https://github.com/clarin-eric/ParlaMint/blob/16d37cc59b6d9615dfe95d66189c1661affaaee2/Data/ParlaMint-SE/ParlaMint-SE_2016-11-16-prot-201617--29.conllu#L1-L4
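To illustrate why an explicit utterance/paragraph/sentence order would help, here is a minimal sketch of merging two language-separated files without the TEI.ana file. The comment format `# order = U.P.S` is purely hypothetical (it is not an existing ParlaMint or CoNLL-U convention), as are the function names; the sketch also assumes that each per-language file is already in document order.

```python
import heapq
import re

# Hypothetical comment line: "# order = U.P.S" (utterance, paragraph, sentence).
ORDER_RE = re.compile(r"^# order = (\d+)\.(\d+)\.(\d+)$")


def read_sentences(text):
    """Split CoNLL-U text into sentence blocks (separated by blank lines)."""
    return [block for block in text.strip().split("\n\n") if block.strip()]


def order_key(block):
    """Return (utterance, paragraph, sentence) as integers for one block."""
    for line in block.splitlines():
        match = ORDER_RE.match(line)
        if match:
            return tuple(int(group) for group in match.groups())
    raise ValueError("sentence block has no order comment")


def merge_by_order(*texts):
    """Merge already-ordered per-language files into one ordered list."""
    streams = [read_sentences(text) for text in texts]
    return list(heapq.merge(*streams, key=order_key))


uk = "# order = 1.1.1\n1\ta\n\n# order = 2.1.1\n1\tb\n"
ru = "# order = 1.1.2\n1\tc\n\n# order = 1.2.1\n1\td\n"
merged = merge_by_order(uk, ru)
```

Comparing integer tuples rather than the raw id strings avoids the usual string-sorting problem where sentence 10 would sort before sentence 9, which is one way ids can end up unsortable.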