clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

multilingual corpora - fix conllu conversion script #561

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Currently, only BE uses multiple languages in settings: https://github.com/clarin-eric/ParlaMint/blob/1a838f7d3435941d7e9e06d9ecfdba52fe141dac/Scripts/parlamint2conllu.pl#L32-L62

But there are more parliaments with multiple languages...
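To find out which other corpora are affected, one could count the effective language of each segment in a corpus file. The following is only a minimal sketch and not part of the ParlaMint scripts: it assumes the language is marked with `xml:lang` on `seg` elements or inherited from an ancestor, and the function name `utterance_languages` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def utterance_languages(xml_text):
    """Count <seg> elements per effective xml:lang (inherited from ancestors)."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, inherited):
        # An element without its own xml:lang inherits the nearest ancestor's.
        lang = elem.get(XML_LANG, inherited)
        if elem.tag == f"{{{TEI_NS}}}seg":
            counts[lang] += 1
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return counts


# Hypothetical sample, not taken from any ParlaMint corpus.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="nl">
  <text><body><div>
    <u who="#A"><seg>Goedemorgen.</seg><seg xml:lang="fr">Bonjour.</seg></u>
    <u who="#B" xml:lang="fr"><seg>Merci.</seg></u>
  </div></body></text>
</TEI>"""

counts = utterance_languages(sample)
is_multilingual = len(counts) > 1
```

A corpus whose files yield more than one key in `counts` would presumably need an entry with several languages in the conversion script's settings.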

TomazErjavec commented 1 year ago

I see this has already been done. However, I just noticed that we have two languages in NO utterances, namely nno and nob. I fixed this in 5a13cd6.

But, there are two issues here, @tungland:

TomazErjavec commented 1 year ago

It occurs to me that maybe I wasn't clear, even to myself: if NO has been processed simply with a "no" pipeline, we can revert the languages for NO to just "no", as it was, and the problem with multiple CoNLL-U files goes away.

tungland commented 1 year ago

@TomazErjavec

In ParlaMint we use two-char language codes, when possible. Any special reason you don't use nb and nn?

No reason. Where did I use a 3-letter code?

Did you actually run your annotation twice, once for nno and once for nob, and insert the results in the appropriate utterances? Or is this all done with a generic no pipeline?

Our model supports both language modes of Norwegian, so yes, there was only one annotation for Norwegian.

TomazErjavec commented 1 year ago

In ParlaMint we use two-char language codes, when possible. Any special reason you don't use nb and nn?

No reason. Where did I use a 3-letter code?

@tungland, like here: https://github.com/clarin-eric/ParlaMint/blob/d6ca7bdfa0e2a4394c4f1b8e2921c98c6c1b3fb7/Data/ParlaMint-NO/ParlaMint-NO_1999-03-02-lower.ana.xml#L125

Did you actually run your annotation twice, once for nno and once for nob, and insert the results in the appropriate utterances? Or is this all done with a generic no pipeline?

Our model supports both language modes of Norwegian, so yes, there was only one annotation for Norwegian.

OK, reverted this change then.

tungland commented 1 year ago

like here: https://github.com/clarin-eric/ParlaMint/blob/d6ca7bdfa0e2a4394c4f1b8e2921c98c6c1b3fb7/Data/ParlaMint-NO/ParlaMint-NO_1999-03-02-lower.ana.xml#L125

Ah ok. I must have missed this preference. Is this blocking submission?

TomazErjavec commented 1 year ago

I must have missed this preference. Is this blocking submission?

No, but you might consider fixing it for 3.1.
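For illustration only, such a fix could be as small as rewriting the three-letter codes to their two-letter ISO 639-1 equivalents (nob to nb, nno to nn). The sketch below assumes the codes appear as `xml:lang` attribute values in the `.ana.xml` files; the helper name `normalize_lang_codes` is hypothetical, and a plain regex substitution is used so the rest of the file stays byte-for-byte unchanged.

```python
import re

# ISO 639-3 -> ISO 639-1 for the two codes discussed in this thread.
THREE_TO_TWO = {"nob": "nb", "nno": "nn"}

# Matches only xml:lang values that are exactly three letters long.
XML_LANG_ATTR = re.compile(r'xml:lang="([A-Za-z]{3})"')


def normalize_lang_codes(xml_text):
    """Replace known three-letter xml:lang values with two-letter codes."""

    def repl(match):
        code = match.group(1)
        # Unknown three-letter codes are left untouched.
        return f'xml:lang="{THREE_TO_TWO.get(code, code)}"'

    return XML_LANG_ATTR.sub(repl, xml_text)


# Hypothetical fragment, not copied from the corpus.
fragment = '<u xml:lang="nno"><seg xml:lang="nob">...</seg></u>'
fixed = normalize_lang_codes(fragment)
```

Codes that are already two letters long, and three-letter codes missing from the mapping, pass through unchanged.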

tungland commented 1 year ago

I'll make a note of it! Thanks!

TomazErjavec commented 1 year ago

I'll make a note of it!

I already did with milestone 3.1. :)

TomazErjavec commented 1 year ago

This has all been resolved, I think; the final word is in d6216a4. Now bilingual corpora get 3 CoNLL-U files per .ana.xml file: