clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

ES-CT: translations and data update #777

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

@rjzevallos, should we use listPerson, which has been pushed with a pull request for the whole corpus in the release?

rjzevallos commented 1 year ago

@matyaskopp, yes, you can use this new listPerson.

rjzevallos commented 1 year ago

@matyaskopp, I see that we have a lot of form and syntax warnings. How can we fix them?

TomazErjavec commented 1 year ago

Well, maybe @matyaskopp has a better idea, but, in short, you should get a better parser, because the one you are using produces a great many invalid UD parses. That said, it is probably too late now anyway for 3.1.

matyaskopp commented 1 year ago

The problem is that ES-CT feeds UDPipe input that is already tokenized and already split into sentences, and UDPipe handles this kind of input quite badly. If you leave tokenization to UDPipe, everything works and a sentence never ends up with multiple roots. I guess that the sentence inside <s> in fact contains multiple trees that do not overlap, so some postprocessing can probably solve it. It is too late to fix it now. Frankly, you have known about the problems with the linguistic annotation for a few months:

This is probably the reason for the multiple roots: your sentences end only with . or ?, which I guess is taken to be the complete list of sentence-final characters in Catalan. https://github.com/IULATERM-TRL-UPF/ParlaMint_ES-CT/blob/e99e7bf9b7e43d2b30fd473e7ee2fe31540f8c86/src/util_freeling.py#L57 In your implementation, Hola! Com estàs? is one sentence, but UDPipe sees it as two sentences (= two roots).
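
To make the mismatch concrete, here is a minimal sketch, not the actual code in util_freeling.py. The names NARROW_END, WIDE_END, split_sentences and count_roots are hypothetical, and the CoNLL-U fragment is written by hand for illustration; it is not real UDPipe output. It shows how a splitter that stops only at . or ? keeps Hola! Com estàs? as one sentence, and how counting HEAD == 0 tokens would flag such a sentence in postprocessing.

```python
import re

# Hypothetical splitter mirroring the behaviour described above:
# a sentence ends only after "." or "?".
NARROW_END = re.compile(r"(?<=[.?])\s+")
# Variant that also treats "!" as sentence-final.
WIDE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text, pattern):
    """Split text at whitespace that follows a sentence-final character."""
    return [s for s in pattern.split(text.strip()) if s]


def count_roots(conllu_sentence):
    """Count tokens whose HEAD column is 0 in one CoNLL-U sentence block."""
    roots = 0
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if cols[6] == "0":
            roots += 1
    return roots


# Hand-written illustrative parse (an assumption, not real parser output):
# two disjoint trees inside what the corpus treats as a single sentence.
SAMPLE = "\n".join([
    "# text = Hola! Com estàs?",
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_",
    "2\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "3\tCom\tcom\tADV\t_\t_\t4\tadvmod\t_\t_",
    "4\testàs\testar\tVERB\t_\t_\t0\troot\t_\t_",
    "5\t?\t?\tPUNCT\t_\t_\t4\tpunct\t_\t_",
])

if __name__ == "__main__":
    print(split_sentences("Hola! Com estàs?", NARROW_END))
    print(split_sentences("Hola! Com estàs?", WIDE_END))
    print(count_roots(SAMPLE))
```

With the narrow pattern the text stays one sentence; with the wide pattern it splits into two, which matches what UDPipe sees. A check like count_roots(...) > 1 could be used to find the affected <s> elements before deciding how to repair them.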