Closed matyaskopp closed 1 year ago
@matyaskopp, yes, you can use this new listPerson.
@matyaskopp, I see that we have a lot of form and syntax warnings. How can we fix that?
Well, maybe @matyaskopp has a better idea, but, in short, you should get a better parser, because the one you are using produces a great many invalid UD parses. That said, it is probably too late now anyway for 3.1.
The problem is that ES-CT feeds UDPipe input that is already tokenized and already split into sentences. UDPipe handles this kind of input quite badly.
If you leave tokenization to UDPipe, then everything works, and a sentence never ends up with multiple roots. I guess that the sentence inside `<s>`, in fact, contains multiple trees that do not overlap, so some postprocessing can probably solve it.
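A minimal sketch of what such postprocessing could look like, assuming each token is available as an `(id, form, head)` triple with 1-based ids and head `0` marking a root. The function name `split_multi_root` and the triple layout are hypothetical, not part of the ES-CT pipeline or of UDPipe, and the sketch assumes every token reaches a root (no cycles):

```python
def split_multi_root(tokens):
    """Split one multi-root sentence into one sentence per root.

    tokens: list of (id, form, head) triples; ids are 1-based, head 0 = root.
    Hypothetical helper; assumes the heads form valid, non-cyclic trees.
    """
    head_of = {tid: head for tid, _, head in tokens}

    def root_of(tid):
        # Follow head links upwards until a root (head == 0) is reached.
        while head_of[tid] != 0:
            tid = head_of[tid]
        return tid

    # Group tokens by the root of the tree they belong to, keeping text order.
    groups = {}
    for tid, form, head in tokens:
        groups.setdefault(root_of(tid), []).append((tid, form, head))

    sentences = []
    for members in groups.values():
        # Renumber ids from 1 inside each new sentence and remap the heads.
        new_id = {tid: i for i, (tid, _, _) in enumerate(members, start=1)}
        sentences.append(
            [(new_id[tid], form, new_id.get(head, 0)) for tid, form, head in members]
        )
    return sentences


# Illustrative heads only, not real UDPipe output.
parsed = [(1, "Hola", 0), (2, "!", 1), (3, "Com", 4), (4, "estàs", 0), (5, "?", 4)]
for sentence in split_multi_root(parsed):
    print(sentence)
```

Whether the resulting pieces should become separate `<s>` elements or be joined under a single root is a corpus design decision that this sketch leaves open.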
It is too late to fix it. Frankly, you have known about problems with linguistic annotation for a few months:
This is probably the reason for the multiple roots:
Your sentences end only with `.` or `?`, which I guess is the complete list of characters at the end of sentences in Catalan.
https://github.com/IULATERM-TRL-UPF/ParlaMint_ES-CT/blob/e99e7bf9b7e43d2b30fd473e7ee2fe31540f8c86/src/util_freeling.py#L57
In your implementation, `Hola! Com estàs?` is one sentence, but UDPipe sees it as two sentences (= two roots).
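To make the mismatch concrete, here is a small sketch of a splitter whose terminator set is configurable. It is not the code from `util_freeling.py`; the function `split_sentences` and its regex are assumptions made only for illustration:

```python
import re


def split_sentences(text, terminators=".?"):
    """Split text at whitespace that follows one of the terminator characters.

    Hypothetical helper, not the ES-CT implementation.
    """
    pattern = r"(?<=[" + re.escape(terminators) + r"])\s+"
    return [s for s in re.split(pattern, text) if s]


# With only "." and "?" as terminators, the greeting stays in one sentence.
print(split_sentences("Hola! Com estàs?"))
# Adding "!" yields two sentences, matching the two roots UDPipe produces.
print(split_sentences("Hola! Com estàs?", ".?!"))
```

With the default terminators the call returns one item; with `!` added it returns `Hola!` and `Com estàs?` separately.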
@rjzevallos, should we use listPerson, which has been pushed with a pull request for the whole corpus in the release?