clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Segments without sentences in annotated version #555

Open RubenvanHeusden opened 1 year ago

RubenvanHeusden commented 1 year ago

During validation of the final NL corpus, there were several warnings about segments without sentences:

WARN: skipping segment without sentences ParlaMint-NL_2014-04-16-tweedekamer-5.seg6

After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed, in which case we put a gap element in the segment, like below:

<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="editorial">
        <desc>Sentence could not be parsed: text_of_unparsable_sentence</desc>
    </gap>
</seg>

@TomazErjavec already suggested using reason="processingError" for this, so the element would become:

<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="processingError">
        <desc>text_of_unparsable_sentence</desc>
    </gap>
</seg>

The suggestion also includes omitting the segment or utterance if the error leaves it empty.
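The relabelling described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the NL corpus: the function name `mark_processing_errors` and the helper `local` are hypothetical, and the prefix string is assumed to match the wording in the example exactly. Tags are compared by local name so the sketch works with or without the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the prefix is exactly the wording shown in the example above.
PREFIX = "Sentence could not be parsed: "


def local(tag):
    """Return the tag name without any '{namespace}' part."""
    return tag.rsplit("}", 1)[-1]


def mark_processing_errors(root):
    """Relabel parse-failure gaps as processingError; return how many changed."""
    changed = 0
    for gap in root.iter():
        if local(gap.tag) != "gap" or gap.get("reason") != "editorial":
            continue
        desc = next((c for c in gap if local(c.tag) == "desc"), None)
        if desc is None or not (desc.text or "").startswith(PREFIX):
            continue  # some other editorial gap: leave it untouched
        gap.set("reason", "processingError")
        desc.text = desc.text[len(PREFIX):]  # keep only the sentence text
        changed += 1
    return changed


sample = """<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="editorial">
        <desc>Sentence could not be parsed: text_of_unparsable_sentence</desc>
    </gap>
</seg>"""

root = ET.fromstring(sample)
count = mark_processing_errors(root)
print(count)
print(ET.tostring(root, encoding="unicode"))
```

Editorial gaps whose description does not start with the assumed prefix are deliberately skipped, so unrelated gaps keep their original reason.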

I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.

@TomazErjavec @matyaskopp, what do you think about this? Is this OK, or is there maybe some way to still keep the reference to the annotated segment? I could of course remove them from both versions, but as far as I could see, the sentences in the plain text versions were valid sentences, so it would be a shame to leave them out.

TomazErjavec commented 1 year ago

> I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.

I don't see why this would be problematic, but who knows, it might be, e.g. if machine translation (MT) takes the unannotated version of the corpus as input and we then also want to link the machine-translated version to the .ana version.

> is there maybe some way to still keep the reference to the annotated segment?

Under the assumption that you would remove only segments (and not whole utterances), you could give the gap the ID of the deleted segment. Let me know if you decide to do this: gap currently cannot have @xml:id, so I would then need to add it to the schema.
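A rough Python sketch of that idea follows: each segment that contains nothing but a single gap is replaced by the gap itself, which inherits the segment's xml:id. The function name `hoist_gap_ids`, the `<u>` wrapper and the short IDs in the sample are made up for illustration, and the output would only validate once the schema permits @xml:id on gap.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Return the tag name without any '{namespace}' part."""
    return tag.rsplit("}", 1)[-1]


def hoist_gap_ids(root):
    """Replace gap-only segments by their gap, keeping the segment ID."""
    replaced = 0
    # Snapshot the elements first, because the tree is modified in the loop.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if local(child.tag) != "seg":
                continue
            kids = list(child)
            if len(kids) != 1 or local(kids[0].tag) != "gap":
                continue  # segment has sentences or other content
            gap = kids[0]
            seg_id = child.get(XML_ID)
            if seg_id is not None:
                gap.set(XML_ID, seg_id)
            gap.tail = child.tail  # preserve surrounding whitespace
            parent[index] = gap
            replaced += 1
    return replaced


# Hypothetical utterance with one parsed and one unparsable segment.
sample = """<u xml:id="u1">
  <seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1">ok</s></seg>
  <seg xml:id="u1.seg2"><gap reason="processingError"><desc>text_of_unparsable_sentence</desc></gap></seg>
</u>"""

root = ET.fromstring(sample)
replaced = hoist_gap_ids(root)
print(ET.tostring(root, encoding="unicode"))
```

Because the gap stays inside the utterance, the utterance never becomes empty, which matches the assumption that only segments are removed.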

We could also just give up on the idea that segments need to have at least one sentence, but this is a worst-case scenario, as we catch quite a few true errors by having this constraint.
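For anyone who wants to find the affected segments before submitting, the constraint can be approximated with a short Python check. This is not the project's validation code, only a sketch under the assumption that a valid segment contains at least one `s` element somewhere below it; `segments_without_sentences` is a hypothetical name and the `seg5` entry in the sample is invented.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Return the tag name without any '{namespace}' part."""
    return tag.rsplit("}", 1)[-1]


def segments_without_sentences(root):
    """Return the xml:id of every <seg> that has no <s> descendant."""
    missing = []
    for seg in root.iter():
        if local(seg.tag) != "seg":
            continue
        if not any(local(e.tag) == "s" for e in seg.iter()):
            missing.append(seg.get(XML_ID))
    return missing


# The seg5 entry is invented; seg6 is the segment from the warning above.
sample = """<u>
  <seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg5"><s>ok</s></seg>
  <seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="processingError"><desc>text_of_unparsable_sentence</desc></gap>
  </seg>
</u>"""

missing = segments_without_sentences(ET.fromstring(sample))
for seg_id in missing:
    print(seg_id)
```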

matyaskopp commented 1 year ago

> After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed...

I don't understand what failed.

Did int-tagger produce this error, so that the rest of the tools (udify and flair-ner) were not used?

RubenvanHeusden commented 1 year ago

I don't have access to the complete logs of the NLP pipeline, as this was done by the Belgian team. As far as I can tell, this happens with very long sentences or with characters that cause the tokenisation to fail, but I am not completely sure, so I will have to look into this a bit more.