Open RubenvanHeusden opened 1 year ago
I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.
I don't see why this would be problematic, but who knows, it might be, e.g. if MT takes the unannotated version of the corpus to translate and we also want to link the MTed version to the .ana version.
is there maybe some way to still keep the reference to the annotated segment?
Under the assumption that you would remove only segments (and not whole utterances), you could give the gap the ID of the deleted segment. Let me know if you decide to do this: currently gap cannot have @xml:id, so I would then need to add it in the schema.
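To make the idea concrete, here is a rough sketch of what "give the gap the ID of the deleted segment" could look like. It is only an illustration under assumptions: the function name `replace_empty_segs` and the sample IDs are made up, it uses Python's standard `xml.etree.ElementTree` rather than the actual ParlaMint tooling, and the output would only validate once @xml:id is allowed on gap in the schema.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SEG, S, GAP = f"{{{TEI}}}seg", f"{{{TEI}}}s", f"{{{TEI}}}gap"


def replace_empty_segs(root):
    """Replace every <seg> without an <s> by a <gap> that keeps the seg's xml:id.

    Hypothetical helper: returns the list of IDs that were moved onto a gap.
    """
    replaced = []
    # Snapshot the elements first so the tree can be modified safely.
    for parent in list(root.iter()):
        for seg in list(parent):
            if seg.tag != SEG or seg.find(f".//{S}") is not None:
                continue
            attrib = {"reason": "processingError"}
            if seg.get(XML_ID) is not None:
                attrib[XML_ID] = seg.get(XML_ID)
            gap = ET.Element(GAP, attrib)
            gap.tail = seg.tail  # keep surrounding whitespace
            index = list(parent).index(seg)
            parent.remove(seg)
            parent.insert(index, gap)
            replaced.append(seg.get(XML_ID))
    return replaced


# Made-up sample utterance; the IDs are placeholders, not real corpus IDs.
sample = """<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="u1">
  <seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>
  <seg xml:id="u1.seg2"><gap/></seg>
</u>"""

root = ET.fromstring(sample)
replaced = replace_empty_segs(root)
```

With this, the reference from the plain text segment would still resolve to an element in the .ana version, just a gap rather than a seg.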
We could also just give up on the idea that segments need to have at least one sentence, but this is a worst-case scenario, as we catch quite a few true errors by having this constraint.
After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed...
I don't understand what failed.
Did int-tagger produce this error, so that the rest of the tools (udify and flair-ner) were not used?
I don't have access to the complete logs of the NLP pipeline, as this was done by the Belgian team. As far as I can tell, this happens with very long sentences or with characters that cause the tokenisation to fail, but I am not completely sure about this. I will have to look into it a bit more.
During validation of the final NL corpus, there were several warnings about segments without sentences:
WARN: skipping segment without sentences ParlaMint-NL_2014-04-16-tweedekamer-5.seg6
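For reference, a check like the one behind this warning could be sketched roughly as follows. This is not the actual validation code: the function name `segments_without_sentences` and the sample IDs are made up for illustration, and it only assumes that segments are TEI `seg` elements and sentences are `s` elements.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def segments_without_sentences(root):
    """Return the xml:id of every <seg> that contains no <s> element."""
    return [
        seg.get(XML_ID)
        for seg in root.iter(f"{{{TEI}}}seg")
        if seg.find(f".//{{{TEI}}}s") is None
    ]


# Made-up sample utterance; the IDs are placeholders, not real corpus IDs.
sample = """<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="u1">
  <seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>
  <seg xml:id="u1.seg2"><gap/></seg>
</u>"""

empty = segments_without_sentences(ET.fromstring(sample))
for seg_id in empty:
    print(f"WARN: skipping segment without sentences {seg_id}")
```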
After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed, in which case we put a gap element in the segment, like below:
@TomazErjavec already suggested using reason='processingError' on the gap element for this, and omitting the segment / utterance if the error results in empty segments.
I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.
@TomazErjavec @matyaskopp, what do you think about this? Is this ok, or is there maybe some way to still keep the reference to the annotated segment? I could of course remove them from both versions, but as far as I could see, the sentences in the plain text versions were valid sentences, so it would be a shame to leave them out.
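To see how many segments would be affected, something like the sketch below could list the segment IDs that exist in the plain text version but have no counterpart in the annotated version. The helper names (`seg_ids`, `missing_in_ana`) and the sample documents are hypothetical; in practice the two roots would be parsed from the corresponding plain and .ana files.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def seg_ids(root):
    """Collect the xml:id values of all <seg> elements that have one."""
    return {
        seg.get(XML_ID)
        for seg in root.iter(f"{{{TEI}}}seg")
        if seg.get(XML_ID) is not None
    }


def missing_in_ana(plain_root, ana_root):
    """Return the sorted segment IDs present in plain but absent from .ana."""
    return sorted(seg_ids(plain_root) - seg_ids(ana_root))


# Made-up samples: seg2 was dropped from the annotated version.
plain = ET.fromstring(
    '<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="u1">'
    '<seg xml:id="u1.seg1">Dank.</seg>'
    '<seg xml:id="u1.seg2">Een heel lange zin.</seg>'
    "</u>"
)
ana = ET.fromstring(
    '<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="u1">'
    '<seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>'
    "</u>"
)

missing = missing_in_ana(plain, ana)
```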