Closed thangld201 closed 1 month ago
According to README.md, the Entity annotations are in the CorefUD 1.0 format, which is also documented on the UD website. In this format, mentions can be nested and discontinuous (although I don't see any discontinuous mentions in this Spanish AnCora treebank), so it is highly recommended to use an existing API for parsing (and serialization) instead of implementing your own. See https://ufal.mff.cuni.cz/corefud/crac24#data-api (I am the author of the Udapi API.)
Yes, the entities in AnCora are in the CorefUD format and they can be processed using Udapi. Entities can be nested, and the BIO format cannot express nesting, so I would avoid BIO whenever possible (that is, unless I need to use a tool that requires it).
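To illustrate why BIO loses information here, below is a minimal sketch of flattening nested mentions into BIO tags. It assumes the mention spans have already been extracted by a CorefUD-aware API and are given as hypothetical `(start, end, label)` tuples with an exclusive end; the function name `spans_to_bio`, the example tokens, and the labels are made up for illustration and do not come from the treebank. It keeps only the outermost mention and does not handle discontinuous mentions.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Flatten possibly nested spans into one BIO layer, keeping outermost spans."""
    tags = ["O"] * n_tokens
    # Place longer spans first so that an outer mention wins over anything inside it.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if any(tag != "O" for tag in tags[start:end]):
            # Nested in (or overlapping) a span that is already placed: dropped.
            continue
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


# Made-up example: an outer mention over all four tokens with a nested one on the last token.
tokens = ["la", "Universidad", "de", "Barcelona"]
spans = [(0, 4, "organization"), (3, 4, "place")]
bio = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio):
    print(token, tag)
```

In this sketch the nested `place` mention disappears from the output, which is exactly the information a single BIO layer cannot carry. Keeping the innermost mentions instead would be an equally valid choice; either way one level of nesting is lost.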
Hi @dan-zeman, thank you very much for your work!
I am trying to figure out how to parse these CoNLL-U files and extract the annotated named entities from them in the CoNLL-2003 BIO format. The main hurdle is that I do not know where an entity starts and ends in this CoNLL-U format. I found the opening and closing brackets '(' and ')' quite misleading (or are there multiple layers of nested entities?).
I would really appreciate it if you could help!