How to parse and extract named entities in BIO format from these CoNLL-U files?

UniversalDependencies / UD_Spanish-AnCora

Spanish data from the AnCora corpus.


How to parse and extract named entities in BIO format from these CoNLL-U files? #10

Closed thangld201 closed 1 month ago

thangld201 commented 1 month ago

Hi @dan-zeman, thank you very much for your work!

I am trying to figure out how to parse these CoNLL-U files and extract the annotated named entities in the CoNLL-2003 BIO format. The main hurdle is that I do not know where an entity starts and ends in this format. I found the opening and closing brackets '(' and ')' quite misleading (or are there multiple layers of nested entities?).

I would really appreciate it if you could help!

martinpopel commented 1 month ago

According to README.md, the Entity annotations are in the CorefUD 1.0 format, which is also documented on the UD website. In this format, mentions can be nested and discontinuous (although I don't see any discontinuous mentions in this Spanish AnCora treebank), so it is highly recommended to use an existing API for parsing (and serialization) instead of implementing your own. See https://ufal.mff.cuni.cz/corefud/crac24#data-api (I am the author of the Udapi API.)
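To illustrate why the brackets look confusing, here is a minimal standard-library sketch of how nested brackets in the `Entity` attribute could be read with one stack per entity. It is not Udapi and not a full CorefUD parser: it assumes the fields inside an opening bracket start with `eid-etype`, it ignores discontinuous mentions, and the function name `mention_spans` and the sample values are made up for this example. For real data, the existing API mentioned above remains the safer choice.

```python
import re

# "(eid-..." opens a mention, "(eid-...)" is a single-word mention,
# and "eid)" closes a mention that was opened on an earlier token.
BRACKET = re.compile(r"\((?P<open>[^()]+)(?P<single>\))?|(?P<close>[^()]+)\)")


def mention_spans(entity_values):
    """Return sorted (start, end, eid, etype) tuples; 0-based, end inclusive.

    entity_values holds one string per token: the value of Entity= in MISC,
    or "" when the token carries no Entity attribute.
    Assumption: opening brackets start with the fields "eid-etype".
    """
    open_mentions = {}  # eid -> stack of (start index, etype) still open
    spans = []
    for i, value in enumerate(entity_values):
        for match in BRACKET.finditer(value):
            if match.group("open"):
                fields = match.group("open").split("-")
                eid = fields[0]
                etype = fields[1] if len(fields) > 1 else ""
                if match.group("single"):
                    spans.append((i, i, eid, etype))
                else:
                    open_mentions.setdefault(eid, []).append((i, etype))
            else:
                # A closing bracket names only the entity id.
                eid = match.group("close")
                start, etype = open_mentions[eid].pop()
                spans.append((start, i, eid, etype))
    return sorted(spans)


# Hypothetical four-token phrase: one mention spans all tokens, and the
# last token is also a single-word mention nested inside it.
values = ["(e1-person-2", "", "", "(e2-place-1)e1)"]
print(mention_spans(values))
```

In this example the closing `e1)` on the last token belongs to the bracket opened on the first token, while `(e2-place-1)` opens and closes on the same token, so the two mentions are nested rather than sequential.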

dan-zeman commented 1 month ago

Yes, the entities in AnCora are in the CorefUD format and they can be processed using Udapi. Entities can be nested and the BIO format cannot express that, so I would avoid the format whenever possible (that is, unless I need to use a tool that requires that format).
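If a downstream tool does require BIO, the nesting has to be flattened somehow. The sketch below shows one possible policy (keep the longest span and drop anything that overlaps it); this policy is an assumption for illustration, not something the treebank or Udapi prescribes. The function name `spans_to_bio` and the `(start, end, label)` input shape are also hypothetical.

```python
def spans_to_bio(n_tokens, spans):
    """Flatten (start, end, label) spans into BIO tags.

    Indices are 0-based and end is inclusive. Longer spans win; any span
    that overlaps an already tagged token is silently dropped.
    """
    tags = ["O"] * n_tokens
    # Sort longest first (ties broken by start position).
    for start, end, label in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if any(tag != "O" for tag in tags[start:end + 1]):
            continue  # nested or overlapping mention: lost in BIO
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags


# Hypothetical input: a four-token "person" mention with a one-token
# "place" mention nested at its end.
print(spans_to_bio(4, [(0, 3, "person"), (3, 3, "place")]))
```

With this input the nested `place` mention disappears from the output, which is exactly the information loss described above. Keeping the innermost spans instead would lose the outer one, so either choice discards part of the annotation.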