Open acxcv opened 1 year ago
Hi @acxcv,
Thank you for clarifying your issue. I see that the current way of doing it is not very intuitive with Counter(ent.kb_id_ for ent in doc.ents).
What is making me think at the moment is that we have two different definitions of entities: the linked resource itself, identified by its ID (e.g. http://dbpedia.org/resource/Spain), and the mention in the text (Span), which also contains the information about where in the text it appears (the 1st "Spain" has ent.start=9 and ent.start_char=49, while the 2nd "Spain" has ent.start=34 and ent.start_char=185).
The behaviour of the Counter could be changed depending on the definitions of __hash__(), __eq__() and __cmp__(), which define the identity and comparison between objects. In the spaCy world, entities belong to the Span class, and its hash and comparison also consider the position (start and end): https://github.com/explosion/spaCy/blob/master/spacy/tokens/span.pyx#L156
The desired behaviour would be doc.ents[1] == doc.ents[6], because they have the same ID (http://dbpedia.org/resource/Spain). I only see positive points in this result.
But, due to the definition of a Span (the class that holds entities), the result is doc.ents[1] != doc.ents[6], because they have different start and end positions.
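The difference between the two notions of equality can be sketched without spaCy. The Mention class below is a hypothetical stand-in, not spaCy's Span: it only mimics the fact that position takes part in hashing and comparison. The start values come from the example above; the end values are assumed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for an entity mention; not spaCy's Span class."""
    text: str
    start: int
    end: int      # assumed values, the thread only gives the start offsets
    kb_id_: str


SPAIN = "http://dbpedia.org/resource/Spain"
first = Mention("Spain", 9, 10, SPAIN)
second = Mention("Spain", 34, 35, SPAIN)

# The generated __eq__ and __hash__ use every field, including the position,
# so the two mentions are different objects as far as Counter is concerned.
by_mention = Counter([first, second])

# Counting the identifier instead merges both mentions into one key.
by_id = Counter(m.kb_id_ for m in (first, second))

print(first == second)
print(len(by_mention), by_id[SPAIN])
```

Counting the mentions yields two keys with a count of one each, while counting the identifiers yields a single key with a count of two.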
I think one reason for this is that the standard built-in spaCy models (e.g. en_core_web_md and en_core_web_lg) do not have the concept of an entity ID (only NER, not NEL). Maybe this will change in the future, as I see they are developing a lot of exciting things on the EntityLinker side.
Unfortunately, I think that it would be better to keep this behaviour in the default operations with the entities.
As you suggested, adding new properties to the doc would be better.
The new properties would go under the extension object, for example doc._.unique_ents, as putting them directly under doc would be more difficult.
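A minimal sketch of what such properties could compute, assuming every entity exposes a kb_id_ attribute as in the example above. The function names follow the names suggested in this issue and are hypothetical, not part of any released API. A stand-in document built from SimpleNamespace is used so the sketch runs without spaCy or a linker; the Madrid identifier is illustrative data.

```python
from collections import Counter
from types import SimpleNamespace


def unique_ents(doc):
    """Distinct KB identifiers of doc.ents, in order of first appearance."""
    # dict.fromkeys removes duplicates while keeping insertion order.
    return list(dict.fromkeys(ent.kb_id_ for ent in doc.ents if ent.kb_id_))


def unique_ent_counts(doc):
    """Counter mapping each KB identifier to its number of mentions."""
    return Counter(ent.kb_id_ for ent in doc.ents if ent.kb_id_)


# Stand-in for a processed document; only the attributes used above exist.
spain = "http://dbpedia.org/resource/Spain"
madrid = "http://dbpedia.org/resource/Madrid"
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Spain", kb_id_=spain),
    SimpleNamespace(text="Madrid", kb_id_=madrid),
    SimpleNamespace(text="Spain", kb_id_=spain),
])

print(unique_ents(fake_doc))
print(unique_ent_counts(fake_doc))
```

With spaCy installed, the same functions could be registered as getters, for instance Doc.set_extension("unique_ents", getter=unique_ents), which would make the value available as doc._.unique_ents. Returning identifiers rather than Span objects is a design choice of this sketch, since Span equality depends on position.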
I can provide an implementation, but not very soon because I am quite busy writing my PhD thesis at the moment.
Martino
I edited this post. I had been confused about how entity counts are handled in processed docs.
Example
Expected behavior
Actual behavior
This makes sense because there may be entities that share the same surface form but point to different identifiers in the KG. However, it did cause some confusion for me.
This seems to be equivalent to
Solution
In order to achieve what I wanted, I needed to call
In this example, counting kb_id_s led to the same result as counting __str__ properties. However, I would discourage using __str__ unless you're interested in surface forms only.
If this becomes relevant to other users, I suggest implementing something like doc.unique_ents for the set of entities in the document and doc.unique_ent_counts for the Counter dict.
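The caveat about surface forms can be illustrated with a small sketch. The pairs below are made-up illustrative data, not output from a real pipeline: two mentions share the surface form "Georgia" but are assumed to link to different identifiers.

```python
from collections import Counter

# (surface form, KB identifier) pairs; illustrative data only.
mentions = [
    ("Georgia", "http://dbpedia.org/resource/Georgia_(country)"),
    ("Georgia", "http://dbpedia.org/resource/Georgia_(U.S._state)"),
    ("Spain", "http://dbpedia.org/resource/Spain"),
    ("Spain", "http://dbpedia.org/resource/Spain"),
]

# Counting surface forms merges the two different "Georgia" entities.
by_surface = Counter(text for text, _ in mentions)

# Counting identifiers keeps them apart and still merges both "Spain" mentions.
by_kb_id = Counter(kb_id for _, kb_id in mentions)

print(by_surface)
print(by_kb_id)
```

Here the surface-form count reports two distinct entries while the identifier count reports three, which is why the two approaches only agree when no surface form is ambiguous.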