Counting multiple entity mentions

MartinoMensio / spacy-dbpedia-spotlight

A spaCy wrapper for DBpedia Spotlight

MIT License

>>> from collections import Counter
>>> # nlp is assumed to be a spaCy pipeline with the dbpedia_spotlight component already added
>>> text = ("Barcelona is a city on the coast of northeastern Spain. "
...         "It is the capital and largest city of the autonomous community of Catalonia, "
...         "as well as the second most populous municipality of Spain.")
>>> doc = nlp(text)

>>> Counter(ent.kb_id_ for ent in doc.ents)
Counter({'http://dbpedia.org/resource/City': 2,
         'http://dbpedia.org/resource/Spain': 2,
         'http://dbpedia.org/resource/Province_of_Barcelona': 1,
         'http://dbpedia.org/resource/Catalan_language': 1,
         'http://dbpedia.org/resource/Spanish_language': 1,
         'http://dbpedia.org/resource/Autonomous_communities_of_Spain': 1,
         'http://dbpedia.org/resource/Catalonia': 1,
         'http://dbpedia.org/resource/Populous_(company)': 1,
         'http://dbpedia.org/resource/Municipalities_of_Spain': 1})

>>> Counter(ent.text for ent in doc.ents)
Counter({'city': 2,
         'Spain': 2,
         'Barcelona': 1,
         'Catalan': 1,
         'Spanish': 1,
         'autonomous community': 1,
         'Catalonia': 1,
         'populous': 1,
         'municipality': 1})

Hi @acxcv, thank you for clarifying your issue. I agree that the current way of counting, Counter(ent.kb_id_ for ent in doc.ents), is not very intuitive.

What gives me pause is that there are two different definitions of an entity:

The first, and the most natural one, is the entity itself: "Spain" appears twice in your text, so it should count as a single entity with 2 appearances.
The second, which should probably be called an "entity mention" or "entity appearance" (in spaCy these are Span objects), also records where in the text it appears: the 1st "Spain" has ent.start=9 and ent.start_char=49, while the 2nd "Spain" has ent.start=34 and ent.start_char=185.
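
To make the two definitions concrete, here is a minimal sketch that groups mentions under their entity ID. The Mention class and the group_by_entity helper are hypothetical stand-ins, not part of spaCy or of this package. The "Spain" offsets are the ones quoted above, while the "Catalonia" offsets are worked out by hand from the example text and should be treated as an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for a spaCy Span that carries an entity ID."""
    text: str
    kb_id: str
    start: int       # token offset, like ent.start
    start_char: int  # character offset, like ent.start_char


def group_by_entity(mentions):
    """Map each entity ID to the list of mentions that refer to it."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention.kb_id].append(mention)
    return dict(grouped)


mentions = [
    Mention("Spain", "http://dbpedia.org/resource/Spain", 9, 49),
    # Offsets below are hand-computed from the example text (assumption).
    Mention("Catalonia", "http://dbpedia.org/resource/Catalonia", 23, 122),
    Mention("Spain", "http://dbpedia.org/resource/Spain", 34, 185),
]

entities = group_by_entity(mentions)
for kb_id, group in entities.items():
    # One line per entity: its ID, number of mentions, and their positions.
    print(kb_id, len(group), [(m.start, m.start_char) for m in group])
```

Here there are three mentions but only two entities, and the entity for Spain keeps both of its positions.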

The behaviour of Counter depends on the definitions of __hash__() and __eq__(), which determine identity and equality between objects. In spaCy, entities are instances of the Span class, whose hash and comparison also take the position (start and end) into account: https://github.com/explosion/spaCy/blob/master/spacy/tokens/span.pyx#L156

The desired behaviour would be doc.ents[1] == doc.ents[6], because both have the same ID (http://dbpedia.org/resource/Spain). I see only advantages in this result.

However, because of how Span (the class that holds entities) is defined, the result is doc.ents[1] != doc.ents[6], since the two have different start and end positions.
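
The effect of __hash__() and __eq__() on Counter can be illustrated without spaCy. This is only a sketch: PositionalMention is a hypothetical stand-in that mimics position-aware identity, EntityKey is a hypothetical wrapper that compares by entity ID alone, and the end offsets are assumed to be start + 1 for a one-token mention.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionalMention:
    """Hypothetical stand-in for Span: equality and hash include the position."""
    kb_id: str
    start: int
    end: int  # assumed to be start + 1 for a single-token mention


class EntityKey:
    """Hypothetical wrapper whose identity is the entity ID alone."""

    def __init__(self, mention):
        self.mention = mention

    def __eq__(self, other):
        return isinstance(other, EntityKey) and self.mention.kb_id == other.mention.kb_id

    def __hash__(self):
        return hash(self.mention.kb_id)


spain_1 = PositionalMention("http://dbpedia.org/resource/Spain", 9, 10)
spain_2 = PositionalMention("http://dbpedia.org/resource/Spain", 34, 35)

# Position-aware identity: the two mentions are different keys.
by_position = Counter([spain_1, spain_2])
# ID-only identity: both mentions collapse into a single key.
by_entity = Counter([EntityKey(spain_1), EntityKey(spain_2)])

print(spain_1 == spain_2)                        # False
print(EntityKey(spain_1) == EntityKey(spain_2))  # True
print(len(by_position), len(by_entity))          # 2 1
```

Wrapping the mentions leaves the comparison semantics of the underlying objects untouched, which is roughly the trade-off discussed here.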

I think one reason for this is that spaCy's standard built-in models (e.g. en_core_web_md and en_core_web_lg) have no concept of an entity ID (they do NER only, not NEL). This may change in the future, as I see a lot of exciting development on the EntityLinker side.

Unfortunately, I think it is better to keep this behaviour for the default operations on entities. As you suggested, adding new properties to the doc would be the better route. They would live under the extension object, for example doc._.unique_ents, because adding them directly to doc would be more difficult.
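
As a rough sketch of what such an extension might look like (not an implementation that exists in this package), a getter could return the first mention of each distinct entity ID. The unique_ents function and its return shape are assumptions. Doc.set_extension and Doc.has_extension are spaCy's documented custom-attribute API, and the registration is only attempted when spaCy is installed. The stand-in document at the end exists purely so the getter can be tried without a pipeline.

```python
from types import SimpleNamespace


def unique_ents(doc):
    """Return the first mention of each distinct entity ID, in document order."""
    first_seen = {}
    for ent in doc.ents:
        # setdefault keeps the earliest mention for each kb_id_.
        first_seen.setdefault(ent.kb_id_, ent)
    return list(first_seen.values())


try:
    from spacy.tokens import Doc

    # Register as doc._.unique_ents, unless something already did.
    if not Doc.has_extension("unique_ents"):
        Doc.set_extension("unique_ents", getter=unique_ents)
except ImportError:
    # spaCy is not installed: the plain function can still be called directly.
    pass


# Hypothetical stand-in for a processed Doc, only for trying the getter.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Spain", kb_id_="http://dbpedia.org/resource/Spain"),
    SimpleNamespace(text="Catalonia", kb_id_="http://dbpedia.org/resource/Catalonia"),
    SimpleNamespace(text="Spain", kb_id_="http://dbpedia.org/resource/Spain"),
])

print([ent.text for ent in unique_ents(fake_doc)])  # ['Spain', 'Catalonia']
```

With a real pipeline the same result would be read as doc._.unique_ents, and doc.ents would keep its default per-mention behaviour.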

I can provide an implementation, but not very soon, as I am quite busy writing my PhD thesis at the moment.

Martino

Counting multiple entity mentions #22