MartinoMensio / spacy-dbpedia-spotlight

A spaCy wrapper for DBpedia Spotlight
MIT License

Counting multiple entity mentions #22

Open acxcv opened 1 year ago

acxcv commented 1 year ago

I edited this post. I had been confused about how entity counts are handled in processed docs.

Example

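>>> from collections import Counter
>>> import spacy
>>> # Assumed setup (not shown in the original post): a blank pipeline with the Spotlight component
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('dbpedia_spotlight')
>>> text = "Barcelona is a city on the coast of northeastern Spain. It is the capital and largest city of the autonomous community of Catalonia, as well as the second most populous municipality of Spain."
>>> doc = nlp(text)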

Expected behavior

>>> Counter(doc.ents)
Counter({'city': 2, 'Spain': 2, 'Barcelona': 1, 'Catalan': 1, 'Spanish': 1, 'autonomous community': 1, 'Catalonia': 1, 'populous': 1, 'municipality': 1}) 

Actual behavior

>>> Counter(doc.ents)
Counter({Barcelona: 1, Catalan: 1, Spanish: 1, city: 1, Spain: 1, city: 1, autonomous community: 1, Catalonia: 1, populous: 1, municipality: 1, Spain: 1})

This makes sense, because entities can share the same surface form yet point to different identifiers in the KG. However, it did cause some confusion for me.

This seems to be equivalent to

>>> Counter(doc.spans['dbpedia_spotlight'])
Counter({Barcelona: 1, Catalan: 1, Spanish: 1, city: 1, Spain: 1, city: 1, autonomous community: 1, Catalonia: 1, populous: 1, municipality: 1, Spain: 1}) 

Solution

In order to achieve what I wanted, I needed to call

>>> Counter(ent.kb_id_ for ent in doc.ents)
Counter({'http://dbpedia.org/resource/City': 2, 'http://dbpedia.org/resource/Spain': 2, 'http://dbpedia.org/resource/Province_of_Barcelona': 1, 'http://dbpedia.org/resource/Catalan_language': 1, 'http://dbpedia.org/resource/Spanish_language': 1, 'http://dbpedia.org/resource/Autonomous_communities_of_Spain': 1, 'http://dbpedia.org/resource/Catalonia': 1, 'http://dbpedia.org/resource/Populous_(company)': 1, 'http://dbpedia.org/resource/Municipalities_of_Spain': 1})                                         

In this example, counting kb_id_ values gives the same counts as counting the __str__ output. However, I would advise against using __str__ unless you are interested in surface forms only.

>>> Counter(ent.__str__() for ent in doc.ents)
Counter({'city': 2, 'Spain': 2, 'Barcelona': 1, 'Catalan': 1, 'Spanish': 1, 'autonomous community': 1, 'Catalonia': 1, 'populous': 1, 'municipality': 1})
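The two counts stop agreeing as soon as one surface form links to several resources, or several surface forms link to one resource. A minimal sketch of that case follows; the (surface form, ID) pairs are made up for illustration and are not Spotlight output.

```python
from collections import Counter

# Hypothetical (surface form, knowledge-base ID) pairs, standing in for
# (ent.text, ent.kb_id_) of linked entities. These are NOT real Spotlight results.
mentions = [
    ("Barcelona", "http://dbpedia.org/resource/Barcelona"),
    ("Barcelona", "http://dbpedia.org/resource/FC_Barcelona"),
    ("Spain", "http://dbpedia.org/resource/Spain"),
    ("Kingdom of Spain", "http://dbpedia.org/resource/Spain"),
]

# Counting surface forms merges the two different "Barcelona" resources
# and splits the two mentions of Spain.
by_surface = Counter(text for text, _ in mentions)

# Counting IDs keeps the two Barcelona resources apart and merges Spain.
by_kb_id = Counter(kb_id for _, kb_id in mentions)

print(by_surface)
print(by_kb_id)
```

Here the surface-form count reports "Barcelona" twice although two different resources are meant, which is why counting by ID is the safer default.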

If this becomes relevant to other users, I suggest implementing something like doc.unique_ents for the set of entities in the document and doc.unique_ent_counts for the Counter dict.

MartinoMensio commented 1 year ago

Hi @acxcv, thank you for clarifying your issue. I agree that the current way of doing it, Counter(ent.kb_id_ for ent in doc.ents), is not very intuitive.

What gives me pause is that there are two different definitions of an entity at play: a mention at a specific position in the text, and the knowledge-base resource that the mention is linked to.

The behaviour of Counter is determined by the definitions of __hash__() and __eq__(), which define identity and equality between objects. In spaCy, entities are instances of the Span class, whose hash and comparison also take the position (start and end) into account: https://github.com/explosion/spaCy/blob/master/spacy/tokens/span.pyx#L156

The desired behaviour would be doc.ents[4] == doc.ents[10] (the two Spain mentions in the output above), because they have the same ID (http://dbpedia.org/resource/Spain). I only see positive points in this result.

But, due to the definition of a Span (the class that holds entities), the result is doc.ents[4] != doc.ents[10], because they have different start and end positions.
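To make the difference concrete, here is a toy sketch using plain dataclasses rather than spaCy objects. PositionalMention and LinkedMention are hypothetical names invented for this illustration; they only mimic the two notions of identity discussed here and are not how Span is implemented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PositionalMention:
    """Toy stand-in for a span: position is part of equality and hash."""
    start: int
    end: int
    kb_id: str


@dataclass(frozen=True)
class LinkedMention:
    """Toy stand-in where only the knowledge-base ID defines identity."""
    start: int = field(compare=False)  # excluded from __eq__ and __hash__
    end: int = field(compare=False)    # excluded from __eq__ and __hash__
    kb_id: str = ""


spain = "http://dbpedia.org/resource/Spain"

# Same ID, different (made-up) character offsets.
positional = Counter([PositionalMention(10, 15, spain), PositionalMention(40, 45, spain)])
linked = Counter([LinkedMention(10, 15, spain), LinkedMention(40, 45, spain)])

print(len(positional))  # two distinct keys, one per position
print(len(linked))      # one key, counted twice
```

Counter uses the key's hash and equality, so the positional variant yields two keys with a count of one each, while the ID-based variant yields a single key with a count of two.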

I think one reason for this is that the standard built-in spaCy models (e.g. en_core_web_md and en_core_web_lg) have no concept of an entity ID: they do NER only, not NEL. Maybe this will change in the future, as I see they are developing a lot of exciting things on the EntityLinker side.

Unfortunately, I think it is better to keep this behaviour for the default operations on entities. As you suggested, adding new properties to the doc would be the better route. The new properties would go under the extension object, for example doc._.unique_ents, as putting them directly under doc would be more difficult.
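As a rough sketch of what such extension properties could look like (this is not an implementation shipped by this package), they can be registered with spaCy's Doc.set_extension and a getter. The names unique_ents and unique_ent_counts come from the suggestion in this thread; the hand-built document, the entity label, and the choice to return IDs rather than Span objects are assumptions made so the snippet runs without a Spotlight service.

```python
from collections import Counter

from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def unique_ent_counts(doc):
    """Count linked entities by knowledge-base ID instead of by span."""
    return Counter(ent.kb_id_ for ent in doc.ents if ent.kb_id_)


def unique_ents(doc):
    """Return the set of distinct knowledge-base IDs in the document."""
    return set(unique_ent_counts(doc))


# Hypothetical extension names; force=True allows re-running the snippet
# without an error about already-registered extensions.
Doc.set_extension("unique_ent_counts", getter=unique_ent_counts, force=True)
Doc.set_extension("unique_ents", getter=unique_ents, force=True)

# Offline stand-in for pipeline output: the entities are set by hand,
# so no DBpedia Spotlight request is made. The label is a placeholder.
words = ["Spain", "borders", "Portugal", ".", "Spain", "is", "in", "Europe", "."]
doc = Doc(Vocab(), words=words)
spain = "http://dbpedia.org/resource/Spain"
doc.ents = [
    Span(doc, 0, 1, label="DBPEDIA_ENT", kb_id=spain),
    Span(doc, 4, 5, label="DBPEDIA_ENT", kb_id=spain),
]

print(doc._.unique_ent_counts)
print(doc._.unique_ents)
```

Using getters keeps the default doc.ents behaviour untouched and computes the counts on access, so they always reflect the current entities.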

I can provide an implementation, but not very soon because I am quite busy writing my PhD thesis at the moment.

Martino