MartinoMensio / spacy-dbpedia-spotlight

A spaCy wrapper for DBpedia Spotlight
MIT License

How to return the specified label for entities? #17

Open xyFreddie opened 2 years ago

xyFreddie commented 2 years ago

I want to use this to annotate my training data. For example, I specified DBpedia:Album, DBpedia:Song, and DBpedia:MusicalArtist, but if I print ent.label_, it only returns 'DBPEDIA_ENT' for all entities. Is there any way to actually retrieve Album, Song, etc.? An additional question: what is the format of the text if I want to use it as training data input for spaCy? Thank you in advance.

MartinoMensio commented 2 years ago

Hi @xyFreddie, Thank you for opening this issue. Originally, when developing this package, I wanted to use the correct labels in ent.label_, as you are saying. Then a little technicality arose: DBpedia Spotlight returns a list of types, while spaCy only allows a single type for each entity. So I made the decision to keep only one type, DBPEDIA_ENT, and to leave the details of the types in span._.dbpedia_raw_result.

So what I would suggest, with the current state of this library, is to use the following snippet:

# setting up
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight')
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)

# see which entities have been recognised
print(doc.ents)
# ---> (Please Please Me, studio album, English, rock)

# see the complete details for the first entity
doc.ents[0]._.dbpedia_raw_result
# --> {'@URI': 'http://dbpedia.org/resource/Please_Please_Me', '@support': '224', '@types': 'Wikidata:Q482994,Wikidata:Q386724,Wikidata:Q2188189,Schema:MusicAlbum,Schema:CreativeWork,DBpedia:Work,DBpedia:MusicalWork,DBpedia:Album', '@surfaceForm': 'Please Please Me', '@offset': '0', '@similarityScore': '0.999998495059141', '@percentageOfSecondRank': '1.5049431293805952E-6'}

# extract the types for each entity, which are separated by commas
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]
for ent, types in ents_with_types:
    print(ent, '\t', types)
# -->
# Please Please Me       ['Wikidata:Q482994', 'Wikidata:Q386724', 'Wikidata:Q2188189', 'Schema:MusicAlbum', 'Schema:CreativeWork', 'DBpedia:Work', 'DBpedia:MusicalWork', 'DBpedia:Album']
# studio album     ['']
# English          ['Wikidata:Q315', 'Schema:Language', 'DBpedia:Language']
# rock     ['Wikidata:Q188451', 'DUL:Concept', 'DBpedia:TopicalConcept', 'DBpedia:Genre', 'DBpedia:MusicGenre']

Now, as you can see, it's not the most straightforward thing to do. But at the moment I cannot think of an easier solution that is compatible with spaCy. Do you think having the list of types readily available in e.g. span._.dbpedia_types could be a good idea? That way you could write simpler code:

# this is not implemented
for ent in doc.ents:
    print(ent, '\t', ent._.dbpedia_types)
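(For reference, here is a possible sketch of how this could be implemented, not part of the library: a small helper that parses the raw result, which could then be exposed through a custom Span extension. The name parse_dbpedia_types is my own, hypothetical name.)

```python
# Sketch only: 'parse_dbpedia_types' is a hypothetical helper, not library API.
def parse_dbpedia_types(raw_result):
    """Split the comma-separated '@types' field, dropping empty entries."""
    if not raw_result:
        return []
    return [t for t in raw_result.get('@types', '').split(',') if t]

# It could then be registered as a getter on Span, e.g.:
# from spacy.tokens import Span
# Span.set_extension('dbpedia_types',
#                    getter=lambda span: parse_dbpedia_types(span._.dbpedia_raw_result))
```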

For your second question instead, let me just check if I understood correctly what you want to do:

  1. perform NER with DBpedia-spotlight
  2. Train a spaCy NER model with the entities from step 1
  3. Use this offline NER model without the need to use DBpedia-spotlight

If that's correct, this is what you can do:

Step 1: you need to define which entity types you want to use from DBpedia Spotlight, and in which order of priority. For example, if an entity has both the types 'DBpedia:Genre' and 'DBpedia:MusicGenre', which one do you prefer? You need to define some rules, for example by basing your decision on a priority list.

Example:

# list of types of interests, order matters
priority_list = ['DBpedia:MusicGenre', 'DBpedia:Album', 'DBpedia:Song', 'DBpedia:MusicalArtist']

# example document as before
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)

# now get the types as before
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]

# now select a single type and only for the entities with type in priority_list
ents_selected_with_type = []
for ent, types in ents_with_types:
    ent_type = next((t for t in priority_list if t in types), None)
    if ent_type:
        ents_selected_with_type.append((ent, ent_type))

# now we only have 2 entities and their type, this is our gold standard for training
print(ents_selected_with_type)

# convert to the training format of spaCy
TRAIN_DATA = [(text, {'entities': [(ent.start_char, ent.end_char, ent_type) for ent, ent_type in ents_selected_with_type]})]

Step 2: use the approach from this article: https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7 TRAIN_DATA format is the same, so it should be compatible.
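(A side note, not from the article: spaCy v3 trains from a serialized DocBin rather than the raw tuple format, so if you are on v3 you need a conversion step. A hedged sketch, assuming the TRAIN_DATA list of (text, annotations) tuples built above; to_docbin is my own helper name.)

```python
# Hedged sketch: convert TRAIN_DATA-style tuples into a DocBin, the
# training format expected by `spacy train` in spaCy v3.
import spacy
from spacy.tokens import DocBin

def to_docbin(train_data, nlp):
    """train_data: list of (text, {'entities': [(start, end, label), ...]})."""
    db = DocBin()
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        spans = [
            doc.char_span(start, end, label=label)
            for start, end, label in annotations['entities']
        ]
        # char_span returns None when offsets don't align with token
        # boundaries; drop those entities rather than crash.
        doc.ents = [s for s in spans if s is not None]
        db.add(doc)
    return db

# Usage:
# db = to_docbin(TRAIN_DATA, spacy.blank('en'))
# db.to_disk('./train.spacy')
```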

Best, Martino

xyFreddie commented 2 years ago

Thank you so much! That helped a lot. Another quick question: should I preserve the sentences that don't contain any desired entities in the training data? Or can I just drop them, so that the training data only contains sentences that have at least one entity?

MartinoMensio commented 2 years ago

Hi @xyFreddie, That's a good question! I don't know the common practice for this type of training. This is my wild guess:

Therefore, I would consider two things:

Good luck! Martino