MartinoMensio / spacy-dbpedia-spotlight

A spaCy wrapper for DBpedia Spotlight
MIT License

How to return the specified label for entities? #17

Open xyFreddie opened 2 years ago

xyFreddie commented 2 years ago

I want to use this to annotate my training data. For example, I specified DBpedia:Album, DBpedia:Song, and DBpedia:MusicalArtist, but if I print ent.label_, it only returns 'DBPEDIA_ENT' for all entities. Is there any way to actually retrieve Album, Song, etc.? An additional question: what is the format of the text if I want to use it as training data input for spaCy? Thank you in advance.

MartinoMensio commented 2 years ago

Hi @xyFreddie, Thank you for opening this issue. Originally, when developing this package, I wanted to use the correct labels in ent.label_, as you are saying. Then a little technicality arose: DBpedia Spotlight returns a list of types, while spaCy only allows a single type for each entity. So I made the decision to keep only one type, DBPEDIA_ENT, and to leave the details of the types in span._.dbpedia_raw_result.

So what I would suggest, with the current state of this library, is to use the following snippet:

# setting up
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight')
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)

# see which entities have been recognised
print(doc.ents)
# ---> (Please Please Me, studio album, English, rock)

# see the complete details for the first entity
doc.ents[0]._.dbpedia_raw_result
# --> {'@URI': 'http://dbpedia.org/resource/Please_Please_Me', '@support': '224', '@types': 'Wikidata:Q482994,Wikidata:Q386724,Wikidata:Q2188189,Schema:MusicAlbum,Schema:CreativeWork,DBpedia:Work,DBpedia:MusicalWork,DBpedia:Album', '@surfaceForm': 'Please Please Me', '@offset': '0', '@similarityScore': '0.999998495059141', '@percentageOfSecondRank': '1.5049431293805952E-6'}

# extract the types for each entity, which are separated by commas
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]
for ent, types in ents_with_types:
    print(ent, '\t', types)
# -->
# Please Please Me       ['Wikidata:Q482994', 'Wikidata:Q386724', 'Wikidata:Q2188189', 'Schema:MusicAlbum', 'Schema:CreativeWork', 'DBpedia:Work', 'DBpedia:MusicalWork', 'DBpedia:Album']
# studio album     ['']
# English          ['Wikidata:Q315', 'Schema:Language', 'DBpedia:Language']
# rock     ['Wikidata:Q188451', 'DUL:Concept', 'DBpedia:TopicalConcept', 'DBpedia:Genre', 'DBpedia:MusicGenre']

Now, as you can see, it's not the most straightforward thing to do. But at the moment I cannot think of an easier solution that is compatible with spaCy. Do you think having the list of types readily available in e.g. span._.dbpedia_types could be a good idea? That way you could write simpler code:

# this is not implemented
for ent in doc.ents:
    print(ent, '\t', ent._.dbpedia_types)
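(For reference, here is a possible sketch of how this could be implemented, not part of the library: a small helper that parses the raw result, which could then be exposed through a custom Span extension. The name parse_dbpedia_types is my own, hypothetical name.)

```python
# Sketch only: 'parse_dbpedia_types' is a hypothetical helper, not library API.
def parse_dbpedia_types(raw_result):
    """Split the comma-separated '@types' field, dropping empty entries."""
    if not raw_result:
        return []
    return [t for t in raw_result.get('@types', '').split(',') if t]

# It could then be registered as a getter on Span, e.g.:
# from spacy.tokens import Span
# Span.set_extension('dbpedia_types',
#                    getter=lambda span: parse_dbpedia_types(span._.dbpedia_raw_result))
```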

For your second question instead, let me just check if I understood correctly what you want to do:

  1. perform NER with DBpedia-spotlight
  2. Train a spaCy NER model with the entities from step 1
  3. Use this offline NER model without the need to use DBpedia-spotlight

If that's correct, this is what you can do:

Step 1: you need to define which entity types you want to use from DBpedia Spotlight, and in which order of priority. For example, if an entity has both the types 'DBpedia:Genre' and 'DBpedia:MusicGenre', which one do you prefer? You need to define some rules, for example by basing your decision on a priority list.

Example:

# list of types of interests, order matters
priority_list = ['DBpedia:MusicGenre', 'DBpedia:Album', 'DBpedia:Song', 'DBpedia:MusicalArtist']

# example document as before
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)

# now get the types as before
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]

# now select a single type and only for the entities with type in priority_list
ents_selected_with_type = []
for ent, types in ents_with_types:
    ent_type = next((t for t in priority_list if t in types), None)
    if ent_type:
        ents_selected_with_type.append((ent, ent_type))

# now we only have 2 entities and their type, this is our gold standard for training
print(ents_selected_with_type)

# convert to the training format of spaCy
TRAIN_DATA = [(text, {'entities': [(ent.start_char, ent.end_char, ent_type) for ent, ent_type in ents_selected_with_type]})]

Step 2: use the approach from this article: https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7 TRAIN_DATA format is the same, so it should be compatible.
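(A side note, not from the article: spaCy v3 trains from a serialized DocBin rather than the raw tuple format, so if you are on v3 you need a conversion step. A hedged sketch, assuming the TRAIN_DATA list of (text, annotations) tuples built above; to_docbin is my own helper name.)

```python
# Hedged sketch: convert TRAIN_DATA-style tuples into a DocBin, the
# training format expected by `spacy train` in spaCy v3.
import spacy
from spacy.tokens import DocBin

def to_docbin(train_data, nlp):
    """train_data: list of (text, {'entities': [(start, end, label), ...]})."""
    db = DocBin()
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        spans = [
            doc.char_span(start, end, label=label)
            for start, end, label in annotations['entities']
        ]
        # char_span returns None when offsets don't align with token
        # boundaries; drop those entities rather than crash.
        doc.ents = [s for s in spans if s is not None]
        db.add(doc)
    return db

# Usage:
# db = to_docbin(TRAIN_DATA, spacy.blank('en'))
# db.to_disk('./train.spacy')
```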

Best, Martino

xyFreddie commented 2 years ago

Thank you so much! That helped a lot. Another quick question: should I preserve the sentences that don't contain any desired entities in the training data? Or can I just drop them, so that the training data only contains sentences that have at least one entity?

MartinoMensio commented 2 years ago

Hi @xyFreddie, That's a good question! I don't know the common practice for this type of training. This is my wild guess:

Therefore, I would consider two things:

Good luck! Martino