Open xyFreddie opened 2 years ago
Hi @xyLinear,
Thank you for opening this issue. When I originally developed this package, I wanted to use the correct labels in ent.label_, as you suggest. Then a technicality arose: DBpedia-spotlight returns a list of types for each entity, while spaCy allows only a single label per entity. So I decided to keep a single label, DBPEDIA_ENT, and to leave the details of the types in span._.dbpedia_raw_result.
So what I would suggest, with the current state of this library, is to use the following snippet:
# setting up
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight')
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)
# see which entities have been recognised
print(doc.ents)
# ---> (Please Please Me, studio album, English, rock)
# see the complete details for the first entity
doc.ents[0]._.dbpedia_raw_result
# --> {'@URI': 'http://dbpedia.org/resource/Please_Please_Me', '@support': '224', '@types': 'Wikidata:Q482994,Wikidata:Q386724,Wikidata:Q2188189,Schema:MusicAlbum,Schema:CreativeWork,DBpedia:Work,DBpedia:MusicalWork,DBpedia:Album', '@surfaceForm': 'Please Please Me', '@offset': '0', '@similarityScore': '0.999998495059141', '@percentageOfSecondRank': '1.5049431293805952E-6'}
# extract the types for each entity, which are separated by commas
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]
for ent, types in ents_with_types:
    print(ent, '\t', types)
# -->
# Please Please Me ['Wikidata:Q482994', 'Wikidata:Q386724', 'Wikidata:Q2188189', 'Schema:MusicAlbum', 'Schema:CreativeWork', 'DBpedia:Work', 'DBpedia:MusicalWork', 'DBpedia:Album']
# studio album ['']
# English ['Wikidata:Q315', 'Schema:Language', 'DBpedia:Language']
# rock ['Wikidata:Q188451', 'DUL:Concept', 'DBpedia:TopicalConcept', 'DBpedia:Genre', 'DBpedia:MusicGenre']
As you can see, this is not very straightforward, but at the moment I cannot think of an easy solution that is compatible with spaCy. Do you think having the list of types already available in, e.g., span._.dbpedia_types could be a good idea? That would allow simpler code:
# this is not implemented
for ent in doc.ents:
    print(ent, '\t', ent._.dbpedia_types)
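Until something like that exists in the library, the parsing can live in a small helper. The sketch below is plain Python and only assumes the '@types' key shown in the raw result above; parse_dbpedia_types is a hypothetical name, not part of the package. Unlike a bare split(','), it returns an empty list when DBpedia-spotlight gives no types (the 'studio album' case).

```python
def parse_dbpedia_types(raw_result):
    """Return the DBpedia-spotlight types of one entity as a list of strings.

    Assumption: raw_result is a dict shaped like span._.dbpedia_raw_result,
    i.e. it has an '@types' key holding a comma-separated string.
    """
    if not raw_result:
        return []
    types = raw_result.get('@types', '')
    # ''.split(',') would give [''], so drop empty pieces explicitly
    return [t.strip() for t in types.split(',') if t.strip()]


# illustrative raw results, shortened from the output shown earlier
album = {'@types': 'Schema:MusicAlbum,DBpedia:Album'}
untyped = {'@types': ''}
print(parse_dbpedia_types(album))
print(parse_dbpedia_types(untyped))
```

With spaCy, such a function could probably be attached as a custom attribute via Span.set_extension('dbpedia_types', getter=lambda span: parse_dbpedia_types(span._.dbpedia_raw_result)), after which ent._.dbpedia_types would be available on each entity.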
For your second question instead, let me just check if I understood correctly what you want to do:
If that's correct, this is what you can do:
Step 1: define which entity types you want to use from DBpedia-spotlight, and in which order of priority. For example, if an entity has both types 'DBpedia:Genre' and 'DBpedia:MusicGenre', which one do you prefer? You need a rule for this, for example a priority list.
Example:
# list of types of interest, order matters
priority_list = ['DBpedia:MusicGenre', 'DBpedia:Album', 'DBpedia:Song', 'DBpedia:MusicalArtist']
# example document as before
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)
# now get the types as before
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]
# now select a single type and only for the entities with type in priority_list
ents_selected_with_type = []
for ent, types in ents_with_types:
    ent_type = next((t for t in priority_list if t in types), None)
    if ent_type:
        ents_selected_with_type.append((ent, ent_type))
# now we only have 2 entities and their type, this is our gold standard for training
print(ents_selected_with_type)
# convert to the training format of spaCy: a list of (text, annotations) tuples
TRAIN_DATA = [(text, {'entities': [(ent.start_char, ent.end_char, ent_type) for ent, ent_type in ents_selected_with_type]})]
Step 2: use the approach from this article: https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7
The TRAIN_DATA format is the same, so it should be compatible.
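For more than one text, the selection from Step 1 could be wrapped in a helper that returns one (text, annotations) tuple per document. This is only a sketch in plain Python: pick_type and to_train_example are hypothetical names, and it assumes that each raw result has the '@offset', '@surfaceForm' and '@types' keys shown earlier and that '@offset' is a character offset into the text. The offsets for 'English' and 'rock' are computed with text.index purely for illustration.

```python
def pick_type(types, priority_list):
    """Return the first type of priority_list found in types, else None."""
    return next((t for t in priority_list if t in types), None)


def to_train_example(text, raw_results, priority_list):
    """Build one (text, {'entities': [...]}) tuple from raw results.

    Assumption: each raw result is a dict like span._.dbpedia_raw_result.
    """
    entities = []
    for raw in raw_results:
        label = pick_type(raw.get('@types', '').split(','), priority_list)
        if label is None:
            continue  # entity has none of the wanted types
        start = int(raw['@offset'])
        end = start + len(raw['@surfaceForm'])
        entities.append((start, end, label))
    return (text, {'entities': entities})


priority_list = ['DBpedia:MusicGenre', 'DBpedia:Album', 'DBpedia:Song', 'DBpedia:MusicalArtist']
text = 'Please Please Me is the debut studio album by the English rock band the Beatles'
# illustrative raw results with shortened type lists
raw_results = [
    {'@surfaceForm': 'Please Please Me', '@offset': '0',
     '@types': 'Schema:MusicAlbum,DBpedia:MusicalWork,DBpedia:Album'},
    {'@surfaceForm': 'English', '@offset': str(text.index('English')),
     '@types': 'Wikidata:Q315,Schema:Language,DBpedia:Language'},
    {'@surfaceForm': 'rock', '@offset': str(text.index('rock')),
     '@types': 'DBpedia:Genre,DBpedia:MusicGenre'},
]
TRAIN_DATA = [to_train_example(text, raw_results, priority_list)]
print(TRAIN_DATA)
```

Calling to_train_example once per text and collecting the results in a list gives the same shape as TRAIN_DATA in the article.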
Best, Martino
Thank you so much! That helped a lot. Another quick question: should I keep the sentences that don't contain any desired entities in the training data? Or can I just drop them, so that the training data only contains sentences with at least one entity?
Hi @xyLinear, That's a good question! I don't know the common practice for this type of training! This is my wild guess:
Therefore, I would consider two things:
The confidence configuration parameter of the DBpedia-spotlight API, which you can set when you initialize the pipeline (you can also specify the wanted types there): nlp.add_pipe('dbpedia_spotlight', config={'types': 'DBpedia:Place,DBpedia:Genre', 'confidence': 0.25}). You may also want entities with a lower confidence than the default value.
Good luck! Martino
I want to use this to annotate my training data. For example, I specified DBpedia:Album, DBpedia:Song, DBpedia:MusicalArtist. If I print ent.label_, it only returns 'DBPEDIA_ENT' for all entities. Is there any way to actually retrieve Album, Song, etc.? An additional question: what format should the text have if I want to use it as training data input for spaCy? Thank you in advance.