explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Are there projects that combine spaCy and SKOS taxonomies? #2176

Closed jansaasman closed 6 years ago

jansaasman commented 6 years ago

Hi: this is a low-priority and not very technical issue, but I wanted to give it a try anyway. I'm having a blast exploring spaCy. The code and the APIs are very well done, and the tutorials and documentation are amazing. So I want to combine it with my daily world of working with semantic technologies. What I tried to find on Google was projects that take an existing RDF/SKOS taxonomy and programmatically add its leaf nodes (both the prefLabels and the altLabels) to the entity extractor. Does anyone know if this has been done? Cheers, Jans

DuyguA commented 6 years ago

Hi jansaasman,

The procedure you describe is called "ontology learning", i.e. learning ontology elements from structured or unstructured data.

Projects of this sort require a huge number of ML components. Extracting relations and placing them in an appropriate place is not an easy task; usually one puts several modules together.

I include this link for a quick survey: https://www.inf.uni-hamburg.de/en/inst/ab/lt/publications/2005-biemannetal-ldvforum-ontology.pdf

I use ontologies (OWL format) myself for my chatbot project. The ontology goes into the context manager module and spaCy goes into the NLU module; we combine a knowledge graph with linguistic information that we mined from user queries.

Cheers, Duygu.

jansaasman commented 6 years ago

Hi Duygu: thanks very much for taking the time to write a response to my question. Your answer confuses me a little bit: I already have a taxonomy, so I don't really have to learn it from the unstructured data, do I? I had hoped that by providing spaCy with word lists, the entity extractor would magically do its work... or am I missing something? Regards, Jans

ines commented 6 years ago

@jansaasman Yes, that's definitely possible, too. spaCy's PhraseMatcher is designed to match very large terminology lists by using Doc objects as match patterns. spaCy's doc.ents is writable, so you can add the matches based on your word list and assign custom labels to them.

Here's a simple example:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('en')  # or any other model
matcher = PhraseMatcher(nlp.vocab)

word_list = ['single', 'and', 'multi-token words', 'from', 'your', 'list']
patterns = [nlp(word) for word in word_list]  # create Doc patterns from your word list
matcher.add('YOUR_ENTITY_LABEL', None, *patterns)

doc = nlp("lots of text you want to process...")
matches = matcher(doc)
for match_id, start, end in matches:
    entity = Span(doc, start, end, label=label)  # create a span from our match
    doc.ents = list(doc.ents) + [entity]  # add it to the entities

You might also find these two examples useful – they show how to use the PhraseMatcher to match large terminology lists and add entity labels for them:

Depending on your use case, a fully rule-based approach could be absolutely sufficient – especially if you already have extensive word lists and taxonomies. And if you do decide to train a model to recognise similar terms and phrases in context later on, you can use the matches produced by the PhraseMatcher as training data. (This section has some more details on this.)
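That last idea can be sketched roughly like this, assuming spaCy's v2-era training format of (text, {'entities': [...]}) tuples, a made-up DRUG label, and the list-style matcher.add() signature of newer spaCy versions:

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough for pure phrase matching
nlp = spacy.blank('en')
matcher = PhraseMatcher(nlp.vocab)
terms = ['aspirin', 'acetylsalicylic acid']  # e.g. labels from a taxonomy
matcher.add('DRUG', [nlp(t) for t in terms])

text = "The doctor prescribed aspirin."
doc = nlp(text)

# Convert token-level matches into character-offset annotations,
# the shape spaCy's training examples expect
entities = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    entities.append((span.start_char, span.end_char, nlp.vocab.strings[match_id]))

train_example = (text, {'entities': entities})
```

Running this over a larger corpus yields bootstrap training data whose quality is only as good as the word lists, so it's worth reviewing a sample before training on it.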

jansaasman commented 6 years ago

Thanks Ines: that is very helpful! I'll give it a try straight away....

ok: the example almost works, it is only complaining about the label=label argument... I changed it to label=1 and the example started to work

now I'll read the documentation to see what we want to do with this label :-)

no need to reply, thanks - Jans

ines commented 6 years ago

Oh sorry, that was a typo – it was supposed to say label=match_id. This will assign the ID of the string 'YOUR_ENTITY_LABEL' as the entity label. You can add several rules to your matcher like this – for example, one rule per word-list category.
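Put together, the corrected loop looks like this; a minimal self-contained sketch using a blank pipeline, a toy phrase, and the list-style matcher.add() signature of newer spaCy versions:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank('en')  # a blank pipeline is enough for matching
matcher = PhraseMatcher(nlp.vocab)
matcher.add('YOUR_ENTITY_LABEL', [nlp('multi-token words')])

doc = nlp("some multi-token words in context")
for match_id, start, end in matcher(doc):
    # match_id is the hash of 'YOUR_ENTITY_LABEL' in the StringStore,
    # so span.label_ resolves back to the original string
    entity = Span(doc, start, end, label=match_id)
    doc.ents = list(doc.ents) + [entity]
```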

I changed into label=1 and the example started to work

Using 1 as a label works, too – but it might lead to unexpected or confusing results later on if you ever convert the integer label to a string. Under the hood, spaCy encodes all strings to integers and stores them in the StringStore – you can find more info on this here. So an ID like 1 also represents a string – usually something random like a part-of-speech tag or lexical attribute (because spaCy's labels are encoded, too).

Here's an interesting example:

doc = nlp(u"hello world")
span = Span(doc, 0, 1, label=99)
print(span.text, span.label_)  # let's look at the span's text and string label
# 'hello' 'VERB'  <--- wtf?

Turns out that the part-of-speech tag "VERB" happens to be mapped to the ID 99 – you can test this by looking it up in the StringStore, i.e. vocab.strings:

nlp.vocab.strings[99]  # VERB
nlp.vocab.strings['VERB']  # 99
nlp.vocab.strings['YOUR_CUSTOM_LABEL']  # 11211630373680995617
jansaasman commented 6 years ago

Cool. I changed it and now it makes sense. Thanks for your help! Jans

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.