Closed Benja1972 closed 4 years ago
[Edit 27/10/2020] Updated the merge_compounds code to use Doc.retokenize.
Hi! Thanks for highlighting this issue.
This comes from spaCy's tokenisation, which splits hyphenated words into several tokens.
Following How to merge words split by hyphen?, Spacy - Tokenize quoted string, and lemma of merged spans only has lemma of first token, adding a custom pipe to the spaCy model seems to answer the question. I didn't have a real document at hand, but with the string you shared, the code I wrote works (it still needs to be tested on real data).
The idea is to create a spaCy model, add a merge_compounds pipe, and then pass the model via the spacy_model parameter of load_document to process the documents. See the code below.
One caveat: the merged span keeps only the first token's POS tag, which can cause the candidate-selection problems discussed later in this thread.
import pke
import spacy

# merge_compounds is the custom pipe component discussed above.
nlp = spacy.load('en')  # or any model
nlp.add_pipe(merge_compounds)

doc = "self-driving, coast-to-coast, peer-to-peer, driver-assistance, web-service"

e = pke.unsupervised.MultipartiteRank()
e.load_document(doc, spacy_model=nlp)
print(e.sentences[0].stems)  # compounds should appear

e.candidate_selection()
print(e.candidates)  # compounds should now appear
# etc.
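The merge_compounds pipe itself isn't shown in this thread. The hyphen-run detection at its core can be sketched in plain Python (a hypothetical helper, not pke's or the linked answer's actual code; in spaCy, the resulting spans would then be merged inside a doc.retokenize() block):

```python
import re

def compound_spans(tokens):
    """Find runs of WORD - WORD (- WORD ...) to merge into one token.

    Hypothetical helper mirroring what a merge_compounds pipe detects;
    it works on plain strings so no spaCy model is needed to show it.
    Returns (start, end) pairs with end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        # extend the run as long as the pattern WORD - WORD continues
        while (j + 2 < len(tokens) and tokens[j + 1] == "-"
               and re.fullmatch(r"\w+", tokens[j])
               and re.fullmatch(r"\w+", tokens[j + 2])):
            j += 2
        if j > i:
            spans.append((i, j + 1))
            i = j + 1
        else:
            i += 1
    return spans

print(compound_spans(["self", "-", "driving", "car"]))          # [(0, 3)]
print(compound_spans(["coast", "-", "to", "-", "coast", ","]))  # [(0, 5)]
```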
Thank you for the insight! I will try your solution on real text and get back to you.
It works fine, thank you. I only had two deprecation warnings about spans:
/src/keyphrase_simple.py:76: DeprecationWarning: [W013] As of v2.1.0, Span.merge is deprecated. Please use the more efficient and less error-prone Doc.retokenize context manager instead.
span.merge(lemma=''.join(t.lemma_ for t in span))
/src/keyphrase_simple.py:76: DeprecationWarning: [W013] As of v2.1.0, Doc.merge is deprecated. Please use the more efficient and less error-prone Doc.retokenize context manager instead.
span.merge(lemma=''.join(t.lemma_ for t in span))
I have tested one more example and still have an issue on some text. My text is an extract from the Wikipedia article on self-driving cars. The term "self-driving car" is mentioned many times there, but it does not come out of the model (e.g. MultipartiteRank), even with your updated spaCy model. As the simplest experiment, I used a text like this:
self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car,
As output I am getting only:
Selecting candidates key-phrases
Weighting candidates key-phrases
====================
[('car', 1.0)]
But if I remove all dashes from the text and use it like this:
self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car,
I am getting:
Extracting key-phrases
Selecting candidates key-phrases
Weighting candidates key-phrases
====================
[('self', 0.48648582432442083),
('car', 0.32533316424881015),
('self driving car', 0.18818101142676857)]
In this example I used the MultipartiteRank model with the merge_compounds function in the spaCy model.
@Benja1972 I tried the code below and got the expected output; maybe the spacy_model parameter has a typo in your code?
I have spacy v2.1.9 and en-core-web-sm v2.1.0 (checked via python3 -m spacy validate).
# Spacy model with merge_compounds
nlp_m = spacy.load('en')
nlp_m.add_pipe(merge_compounds)
# Vanilla spacy model
nlp = spacy.load('en')
spacy_models = [nlp_m, nlp]
texts = ['self-driving car, self-driving car, self-driving car', 'self driving car, self driving car, self driving car']
for s in spacy_models:
    for t in texts:
        e = pke.unsupervised.MultipartiteRank()
        e.load_document(t, spacy_model=s)
        e.candidate_selection()
        e.candidate_weighting()
        print(e.get_n_best())
# merge compounds, with dash
# [('self-driving car', 1.0)]
# merge compounds, without dash
# [('self', 0.5), ('car', 0.5)]
# vanilla, with dash
# [('self', 0.5), ('car', 0.5)]
# vanilla, without dash
# [('self', 0.5), ('car', 0.5)]
Thank you @ygorg for your efforts. I found what provokes the issue in my code: I limit candidate phrases with pos = {'NOUN', 'PROPN', 'ADJ'}, and that removes "driving" from the list (the same way it removes "self-assembled monolayer"). If I add 'VERB' to the pos set I get the expected result, but when I apply this to a long text I also get single verbs as selected candidates. That is the only inconvenience I have now. Using 'ADV' instead of 'VERB' doesn't work.
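The effect described here can be shown with a toy filter (hypothetical, not pke's actual selection logic): a candidate survives only if every token's POS tag is in the allowed set, so a split "self driving car" dies on the VERB tag of "driving".

```python
def keep_candidate(tagged_phrase, allowed_pos):
    """Toy POS-based candidate filter: keep the phrase only if
    every (token, pos) pair has an allowed POS tag."""
    return all(pos in allowed_pos for _, pos in tagged_phrase)

phrase = [("self", "NOUN"), ("driving", "VERB"), ("car", "NOUN")]
print(keep_candidate(phrase, {"NOUN", "PROPN", "ADJ"}))          # False: 'driving' is a VERB
print(keep_candidate(phrase, {"NOUN", "PROPN", "ADJ", "VERB"}))  # True
```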
Yes, I found the same issue with 'ADV' and 'VERB'. In theory, when merging compounds, the POS assigned to the compound is that of its first token, so "self-driving" is treated as a NOUN in my case.
Thank you! I will check the spaCy version and its effect on this.
Running your code with spaCy v2.3.2 and en_core_web_sm 2.3.1, I am getting these results:
[('car', 1.0)]
[('self', 0.48648582432442106), ('car', 0.3566217296811995), ('self driving car', 0.15689244599437963)]
[('self', 0.5), ('car', 0.5)]
[('self', 0.48648582432442106), ('car', 0.3566217296811995), ('self driving car', 0.15689244599437963)]
I am sure it is a deprecation issue:
Span.merge is deprecated. Please use the more efficient and less error-prone Doc.retokenize context manager instead.
@Benja1972 I updated my answer to use Doc.retokenize. I tried it with the newer versions and it works (in theory).
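For reference, the Doc.retokenize pattern that replaces the deprecated span.merge looks roughly like this (a minimal sketch using spacy.blank and a hand-built Doc, so no model download is needed; the LEMMA value is illustrative):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Build a Doc directly from tokens to mimic how spaCy splits "self-driving".
doc = Doc(nlp.vocab, words=["self", "-", "driving", "car"],
          spaces=[False, False, True, False])

# Deprecated: doc[0:3].merge(lemma="self-driving")
# Current API: merge inside the retokenize context manager.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:3], attrs={"LEMMA": "self-driving"})

print([t.text for t in doc])  # ['self-driving', 'car']
```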
Thank you @ygorg! I will try it and report back.
It works nicely with the updated function. Thank you!
I have a text with many common hyphenated compounds such as "self-driving, coast-to-coast, peer-to-peer, driver-assistance, web-service". When I run TopicRank and MultipartiteRank, I see that they are not captured. I use a stoplist and pos like this.
How can I fix this behavior? I would like these terms to be present in the output. Thank you