boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

Keep unattached combined words in original text #147

Closed Benja1972 closed 4 years ago

Benja1972 commented 4 years ago

I have a text with many common hyphenated compounds such as "self-driving, coast-to-coast, peer-to-peer, driver-assistance, web-service". When I run TopicRank and MultipartiteRank, I see that they are not captured. I use a stoplist and pos set like this:

```python
import string
from nltk.corpus import stopwords

pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')
stoplist.remove('-')
```

How can I fix this behavior? I would like these terms to be present in the output. Thank you.
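
For context, here is roughly how I call the extractor with these settings (a minimal sketch assuming the pke 1.x API, where candidate_selection accepts pos and stoplist; the input path is a placeholder):

```python
import pke

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input='path/to/text.txt', language='en')
extractor.candidate_selection(pos=pos, stoplist=stoplist)  # pos and stoplist defined above
extractor.candidate_weighting()
print(extractor.get_n_best(n=10))
```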

ygorg commented 4 years ago

[Edit 27/10/2020] Updated the merge_compounds code to use Doc.retokenize

Hi! Thanks for highlighting this issue.

This is linked to spaCy's tokenisation.

According to "How to merge words split by hyphen?" and "Spacy - Tokenize quoted string", the lemma of merged spans only keeps the lemma of the first token.

Adding a pipe to the spaCy model seems to answer the question. I didn't have a proper document, but using the string you shared, the code I wrote works (it still needs to be tested with real data).

The idea here is to create a spaCy model, add a pipe to it, and then use the spacy_model parameter of load_document to process the documents. See the code below.

The problems this can generate are discussed further down the thread.

```python
import pke
import spacy

nlp = spacy.load('en')  # or any model
nlp.add_pipe(merge_compounds)

doc = "self-driving, coast-to-coast, peer-to-peer, driver-assistance,web-service"

e = pke.unsupervised.MultipartiteRank()
e.load_document(doc, spacy_model=nlp)
print(e.sentences[0].stems)  # compounds should appear
e.candidate_selection()
print(e.candidates)  # compounds should now appear
# etc ...
```
```python
def merge_compounds(d):
    """Merge compounds into one token.

    A compound is two tokens separated by a hyphen when the tokens are
    right next to the hyphen.

    d (spacy.Doc): Document
    Returns: spacy.Doc

    > [t.text for t in nlp('peer-to-peer-to-peer')]
    ['peer', '-', 'to', '-', 'peer', '-', 'to', '-', 'peer']
    > [t.text for t in merge_compounds(nlp('peer-to-peer-to-peer'))]
    ['peer-to-peer-to-peer']
    """
    # Returns beginning and end offset of a spacy.Token
    offsets = lambda t: (t.idx, t.idx + len(t))

    # Identify the hyphens: a token is a merge point if it is a '-' and the
    # preceding and following tokens sit right next to it (no whitespace).
    # The upper bound is len(d) - 1 so that d[i + 1] stays in range.
    spans = [(i - 1, i + 1) for i in range(len(d))
             if i != 0 and i != len(d) - 1 and d[i].text == '-'
             and offsets(d[i - 1])[1] == offsets(d[i])[0]
             and offsets(d[i + 1])[0] == offsets(d[i])[1]]

    # Merge overlapping spans to account for multi-compound terms
    merged = []
    for b, e in spans:
        # if the last span ends where the current span begins, merge them
        if merged and b == merged[-1][1]:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((b, e))

    # Merge the compounds in the document
    with d.retokenize() as retok:
        for b, e in merged:
            retok.merge(d[b:e + 1], attrs={
                'POS': d[b].pos_,
                'LEMMA': ''.join(t.lemma_ for t in d[b:e + 1])
            })
    return d
```
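
A quick sanity check of the pipe on a single string (a sketch; the exact token output may vary with the model version):

```python
import spacy

nlp = spacy.load('en')  # spaCy v2 shortcut, as above
print([t.text for t in nlp('a self-driving car')])
# something like: ['a', 'self', '-', 'driving', 'car']

nlp.add_pipe(merge_compounds)
print([t.text for t in nlp('a self-driving car')])
# something like: ['a', 'self-driving', 'car']
```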
Benja1972 commented 4 years ago

Thank you for the insight! I will try your solution on real text and get back to you.

Benja1972 commented 4 years ago

It works fine. Thank you. I only had two warnings about span merging:

/src/keyphrase_simple.py:76: DeprecationWarning: [W013] As of v2.1.0, Span.merge is deprecated. Please use the more efficient and less error-prone Doc.retokenize context manager instead.
  span.merge(lemma=''.join(t.lemma_ for t in span))
/src/keyphrase_simple.py:76: DeprecationWarning: [W013] As of v2.1.0, Doc.merge is deprecated. Please use the more efficient and less error-prone Doc.retokenize context manager instead.
  span.merge(lemma=''.join(t.lemma_ for t in span))
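
The warning points at the Doc.retokenize context manager; its equivalent of the deprecated span.merge call would look roughly like this (a self-contained sketch; doc and span stand in for whatever the old code operates on):

```python
import spacy

nlp = spacy.load('en')
doc = nlp('self-driving')  # stand-in for the document in keyphrase_simple.py
span = doc[0:3]            # stand-in for the hyphenated span being merged
with doc.retokenize() as retokenizer:
    retokenizer.merge(span, attrs={'LEMMA': ''.join(t.lemma_ for t in span)})
print([t.text for t in doc])  # should be ['self-driving']
```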
Benja1972 commented 4 years ago

I have tested one more example and still have an issue on some texts. My text is an extract from the Wikipedia article on self-driving cars. The term "self-driving car" is mentioned many times there, but it does not show up in the output of the model (for example MultipartiteRank), even with your update of the spaCy model. I did the simplest experiment and used a text like this:

self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car, self-driving car,

As output I am getting only:

Selecting candidates key-phrases
Weighting candidates key-phrases
====================
[('car', 1.0)]

But if I remove all dashes from the text and use it like this:

self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car, self driving car,

I am getting

Extracting key-phrases
Selecting candidates key-phrases
Weighting candidates key-phrases
====================
[('self', 0.48648582432442083),
 ('car', 0.32533316424881015),
 ('self driving car', 0.18818101142676857)]

In this example I use the MultipartiteRank model with the merge_compounds function added to the spaCy model.

ygorg commented 4 years ago

@Benja1972 I tried the code below and got the expected output; maybe the spacy_model parameter has a typo in your code? I have spaCy v2.1.9 and en-core-web-sm v2.1.0 (checked via python3 -m spacy validate).

```python
# Spacy model with merge_compounds
nlp_m = spacy.load('en')
nlp_m.add_pipe(merge_compounds)

# Vanilla spacy model
nlp = spacy.load('en')

spacy_models = [nlp_m, nlp]

texts = ['self-driving car, self-driving car, self-driving car',
         'self driving car, self driving car, self driving car']

for s in spacy_models:
    for t in texts:
        e = pke.unsupervised.MultipartiteRank()
        e.load_document(t, spacy_model=s)
        e.candidate_selection()
        e.candidate_weighting()
        print(e.get_n_best())

# merge compounds, with dash
# [('self-driving car', 1.0)]
# merge compounds, without dash
# [('self', 0.5), ('car', 0.5)]
# vanilla, with dash
# [('self', 0.5), ('car', 0.5)]
# vanilla, without dash
# [('self', 0.5), ('car', 0.5)]
```
Benja1972 commented 4 years ago

Thank you @ygorg for your efforts. I found what provokes the issue in my code. I limit candidate phrases to pos = {'NOUN', 'PROPN', 'ADJ'}, and this removes "driving" from the list (the same way it removes "self-assembled monolayer" from the list). If I add "VERB" to the pos definition I get the expected result, but if I apply it to a long text I also get single verbs as selected candidates. This is the only inconvenience I have now; "ADV" instead of "VERB" doesn't work.
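
For reference, the adjusted selection looks roughly like this (a sketch reusing doc and the nlp_m model from the earlier snippets; adding VERB keeps "driving" but also lets single verbs through):

```python
pos = {'NOUN', 'PROPN', 'ADJ', 'VERB'}  # VERB added so "driving" is not filtered out

e = pke.unsupervised.MultipartiteRank()
e.load_document(doc, spacy_model=nlp_m)  # nlp_m: spaCy model with merge_compounds
e.candidate_selection(pos=pos)
e.candidate_weighting()
print(e.get_n_best())
```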

ygorg commented 4 years ago

Yes, I found the same issue with "ADV" and "VERB", but in theory, when merging compounds, the POS assigned to the compound is that of the first token. So "self-driving" is treated as a "NOUN" in my case.
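
A quick way to check which POS the compound actually gets (a sketch; nlp_m is the model with merge_compounds added, as above):

```python
doc = nlp_m('self-driving car')
print([(t.text, t.pos_) for t in doc])
# merge_compounds assigns d[b].pos_, the POS of the first token,
# so this may print e.g. [('self-driving', 'NOUN'), ('car', 'NOUN')]
```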

Benja1972 commented 4 years ago

Thank you! I will check the spaCy version and its effect on this.

Benja1972 commented 4 years ago

Running your code with spaCy v2.3.2 and en_core_web_sm 2.3.1,

I am getting these results:

[('car', 1.0)]
[('self', 0.48648582432442106), ('car', 0.3566217296811995), ('self driving car', 0.15689244599437963)]
[('self', 0.5), ('car', 0.5)]
[('self', 0.48648582432442106), ('car', 0.3566217296811995), ('self driving car', 0.15689244599437963)]

I am sure it is the deprecation issue:

Span.merge is deprecated. Please use the more efficient and less error-prone Doc.retokenize context manager instead.
ygorg commented 4 years ago

@Benja1972 I updated my answer to use Doc.retokenize; I tried it with the newer versions and it works (in theory).

Benja1972 commented 4 years ago

Thank you @ygorg! I will try it and report back.

Benja1972 commented 4 years ago

It works nicely with the updated function. Thank you!