explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Matcher result problem #5

Closed. mehmetilker closed this issue 5 years ago.

mehmetilker commented 5 years ago

I have successfully run the parser and the POS/dependency results are the same as with stanfordnlp. But when I run the Matcher over the tokens, I see unexpected matches. For example, I should only get PROPN+ tokens, but I see verb matches as well, and even empty matches...

I have gone through the code here, but I could not see anything related to the Matcher: https://github.com/explosion/spacy-stanfordnlp/blob/master/spacy_stanfordnlp/language.py

BTW, the default installation comes with spacy-nightly 6a, and I have tried 9a as well.

```python
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
from spacy.matcher import Matcher

config = {
    'processors': 'tokenize,pos,lemma',  # mwt, depparse left out
    'lang': 'en',  # language code for the language to build the Pipeline in
}

snlp = stanfordnlp.Pipeline(**config)
nlp = StanfordNLPLanguage(snlp)

matcher = Matcher(nlp.vocab)
matcher.add("ProperNounRule", None, *[
    # [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
    # [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
    [{'POS': 'PROPN', 'OP': '+'}],
    # [{'POS': 'NOUN', 'OP': '+'}],
])

text = "US was among the first countries to recognise opposition leader Juan Guaido as legitimate leader, arguing President Nicholas Maduro's May 2018 re-election was a sham. Maduro accuses Guaido of being a coup-mongering puppet for US President Trump."

# Normalise curly quotes so the tokenizer sees plain ASCII quotes.
text = text.replace("“", "\"").replace("”", "\"").replace("’", "'")
doc = nlp(text)

matches = matcher(doc)
print('\n>>> Match result')
for match_id, start, end in matches:
    label = doc.vocab.strings[match_id]
    span = doc[start:end]
    print(label, ":", str(span), ">", start, ":", end)
```

The lines in bold are the expected result:

**ProperNounRule : US > 0 : 1**
**ProperNounRule : Juan > 10 : 11**
**ProperNounRule : Juan Guaido > 10 : 12**
ProperNounRule : as > 11 : 12
**ProperNounRule : Nicholas > 17 : 18**
**ProperNounRule : Nicholas Maduro > 17 : 19**
ProperNounRule : 's > 18 : 19
**ProperNounRule : Nicholas Maduro's May > 17 : 20**
ProperNounRule : 2018 re-election > 18 : 20
ProperNounRule : was > 19 : 20
ProperNounRule : sham > 21 : 22
ProperNounRule : a > 28 : 29
ProperNounRule : - > 30 : 31
ProperNounRule :  > 39 : 40
ProperNounRule :  > 39 : 41
ines commented 5 years ago

Thanks for the report! This is very mysterious and there's definitely something up here. But I just tried your exact code and I can't reproduce this 🤔

I'm getting the following results:

```
ProperNounRule : US > 0 : 1
ProperNounRule : Juan > 10 : 11
ProperNounRule : Juan Guaido > 10 : 12
ProperNounRule : Guaido > 11 : 12
ProperNounRule : President > 17 : 18
ProperNounRule : President Nicholas > 17 : 19
ProperNounRule : Nicholas > 18 : 19
ProperNounRule : President Nicholas Maduro > 17 : 20
ProperNounRule : Nicholas Maduro > 18 : 20
ProperNounRule : Maduro > 19 : 20
ProperNounRule : May > 21 : 22
ProperNounRule : Maduro > 28 : 29
ProperNounRule : Guaido > 30 : 31
ProperNounRule : US > 39 : 40
ProperNounRule : US President > 39 : 41
ProperNounRule : President > 40 : 41
ProperNounRule : US President Trump > 39 : 42
ProperNounRule : President Trump > 40 : 42
ProperNounRule : Trump > 41 : 42
```

I tried it with both versions of StanfordNLP and both nightlies. I also tried it in the current stable spaCy version (where the operators and quantifiers implementation is different and inconsistent). But even in that scenario, I get correct matches – just fewer.
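
As a side note on why the overlaps appear at all: with `'OP': '+'`, the Matcher reports every contiguous sub-span that satisfies the pattern, not just the longest one. A minimal sketch of this behaviour, assuming a plain English model that tags both names as PROPN:

```python
import spacy
from spacy.matcher import Matcher

# Assumes en_core_web_sm is installed and tags "Juan" and "Guaido" as PROPN.
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("PN", None, [{"POS": "PROPN", "OP": "+"}])

doc = nlp("Juan Guaido spoke.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "Juan", "Juan Guaido", "Guaido"
```

So seeing "Juan", "Juan Guaido" and "Guaido" all reported is expected behaviour, not a bug.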

mehmetilker commented 5 years ago

My mistake, sorry for the confusion.

I was trying to merge the match results, but I removed the merge code while pasting here for clarity.

Apparently the match results change while merging spans. I will try to find a way to merge the tokens properly...

```python
matches = matcher(doc)
print('\n>>> Match result')
for match_id, start, end in matches:
    label = doc.vocab.strings[match_id]
    span = doc[start:end]
    print(label, ":", str(span), ">", start, ":", end)
    # span.merge()  # old API
    if end - start > 1:
        lemmas = [t.lemma_ for t in span]
        lemmas_text = "_".join(lemmas)
        # Merging here, inside the loop, shifts the offsets of every
        # match that comes after this one.
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span, attrs={'LEMMA': lemmas_text})
```
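
To make the failure mode concrete, here is a minimal sketch (it uses a blank spaCy pipeline for brevity; the effect is the same with StanfordNLPLanguage): every merge shrinks the doc, so `(start, end)` offsets computed before the merge point at the wrong tokens afterwards.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Juan Guaido met President Nicholas Maduro")
nicholas_maduro = (4, 6)  # offsets recorded before any merging

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])  # merge "Juan Guaido" into one token

# The doc shrank by one token, so the old offsets now point elsewhere:
print(doc[nicholas_maduro[0]:nicholas_maduro[1]])  # "Maduro", not "Nicholas Maduro"
```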
ines commented 5 years ago

@mehmetilker The new retokenizer API is really optimised for bulk merging and will take care of aligning the merges automatically (unlike the old span.merge, which made this difficult). So ideally, you want to collect your spans first and then merge them all at once.

Something like this should work:

```python
spans_to_merge = []
for match_id, start, end in matches:
    span = doc[start:end]
    if end - start > 1:
        spans_to_merge.append(span)

with doc.retokenize() as retokenizer:
    for span in spans_to_merge:
        lemmas = [t.lemma_ for t in span]
        lemmas_text = "_".join(lemmas)
        retokenizer.merge(span, attrs={'LEMMA': lemmas_text})
```
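
One caveat: with `'OP': '+'` the multi-token matches can overlap (e.g. "President Nicholas" and "Nicholas Maduro"), and `retokenize()` refuses to merge non-disjoint spans. A sketch of one way around this, assuming a spaCy version that ships `spacy.util.filter_spans` (newer releases do), which keeps only the longest non-overlapping spans:

```python
from spacy.util import filter_spans  # available in newer spaCy versions

# Drop single-token matches, then keep only the longest non-overlapping spans.
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
spans_to_merge = filter_spans([span for span in spans if len(span) > 1])

with doc.retokenize() as retokenizer:
    for span in spans_to_merge:
        lemmas_text = "_".join(t.lemma_ for t in span)
        retokenizer.merge(span, attrs={'LEMMA': lemmas_text})
```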
mehmetilker commented 5 years ago

Great. Thank you for the example.