Closed by mehmetilker 5 years ago
Thanks for the report! This is very mysterious and there's definitely something up here. But I just tried your exact code and I can't reproduce this 🤔
I'm getting the following results:
ProperNounRule : US > 0 : 1
ProperNounRule : Juan > 10 : 11
ProperNounRule : Juan Guaido > 10 : 12
ProperNounRule : Guaido > 11 : 12
ProperNounRule : President > 17 : 18
ProperNounRule : President Nicholas > 17 : 19
ProperNounRule : Nicholas > 18 : 19
ProperNounRule : President Nicholas Maduro > 17 : 20
ProperNounRule : Nicholas Maduro > 18 : 20
ProperNounRule : Maduro > 19 : 20
ProperNounRule : May > 21 : 22
ProperNounRule : Maduro > 28 : 29
ProperNounRule : Guaido > 30 : 31
ProperNounRule : US > 39 : 40
ProperNounRule : US President > 39 : 41
ProperNounRule : President > 40 : 41
ProperNounRule : US President Trump > 39 : 42
ProperNounRule : President Trump > 40 : 42
ProperNounRule : Trump > 41 : 42
I tried it with both versions of StanfordNLP and both nightlies. I also tried it in the current stable spaCy version (where the operators and quantifiers implementation is different and inconsistent). But even in that scenario, I get correct matches – just fewer.
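For context, the overlapping spans in the output above are exactly what a quantified pattern produces. Assuming the `ProperNounRule` pattern is something like `[{"POS": "PROPN", "OP": "+"}]` (the pattern itself isn't shown in this thread), the `+` operator yields every contiguous sub-span of a proper-noun run as a separate match, not just the maximal run. A plain-Python sketch of that behaviour, no spaCy needed:

```python
# Sketch: a pattern like [{"POS": "PROPN", "OP": "+"}] matches every
# contiguous sub-span of PROPN tokens, not only the longest run.
def propn_plus_matches(pos_tags):
    """Enumerate all (start, end) spans consisting solely of PROPN tags."""
    matches = []
    for start in range(len(pos_tags)):
        for end in range(start + 1, len(pos_tags) + 1):
            if all(tag == "PROPN" for tag in pos_tags[start:end]):
                matches.append((start, end))
    return matches

# "US President Trump" tagged as three PROPNs yields 3*(3+1)/2 = 6 spans,
# mirroring the six matches at 39:40, 39:41, 40:41, 39:42, 40:42, 41:42 above.
print(propn_plus_matches(["PROPN", "PROPN", "PROPN"]))
```

This is why a run of three proper nouns produces six matches in the output above.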
My mistake, sorry for the confusion.
I was trying to merge the match results, but I removed the merge code while pasting it here for clarity.
Apparently the match results change while merging spans. I will try to find a way to merge the tokens properly.
matches = matcher(doc)
print('\n>>> Match result')
for (match_id, start, end) in matches:
    label = doc.vocab.strings[match_id]
    span = doc[start:end]
    print(label, ":", str(span), ">", start, ":", end)
    # span.merge()
    if end - start > 1:
        lemmas = [t.lemma_ for t in span]
        lemmas_text = "_".join(lemmas)
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span, attrs={'LEMMA': lemmas_text})
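The root of the changing results: merging a span inside the loop shrinks the `Doc`, so the `(start, end)` offsets recorded for the remaining matches no longer point at the right tokens. A minimal sketch with a plain list standing in for the `Doc` (token texts and offsets here are illustrative, loosely based on the matches above):

```python
# Sketch of the index-shift problem: merging a span removes tokens, so
# offsets recorded for later matches become stale.
tokens = ["Juan", "Guaido", "met", "Nicholas", "Maduro", "."]
matches = [(0, 2), (3, 5)]  # "Juan Guaido", "Nicholas Maduro"

# Merge the first match in place, as merging inside the loop would.
start, end = matches[0]
tokens[start:end] = ["_".join(tokens[start:end])]

# The second match's offsets now point one token too far to the right.
print(tokens[3:5])  # ['Maduro', '.'] instead of "Nicholas Maduro"
```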
@mehmetilker The new retokenizer API is really optimised for bulk merging and will take care of aligning the merges automatically (unlike the old span.merge, which made this difficult). So ideally, you want to collect your spans first and then merge them all at once.
Something like this should work:
spans_to_merge = []
for match_id, start, end in matches:
    span = doc[start:end]
    if end - start > 1:
        spans_to_merge.append(span)

with doc.retokenize() as retokenizer:
    for span in spans_to_merge:
        lemmas = [t.lemma_ for t in span]
        lemmas_text = "_".join(lemmas)
        retokenizer.merge(span, attrs={'LEMMA': lemmas_text})
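One caveat worth noting: the matches in this thread overlap ("Juan", "Juan Guaido", "Guaido"), and the retokenizer cannot merge overlapping spans. A common approach is to keep only the longest non-overlapping spans before merging; newer spaCy versions ship this as `spacy.util.filter_spans`, and a plain-Python sketch of the same idea over `(start, end)` tuples:

```python
# Sketch: keep the longest spans, dropping any that overlap one already kept
# (prefer earlier spans on equal length), then return them in document order.
def filter_spans(spans):
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)

# The three overlapping "Juan Guaido" matches collapse to the longest one.
print(filter_spans([(10, 11), (10, 12), (11, 12)]))  # [(10, 12)]
```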
Great, thank you for the example.
I have successfully run the parser and I see the same POS/dependency results as with stanfordnlp. But when I run the Matcher over the tokens, I see unexpected matches. For example, I should get only PROPN+ tokens, but I see verb matches as well, and an empty match...
I have gone through the code here but could not find anything related to the Matcher: https://github.com/explosion/spacy-stanfordnlp/blob/master/spacy_stanfordnlp/language.py
BTW, the default installation comes with spacy-nightly 6a, and I have tried it with 9a as well.
The bold items are the expected results.