explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Error when Tokenizer used #22

Closed mehmetilker closed 11 months ago

mehmetilker commented 4 years ago

I am seeing the following exception when I use the Tokenizer as shown below.

If I remove the matcher part, there is no exception, but then there is no POS tagging either.

I have tried to find information about the exception, but the only thing I could find is this issue, and it is unrelated: https://github.com/explosion/spaCy/issues/4100

Is there any other pipeline configuration I have to use together with the Tokenizer? I could not see any in the documentation.

By the way, I wanted to try add_special_case, but I guess the wrapper does not support it: "AttributeError: 'Tokenizer' object has no attribute 'add_special_case'"

Traceback (most recent call last):
  File "c:/x/_dev/_temp/pro/playground/spacydemo/a.py", line 41, in <module>
    matches = matcher(doc)
  File "matcher.pyx", line 224, in spacy.matcher.matcher.Matcher.__call__
ValueError: [E155] The pipeline needs to include a tagger in order to use Matcher or PhraseMatcher with the attributes POS, TAG, or LEMMA. Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) instead of list(nlp.tokenizer.pipe()).

import logging
import re
import stanfordnlp
from spacy.matcher import Matcher
from spacy_stanfordnlp import StanfordNLPLanguage
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

processors = 'tokenize,pos,lemma'
config = {
    #'tokenize_pretokenized': True, #!!!  the text will be interpreted as already tokenized on white space and sentence split by newlines.
    'processors': processors,  # mwt, depparse
    'lang': 'en',  # Language code for the language to build the Pipeline in
}
snlp = stanfordnlp.Pipeline(**config)
nlp = StanfordNLPLanguage(snlp)

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)

text = "Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.'"

matcher = Matcher(nlp.vocab)
matcher.add("COUNTRY", None, *[
    [{'LEMMA': 'practice'}],
])

doc = nlp(text)

matches = matcher(doc)
for (match_id, start, end) in matches:
    label = doc.vocab.strings[match_id]
    print(label, start, end, doc[start:end])

print(doc)
for token in doc:
    #print("\t\t", token.text, "\t\t", token.lemma_, "\t\t", token.tag_, "\t\t", token.pos_)
    print(f"\t {token.text:{20}} - {token.lemma_:{15}} - {token.tag_:{5}} - {token.pos_:{5}}")
ines commented 4 years ago

Ah, I think the problem here is that spaCy's matcher currently validates patterns with the lemma attribute by checking if the document is tagged (because the lemmatizer typically uses the part-of-speech tags). However, I can see how this is problematic for lookup-only lemmatizers or solutions from other sources.

You should be able to work around this by setting doc.is_tagged = True yourself and tricking spaCy into thinking your Doc is tagged.
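
For example, a minimal sketch with a blank spaCy v2 pipeline (not the StanfordNLP wrapper, just to illustrate the flag; whether the LEMMA pattern then actually matches still depends on what set the lemmas):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no tagger, so LEMMA patterns normally trigger E155
matcher = Matcher(nlp.vocab)
matcher.add("COUNTRY", None, [{"LEMMA": "practice"}])

doc = nlp("the practice of medicine")
doc.is_tagged = True   # pretend the Doc is tagged so the Matcher's check passes
matches = matcher(doc)  # no E155 raised now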

By the way, I wanted to try add_special_case but wrapper does not support it I guess: "AttributeError: 'Tokenizer' object has no attribute 'add_special_case'"

Are you calling the method correctly? It's an existing method on spaCy's Tokenizer object and it should work as described here: https://spacy.io/api/tokenizer#add_special_case
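
Roughly like this, per the linked docs (a sketch against a plain spaCy v2 pipeline, where nlp.tokenizer is spaCy's own Tokenizer class):

import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")  # plain spaCy pipeline; nlp.tokenizer is spacy.tokenizer.Tokenizer
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # expected: ['gim', 'me', 'that']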

mehmetilker commented 4 years ago

There is no exception when I set doc.is_tagged = True, but there are no POS tags either, which makes the custom tokenizer useless for me.

The model I am using does not create a separate token for ". For a sentence like: This is "Hello world" the tokens are: This is "Hello world" (the quote stays attached). What I want to do is create a separate token for " with the custom tokenizer. But in that case, when the custom tokenizer creates a separate token for ", it does not get tagged, and then the matcher fails?

For the second problem: using the same sample code above, I comment out the nlp.tokenizer = custom_tokenizer(nlp) part and add:

from spacy.attrs import ORTH

special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
mehmetilker commented 4 years ago

While I am still trying to find a way to solve the problem I described previously, I want to add another case. The previous one was about improving splitting; this one is about unnecessary splitting (of URL text).

I will try to create a matcher rule to merge the pieces back together, but I guess improving the tokenization rules would be better. See the sketch after the token output below.

Tokens: Here is a video about it https ://www.youtube.com/watch ?v=i 6 VaOpvIh VQ&feature=youtu.be
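
One direction I might try, keeping the custom spaCy Tokenizer from the first script (a rough sketch, not tested against the StanfordNLP models): pass a URL pattern as token_match instead of token_match=None, so that whitespace-delimited URLs are kept as single tokens.

import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.,?:;‘’`“”"'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # simple illustrative URL pattern; spaCy's built-in URL_PATTERN is far more complete
    url_re = re.compile(r'''https?://\S+''')
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_re.match)  # substrings matching this are kept as one token

nlp.tokenizer = custom_tokenizer(nlp)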

adrianeboyd commented 11 months ago

Just going through some older issues...

I think this coupling between the tag/lemma checks is no longer a problem in spaCy v3.

But please feel free to reopen if you're still running into issues!