Closed mehmetilker closed 11 months ago
Ah, I think the problem here is that spaCy's matcher currently validates patterns with the lemma attribute by checking if the document is tagged (because the lemmatizer typically uses the part-of-speech tags). However, I can see how this is problematic for lookup-only lemmatizers or solutions from other sources.
You should be able to work around this by setting `doc.is_tagged = True` yourself and tricking spaCy into thinking your `Doc` is tagged.
By the way, I wanted to try `add_special_case`, but the wrapper does not seem to support it: "AttributeError: 'Tokenizer' object has no attribute 'add_special_case'"
Are you calling the method correctly? It's an existing method on spaCy's `Tokenizer` object and it should work as described here: https://spacy.io/api/tokenizer#add_special_case
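For reference, here is a minimal sketch of the call on spaCy's built-in tokenizer (using a blank English pipeline as an assumption). If `nlp.tokenizer` has been replaced with a custom callable, that wrapper won't have this method, which would explain the `AttributeError`:

```python
import spacy
from spacy.symbols import ORTH

# add_special_case is defined on spaCy's built-in Tokenizer class.
# A custom tokenizer callable assigned to nlp.tokenizer won't have it.
nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
doc = nlp("gimme that")
```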
There is no exception when I set `doc.is_tagged = True`, but there are no POS tags either, which makes the custom tokenizer useless for me.
The model I am using does not create a separate token for `"`. For a sentence like `This is "Hello world";`, the tokens are:

```
This
is
"Hello
world"
```

What I want to do is create a separate token for `"` with a custom tokenizer. So in this case, when the custom tokenizer creates a separate token for `"`, that token does not get tagged. Is that why the matcher is failing?
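One way to get `"` split off without writing a whole custom tokenizer is to extend the existing prefix/suffix rules. A sketch, assuming a blank English pipeline (adapt the rule lists to your own model, whose defaults may differ):

```python
import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex

# Sketch: extend the tokenizer's prefix/suffix rules so '"' is always
# split off as its own token, instead of replacing the tokenizer wholesale.
nlp = spacy.blank("en")
prefixes = list(nlp.Defaults.prefixes) + ['"']
suffixes = list(nlp.Defaults.suffixes) + ['"']
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp('This is "Hello world";')
```

Because this modifies the existing tokenizer in place, the rest of the pipeline (tagger, lemmatizer) keeps running as before, so the new `"` tokens still pass through tagging.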
For the second problem: using the same sample code above, I comment out the `nlp.tokenizer = custom_tokenizer(nlp)` part and add:

```python
from spacy.symbols import ORTH

special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
```
While still trying to find a way to solve the problem I wrote about previously, I want to add another case. The previous one was about missing splits; this one is about unnecessary splitting (URL text). I will try to create a matcher rule to merge the pieces back together, but improving the tokenization rules would be better, I guess.

Tokens:

```
Here is a video about it https ://www.youtube.com/watch ?v=i 6 VaOpvIh VQ&feature=youtu.be
```
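The merge idea could be sketched like this: find URLs with a regex over the raw text and merge the overlapping tokens back into one via the retokenizer. This assumes a blank English pipeline and uses a deliberately simplified URL pattern:

```python
import re
import spacy

# Sketch of the "merge back together" approach. The URL regex is a
# simplification; a production pattern would need to be stricter.
nlp = spacy.blank("en")
doc = nlp("Here is a video about it https://www.youtube.com/watch?v=abc")

url_re = re.compile(r"https?://\S+")
with doc.retokenize() as retokenizer:
    for match in url_re.finditer(doc.text):
        # char_span returns None if the match doesn't line up with
        # token boundaries; skip those rather than crash.
        span = doc.char_span(match.start(), match.end())
        if span is not None and len(span) > 1:
            retokenizer.merge(span)
```

That said, fixing the tokenizer itself (e.g. via its `url_match` hook, which recent spaCy versions use to keep URLs as single tokens) is indeed the cleaner route, since the merge runs as a post-processing step on every document.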
Just going through some older issues...
I think this coupling between the tag/lemma checks is no longer a problem in spaCy v3. But please feel free to reopen if you're still running into issues!
I am seeing the following exception when I use the `Tokenizer` as shown below. If I disable the matcher part, there is no exception, but there is no POS tagging this time. I have tried to find info about the exception, but the only thing I could find is this one, and it is unrelated: https://github.com/explosion/spaCy/issues/4100

Is there any other pipeline configuration I have to use related to the `Tokenizer`? I could not see any in the documentation.