Closed hammad26 closed 5 years ago
Thanks for the report! It looks like even though the Doc
object is constructed from an array that includes the lemmas, those lemmas get overwritten internally by the English lookup table (which ships with the language data). So this is probably a bug in spaCy.
Maybe this wrapper should have an option to disable spaCy's underlying language data. It's nice if you want to use attributes like token.like_num
, but it can also cause side effects like this one. If the wrapper constructed a blank Language
class instead, this wouldn't happen.
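For illustration, here's a minimal sketch of the idea (the values are made up, and this is not the wrapper's actual code): under a bare Language class there are no English lookup tables, so lemmas set manually on a Doc stay as set.

```python
from spacy.language import Language
from spacy.tokens import Doc

# A bare Language class carries no English-specific lemma lookup table,
# so lemmas we assign to the Doc are not overwritten by language data.
nlp = Language()
doc = Doc(nlp.vocab, words=["he", "ran"])
doc[1].lemma_ = "run"  # stays as set; no lookup table interferes
print(doc[1].lemma_)
```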
Edit: I've added a separate step that sets the lemmas last. It turns out they were automatically overwritten when the POS tags were added, based on spaCy's lemmatization rules. Setting them afterwards prevents this and keeps the predicted lemmas.
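The order matters because assigning a token's tag can trigger spaCy's rule-based lemmatization and replace a lemma that was already set. A hedged sketch of the workaround (variable names and annotation values are illustrative, not taken from the wrapper's source):

```python
from spacy.language import Language
from spacy.tokens import Doc

nlp = Language()
doc = Doc(nlp.vocab, words=["better"])

# Predicted annotations from the external pipeline (illustrative values)
tags = ["RBR"]
lemmas = ["well"]

# Set the tags first ...
for token, tag in zip(doc, tags):
    token.tag_ = tag

# ... and the lemmas afterwards, so they can't be clobbered by any
# rule-based lemmatization that fires on tag assignment.
for token, lemma in zip(doc, lemmas):
    token.lemma_ = lemma

print(doc[0].lemma_)
```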
So, when a language has models in both spaCy and StanfordNLP, how will the results turn out?
If you're using this wrapper, you're not using spaCy's models, so you won't see any of spaCy's predictions. The pipeline will also be empty, so none of spaCy's components that predict something will run (and they couldn't run, because no model weights are loaded).
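You can check the empty-pipeline claim yourself. As a quick sketch (using a blank English pipeline here as a stand-in for the wrapper's Language object, which is my assumption):

```python
import spacy

# A blank pipeline has no trained components, so nothing in spaCy
# will predict (or overwrite) tags, parses, or lemmas when you call nlp().
nlp = spacy.blank("en")
print(nlp.pipe_names)  # empty list: no components registered
```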
Using stanfordnlp, the lemma results on an input are:
Now, using the latest wrapper provided by spaCy, spacy-stanfordnlp, I get the following results:
So, it looks like spaCy is given priority (you can see this for the words "he" and "better"). So, when a language has models in both spaCy and StanfordNLP, how will the results turn out? Can you provide full details on how the linguistic features are affected in this case?