Closed by abdullah-alnahas 2 years ago
Hi @abdullah-alnahas
Thanks for sharing your code and results here. One very important point is adjusting the Tokenizer so that it produces tokens matching the original dataset (the gold standard), or bypassing the Tokenizer entirely and feeding in the tokens from the test dataset. That way you can align the output token by token, which gives you a lemma-by-lemma comparison, roughly as in the sketch below.
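For example, one way to push the gold tokens through Spark NLP is to join them with single spaces and then tokenize on whitespace, so the produced tokens line up one-to-one with the gold ones. A minimal sketch, not the benchmark code from this thread: the whitespace pattern on `RegexTokenizer` and the `lemma_antbnc` model name are only illustrative, and you would pick the pretrained model that matches the treebank's language.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RegexTokenizer, LemmatizerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

# gold_sentences: lists of gold tokens, e.g. read from the UD 2.5 .conllu test file
gold_sentences = [["The", "children", "were", "playing", "."]]
df = spark.createDataFrame(
    [(" ".join(tokens),) for tokens in gold_sentences], ["text"]
)

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Split strictly on whitespace so the gold token boundaries are preserved
tokenizer = (
    RegexTokenizer()
    .setInputCols(["document"])
    .setOutputCol("token")
    .setPattern("\\s+")
)

# Example English model; substitute the lemmatizer trained for the target treebank
lemmatizer = (
    LemmatizerModel.pretrained("lemma_antbnc", "en")
    .setInputCols(["token"])
    .setOutputCol("lemma")
)

pipeline = Pipeline(stages=[document, tokenizer, lemmatizer])
result = pipeline.fit(df).transform(df)
result.selectExpr("lemma.result").show(truncate=False)
```

Joining on a single space only works if none of the gold tokens themselves contain whitespace, which holds for most UD treebanks.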
I am going to work on the alignments and keep this thread up to date. Once we complete this, I will include it in the documentation for future reference, so thanks again for your contribution.
I am comparing the performance of the most popular lemmatization tools. I have found benchmark results for Stanza, Trankit, and spaCy on Universal Dependencies version 2.5. However, I couldn't find anything related to Spark NLP. Could you please point me to it if such a benchmark has already been done?
I have tried to do it myself, and I got an aligned accuracy of ~78% (I am attaching the code and results below). I have a few questions and would appreciate your input.
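Concretely, once the tokens line up one to one, the aligned accuracy I report boils down to the fraction of positions where the predicted lemma equals the gold lemma. A rough sketch of that computation, purely illustrative and not the attached script:

```python
def aligned_lemma_accuracy(gold_lemmas, pred_lemmas):
    """Fraction of aligned tokens whose predicted lemma equals the gold lemma."""
    if len(gold_lemmas) != len(pred_lemmas):
        raise ValueError("lemma sequences must already be aligned token by token")
    if not gold_lemmas:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_lemmas, pred_lemmas))
    return correct / len(gold_lemmas)

# Example: 4 of 5 lemmas match -> 0.8
print(aligned_lemma_accuracy(
    ["the", "child", "be", "play", "."],
    ["the", "child", "were", "play", "."],
))
```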