Closed davedgd closed 5 years ago
It's a year late but still thank you @davedgd!
Merged via https://github.com/alvations/pywsd/commit/42bdc97b93de67c3464ba288069501e92df76376 since you no longer have the repo on your github account =)
Thank you! No worries about being late; I appreciate your having implemented this. I came up with an easier solution to fix the problem in my code, but this is even better! =)
It would be nice to be able to pass a different tokenizer to the disambiguate function for better compatibility with other tools (e.g., simply splitting on whitespace for pre-tokenized text, or substituting an alternative tokenizer such as Stanford's for NLTK's word_tokenize). This matters because tokenizers apply different internal rules and can produce a different set/number of tokens for the same input, so the token count returned by disambiguate may not line up with the counts from other tools in the pipeline.
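To illustrate the idea, here is a minimal sketch of the tokenizer-injection pattern being requested. Note that `disambiguate_sketch` is a hypothetical stand-in, not pywsd's actual API: the point is only that accepting a tokenizer callable (defaulting to whitespace splitting here) keeps the output aligned with however the text was tokenized upstream.

```python
from typing import Callable, List, Optional, Tuple

def disambiguate_sketch(
    sentence: str,
    tokenizer: Callable[[str], List[str]] = str.split,
) -> List[Tuple[str, Optional[str]]]:
    # Hypothetical sketch: the tokenizer is injected rather than hard-coded,
    # so the number of (token, sense) pairs returned matches whatever tool
    # produced the tokens upstream.
    tokens = tokenizer(sentence)
    # ...per-token sense disambiguation would happen here; we return None
    # as a placeholder sense for each token...
    return [(tok, None) for tok in tokens]

# Whitespace splitting keeps pre-tokenized text intact ("don't" stays one
# token), whereas NLTK's word_tokenize would split it into "do" + "n't":
print(disambiguate_sketch("don't split this"))  # three (token, sense) pairs
```

An alternative tokenizer (e.g., NLTK's `word_tokenize`) could then be passed as the `tokenizer` argument without changing the rest of the pipeline.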