LSYS / LexicalRichness

:smile_cat: :speech_balloon: A module to compute textual lexical richness (aka lexical diversity).
http://lexicalrichness.readthedocs.io/
MIT License

no tokenization or preprocessing #2

Closed: niekveldhuis closed this issue 4 years ago

niekveldhuis commented 5 years ago

Dear Lucas,

Thanks for putting this together. I am using your module for text data in Sumerian (an ancient language). This data is already tokenized (and lemmatized) and does not work well with the standard preprocessor. I have adapted your code to replace the use_TextBlob option with a use_tokenizer option, with the default use_tokenizer=False. The default accepts a list of tokens and does no preprocessing; use_tokenizer=True accepts a string and applies the default preprocessing and tokenizing. TextBlob is not useful to me, so I removed it.

I am using this in my Computational Assyriology project (https://github.com/niekveldhuis/compass); the package is currently used in Chapter 3 (very much a work in progress!), and of course I credit you and link to the package's page on PyPI.

Niek
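A minimal sketch of the adaptation described above, assuming a simplified constructor. The class body, the default preprocessing, and the ttr property here are illustrative, not the library's actual implementation:

```python
import re
import string

def preprocess(text):
    """Illustrative default preprocessing: lowercase and strip punctuation."""
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

def tokenize(text):
    """Illustrative default tokenizer: split on word characters."""
    return re.findall(r"\w+", text)

class LexicalRichness:
    def __init__(self, text, use_tokenizer=False):
        if use_tokenizer:
            # Raw string in: apply the default preprocessing and tokenization.
            self.wordlist = tokenize(preprocess(text))
        else:
            # Pre-tokenized (and possibly lemmatized) input: consume as-is,
            # with no preprocessing.
            self.wordlist = list(text)
        self.words = len(self.wordlist)
        self.terms = len(set(self.wordlist))

    @property
    def ttr(self):
        """Type-token ratio: unique terms / total words."""
        return self.terms / self.words
```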

LSYS commented 5 years ago

Hi Professor Niek, sounds like a good idea to include the option to accept a list of tokens without preprocessing. I overlooked this. Happy to hear that the package is useful. Regards, Lucas
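Under the same sketch, calling the option with pre-tokenized input might look like this; the sample tokens and the use_tokenizer parameter follow the hypothetical API above:

```python
# Hypothetical usage of the sketch above: the tokens are already
# lemmatized upstream, so they are consumed as-is with no preprocessing.
tokens = ["lugal", "e2", "mu", "du3", "lugal"]

lex = LexicalRichness(tokens, use_tokenizer=False)  # list in, no preprocessing
print(lex.words)  # 5 tokens
print(lex.terms)  # 4 unique terms
print(lex.ttr)    # 0.8

raw = "The king built the temple."
lex2 = LexicalRichness(raw, use_tokenizer=True)     # string in, default pipeline
print(lex2.wordlist)  # ['the', 'king', 'built', 'the', 'temple']
```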
