GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Users can use their own tokenizer. #139

Open dafajon opened 4 years ago

dafajon commented 4 years ago

Users should be able to feed their pre-tokenized text as List[str] or List[List[str]] to Doc with an is_tokenized flag.
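
A hypothetical sketch of the requested entry point is below. The is_tokenized flag and pre-tokenized input are the proposal of this issue, not the current sadedegel API; only the from sadedegel import Doc import is assumed.

```python
# Hypothetical usage of the proposed interface; neither the is_tokenized
# flag nor pre-tokenized input is part of the current sadedegel API.
from sadedegel import Doc  # assumed import path

# One inner list of tokens per sentence: List[List[str]]
sentences = [
    ["Merhaba", "dünya", "."],
    ["Bugün", "hava", "çok", "güzel", "."],
]

d = Doc(sentences, is_tokenized=True)  # proposed flag from this issue
```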

dafajon commented 4 years ago

This issue will be addressed only at the sentence level. A fully tokenized document has already lost its sentence-boundary information, which is essential for building a Doc object from the input.

A superficial token-level workaround would be to join all tokens with whitespace and build the Doc object from the resulting string, as sketched below.
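
For illustration, that workaround could look like this (Doc(text) construction is assumed; note that joining on whitespace is lossy around punctuation and never recovers the original spacing):

```python
# Sketch of the whitespace-join workaround described above.
from sadedegel import Doc  # assumed import path

tokens = ["Merhaba", "dünya", ".", "Bugün", "hava", "güzel", "."]
text = " ".join(tokens)  # "Merhaba dünya . Bugün hava güzel ."
d = Doc(text)  # sentence boundaries must be re-detected from the joined text
```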

For now, the PR will address initializing a Doc object with a list of pre-split sentences given as List[str].
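
A minimal sketch of that scope, assuming the flag proposed above (names are illustrative, not a merged API):

```python
# Pre-split sentences given as List[str]; flag name follows the proposal above.
sentences = ["Merhaba dünya.", "Bugün hava çok güzel."]
d = Doc(sentences, is_tokenized=True)
```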

husnusensoy commented 3 years ago

Is there a PR for this? You are essentially asking for something similar to from_sentence on Doc, e.g. a from_token?

dafajon commented 3 years ago

This was low priority for me, so I did not work on a PR for it. My initial thought was that a user would supply a list of tokens (List[str]) in which sentence boundaries are explicitly marked with an <eos> or </s> token. The Doc object would then be constructed via from_tokens, along the lines sketched below.
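
A rough sketch of that boundary-marker scheme; split_on_eos is a hypothetical helper, and the marker string is whatever the user chooses:

```python
# Recover sentences from a flat token list with explicit boundary markers.
from typing import List

EOS = "<eos>"  # user-chosen sentence boundary marker, e.g. "<eos>" or "</s>"

def split_on_eos(tokens: List[str]) -> List[List[str]]:
    """Group a flat token stream into sentences at each EOS marker."""
    sentences, current = [], []
    for tok in tokens:
        if tok == EOS:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # trailing sentence without a closing marker
        sentences.append(current)
    return sentences

tokens = ["Merhaba", "dünya", ".", EOS, "Bugün", "hava", "güzel", ".", EOS]
assert split_on_eos(tokens) == [
    ["Merhaba", "dünya", "."],
    ["Bugün", "hava", "güzel", "."],
]
```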

husnusensoy commented 3 years ago

askarbozcan commented 2 years ago

Note to self: the Tokenizer interface should be easily extendable, so that users can plug in their own custom tokenizers if they so desire (see the sketch below).
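
One common way to make such an interface extendable is a base class that auto-registers subclasses, so a user-defined tokenizer becomes available by name. The sketch below is illustrative only; class and registry names are hypothetical and do not reflect sadedegel's actual internals.

```python
# Illustrative auto-registering tokenizer interface (hypothetical names).
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Type

class WordTokenizer(ABC):
    """Base class a user-supplied tokenizer would subclass."""
    registry: Dict[str, Type["WordTokenizer"]] = {}

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            WordTokenizer.registry[name] = cls  # register under a public name

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...

class WhitespaceTokenizer(WordTokenizer, name="whitespace"):
    """Trivial custom tokenizer: split on whitespace."""
    def tokenize(self, text: str) -> List[str]:
        return text.split()

tok = WordTokenizer.registry["whitespace"]()
assert tok.tokenize("Merhaba dünya .") == ["Merhaba", "dünya", "."]
```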