GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Users can use their own tokenizer. #139

Open dafajon opened 4 years ago

dafajon commented 4 years ago

Users should be able to feed their pre-tokenized text as List[str] or List[List[str]] to Doc with an is_tokenized flag.
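
A hypothetical sketch of the requested entry point is below. The is_tokenized flag and pre-tokenized input are the proposal of this issue, not the current sadedegel API; only the from sadedegel import Doc import is assumed.

```python
# Hypothetical usage of the proposed interface; neither the is_tokenized
# flag nor pre-tokenized input is part of the current sadedegel API.
from sadedegel import Doc  # assumed import path

# One inner list of tokens per sentence: List[List[str]]
sentences = [
    ["Merhaba", "dünya", "."],
    ["Bugün", "hava", "çok", "güzel", "."],
]

d = Doc(sentences, is_tokenized=True)  # proposed flag from this issue
```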

dafajon commented 4 years ago

This issue will be addressed only at the sentence level. A fully tokenized document has already lost its sentence-boundary information, which is essential for building a Doc object from the input.

A superficial token-level workaround would be to join all tokens with whitespace and build the Doc object from the resulting string, as sketched below.
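
For illustration, that workaround could look like this (Doc(text) construction is assumed; note that joining on whitespace is lossy around punctuation and never recovers the original spacing):

```python
# Sketch of the whitespace-join workaround described above.
from sadedegel import Doc  # assumed import path

tokens = ["Merhaba", "dünya", ".", "Bugün", "hava", "güzel", "."]
text = " ".join(tokens)  # "Merhaba dünya . Bugün hava güzel ."
d = Doc(text)  # sentence boundaries must be re-detected from the joined text
```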

For now, the PR will address initializing a Doc object with a list of pre-split sentences given as List[str].
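
A minimal sketch of that scope, assuming the flag proposed above (names are illustrative, not a merged API):

```python
# Pre-split sentences given as List[str]; flag name follows the proposal above.
sentences = ["Merhaba dünya.", "Bugün hava çok güzel."]
d = Doc(sentences, is_tokenized=True)
```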

husnusensoy commented 3 years ago

Is there a PR for this? You are essentially asking for something similar to from_sentence on Doc, e.g. a from_token?

dafajon commented 3 years ago

This was low priority for me, so I did not work on a PR for it. My initial thought was that a user would supply a list of tokens (List[str]) in which sentence boundaries are explicitly marked with an <eos> or </s> token. The Doc object would then be constructed via from_tokens, along the lines sketched below.
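
A rough sketch of that boundary-marker scheme; split_on_eos is a hypothetical helper, and the marker string is whatever the user chooses:

```python
# Recover sentences from a flat token list with explicit boundary markers.
from typing import List

EOS = "<eos>"  # user-chosen sentence boundary marker, e.g. "<eos>" or "</s>"

def split_on_eos(tokens: List[str]) -> List[List[str]]:
    """Group a flat token stream into sentences at each EOS marker."""
    sentences, current = [], []
    for tok in tokens:
        if tok == EOS:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # trailing sentence without a closing marker
        sentences.append(current)
    return sentences

tokens = ["Merhaba", "dünya", ".", EOS, "Bugün", "hava", "güzel", ".", EOS]
assert split_on_eos(tokens) == [
    ["Merhaba", "dünya", "."],
    ["Bugün", "hava", "güzel", "."],
]
```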

husnusensoy commented 3 years ago

askarbozcan commented 2 years ago

Note to self: the Tokenizer interface should be easily extendable, so that users can plug in their own custom tokenizers if they so desire (see the sketch below).
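
One common way to make such an interface extendable is a base class that auto-registers subclasses, so a user-defined tokenizer becomes available by name. The sketch below is illustrative only; class and registry names are hypothetical and do not reflect sadedegel's actual internals.

```python
# Illustrative auto-registering tokenizer interface (hypothetical names).
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Type

class WordTokenizer(ABC):
    """Base class a user-supplied tokenizer would subclass."""
    registry: Dict[str, Type["WordTokenizer"]] = {}

    def __init_subclass__(cls, name: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            WordTokenizer.registry[name] = cls  # register under a public name

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...

class WhitespaceTokenizer(WordTokenizer, name="whitespace"):
    """Trivial custom tokenizer: split on whitespace."""
    def tokenize(self, text: str) -> List[str]:
        return text.split()

tok = WordTokenizer.registry["whitespace"]()
assert tok.tokenize("Merhaba dünya .") == ["Merhaba", "dünya", "."]
```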