dafajon opened this issue 4 years ago
This issue will be addressed only at the sentence level. A tokenized document will have lost its sentence-boundary information, which is essential for building a `Doc` object from the input.
A superficial solution at the token level would be to join all tokens with whitespace and build the `Doc` object from the resulting string.
Currently the PR will address: initializing a `Doc` object with a list of pre-separated sentences given as `List[str]`.
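For concreteness, here is a minimal sketch of what such a constructor could look like. The `Doc.from_sentences` name and everything inside it are assumptions for illustration, not the library's actual API:

```python
from typing import List


class Doc:
    def __init__(self, sentences: List[str]):
        self.sentences = sentences

    @classmethod
    def from_sentences(cls, sentences: List[str]) -> "Doc":
        # Sentences are already separated by the user, so
        # sentence-boundary detection is skipped entirely.
        return cls(sentences)


doc = Doc.from_sentences(["First sentence.", "Second sentence."])
```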
Is there a PR for this? You are essentially asking for something similar to a `from_sentence` on `Doc`. Perhaps also a `from_token`?
This was low priority for me, so I did not work on a PR for it. My initial thought on this matter was that a user would supply a list of tokens (`List[str]`) in which sentence boundaries are explicitly marked with an `<eos>` or `</s>` token. The `Doc` object would then be constructed via `from_tokens`.
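A minimal sketch of how such explicit boundary markers could be consumed, assuming a user-chosen `<eos>` marker; the helper name `split_on_eos` is hypothetical:

```python
from typing import List


def split_on_eos(tokens: List[str], eos: str = "<eos>") -> List[List[str]]:
    """Group a flat token list into per-sentence token lists."""
    sentences, current = [], []
    for tok in tokens:
        if tok == eos:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:  # trailing sentence without a final <eos>
        sentences.append(current)
    return sentences


tokens = ["Hello", "world", ".", "<eos>", "How", "are", "you", "?", "<eos>"]
print(split_on_eos(tokens))
# [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]
```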
Note to self: the tokenizer interface should be easily extensible so that users can add their own custom tokenizers if they so desire.
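One possible shape for that extension point, sketched under the assumption of a registry-based design; the `Tokenizer` base class and `register` decorator below are illustrative, not existing code:

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Tokenizer(ABC):
    # Registry mapping a name to a tokenizer class, so users can
    # plug in custom implementations without touching library code.
    registry: Dict[str, Type["Tokenizer"]] = {}

    @classmethod
    def register(cls, name: str):
        def wrap(subclass: Type["Tokenizer"]) -> Type["Tokenizer"]:
            cls.registry[name] = subclass
            return subclass
        return wrap

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...


@Tokenizer.register("simple")
class SimpleTokenizer(Tokenizer):
    def tokenize(self, text: str) -> List[str]:
        return text.split()
```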
Whether users use the `bert` or the `simple` tokenizer, they should be able to feed their tokenized text as `List[str]` or `List[List[str]]` to `Doc` with an `is_tokenized` flag.
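A rough sketch of how the `is_tokenized` flag might dispatch on those two input shapes; the constructor signature here is an assumption about the proposal, not the actual API:

```python
from typing import List, Union


class Doc:
    def __init__(
        self,
        text: Union[str, List[str], List[List[str]]],
        is_tokenized: bool = False,
    ):
        if is_tokenized:
            if text and isinstance(text[0], list):
                # List[List[str]]: tokens already grouped by sentence.
                self.sentences = text
            else:
                # List[str]: one flat token list, treated as a
                # single sentence unless boundary markers are used.
                self.sentences = [list(text)]
        else:
            # Raw string: would fall back to the normal pipeline
            # (sentence splitting + tokenization), omitted here.
            raise NotImplementedError("raw-text pipeline not sketched")


doc = Doc([["Hello", "world", "."], ["How", "are", "you", "?"]],
          is_tokenized=True)
```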