SoluMilken opened this issue 5 years ago
NormalizeOp
- init: get normalizer by id
- fit: pass
- transform: call norm.normalize
- inverse_transform: call norm.denormalize
- state: stateless
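As a sketch, a stateless op with this interface might look like the following. The `Op` shape and the normalizer registry here are assumptions for illustration, not the repo's actual API:

```python
class NormalizeOp:
    """Stateless op: delegates to a normalizer looked up by id."""

    def __init__(self, normalizer_id, registry):
        # hypothetical registry: maps id -> normalizer object
        self.norm = registry[normalizer_id]

    def fit(self, sentences):
        pass  # stateless: nothing to learn

    def transform(self, sentence):
        return self.norm.normalize(sentence)

    def inverse_transform(self, sentence):
        return self.norm.denormalize(sentence)
```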
Tokenize op
- init: input words to be tokenized and get tokenizer by id
- fit: pass
- transform: tokenizer.lcut
- inverse_transform: list of strings to string (join)
- state: stateless
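A minimal sketch of this tokenize op, assuming a jieba-style tokenizer that exposes `lcut` and a join-based inverse (the registry and join separator are assumptions):

```python
class TokenizeOp:
    """Stateless op: wraps a tokenizer looked up by id."""

    def __init__(self, tokenizer_id, registry):
        # hypothetical registry: maps id -> tokenizer object
        self.tokenizer = registry[tokenizer_id]

    def fit(self, sentences):
        pass  # stateless: nothing to learn

    def transform(self, sentence):
        # jieba-style tokenizers expose lcut(), returning a list of tokens
        return self.tokenizer.lcut(sentence)

    def inverse_transform(self, tokens):
        # list of strings back to one string (join)
        return " ".join(tokens)
```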
PadOp
- init: input sos, eos, pad token, maxlen
- fit:
- transform:
- inverse_transform:
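The fit/transform entries above are left blank; a plausible sketch, assuming transform wraps the sequence with sos/eos and pads up to maxlen, and inverse_transform strips the special tokens again:

```python
class PadOp:
    """Wrap a token sequence with sos/eos and pad to a fixed length."""

    def __init__(self, sos_token, eos_token, pad_token, maxlen):
        self.sos, self.eos, self.pad = sos_token, eos_token, pad_token
        self.maxlen = maxlen

    def fit(self, sequences):
        pass  # stateless: maxlen and tokens are given at init

    def transform(self, tokens):
        # reserve 2 slots for sos/eos, truncate, then pad to maxlen
        seq = [self.sos] + tokens[: self.maxlen - 2] + [self.eos]
        return seq + [self.pad] * (self.maxlen - len(seq))

    def inverse_transform(self, tokens):
        specials = {self.sos, self.eos, self.pad}
        return [t for t in tokens if t not in specials]
```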
Token-to-index op
- init: vocab_size, vocabulary, unk_token, important_tokens
- fit:
- transform: map token to index based on vocabulary
- inverse_transform: index to token
- state: vocabulary

The param important_tokens of the token-to-index op should be renamed. I suggest necessary_tokens.
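A sketch of this op, assuming the vocabulary is a `token -> index` dict handed in at init (the class and parameter names follow the spec above, not the repo's code):

```python
class TokenToIndexOp:
    """Map tokens to indices via a fixed vocabulary; state is the vocabulary."""

    def __init__(self, vocab_size, vocabulary, unk_token, important_tokens=()):
        self.vocab_size = vocab_size
        self.vocabulary = vocabulary  # token -> index
        self.unk_index = vocabulary[unk_token]
        self.important_tokens = set(important_tokens)
        self.inverse = {i: t for t, i in vocabulary.items()}

    def fit(self, sentences):
        pass  # vocabulary is given at init in this variant

    def transform(self, tokens):
        # unknown tokens fall back to the unk index
        return [self.vocabulary.get(t, self.unk_index) for t in tokens]

    def inverse_transform(self, indices):
        return [self.inverse[i] for i in indices]
```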
NormalizeOp -> NormOp?
Token to index Op (with hash)
- init: vocab_size, vocabulary, necessary_tokens
- fit: build a limited-size vocabulary from sentences (necessary_tokens must be added)
- transform: map token to index based on vocabulary
- inverse_transform: index to token
- state: vocabulary

PadOp: handing that over to @absolutelyNoWarranty to fix.
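The "(with hash)" detail isn't spelled out above; this sketch only shows `fit()` building the size-capped vocabulary by frequency, with `necessary_tokens` (and the unk token) always included. All names are illustrative:

```python
from collections import Counter

class HashedTokenToIndexOp:
    """Build a limited-size vocabulary in fit(); necessary_tokens always kept."""

    def __init__(self, vocab_size, unk_token="<unk>", necessary_tokens=()):
        self.vocab_size = vocab_size
        self.unk_token = unk_token
        self.necessary_tokens = list(necessary_tokens)
        self.vocabulary = {}  # state: token -> index

    def fit(self, tokenized_sentences):
        counts = Counter(t for sent in tokenized_sentences for t in sent)
        # unk and necessary tokens come first and are always included
        tokens = [self.unk_token] + self.necessary_tokens
        for token, _ in counts.most_common():
            if len(tokens) >= self.vocab_size:
                break
            if token not in tokens:
                tokens.append(token)
        self.vocabulary = {t: i for i, t in enumerate(tokens)}
        self.inverse = {i: t for t, i in self.vocabulary.items()}

    def transform(self, tokens):
        unk = self.vocabulary[self.unk_token]
        return [self.vocabulary.get(t, unk) for t in tokens]

    def inverse_transform(self, indices):
        return [self.inverse[i] for i in indices]
```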
augment_single_partition_utterances
augment_single_partition_utterances_with_upper_bound

identity_text_normalizer
whitespace_char_text_normalizer
basic_text_normalizer_collection
eng_basic_text_normalizer_collection
simplified_punctuation_keeping_text_normalizer_collection
number_with_digits_text_normalizer_collection
number_with_digits_n_simplified_punctuation_text_normalizer_collection
chinese_charactor_text_normalizer_collection_1
chinese_charactor_text_normalizer_collection_3

Word2vecEmbedder
Seq2VecOneHot
Seq2Vec3DEmbedder

CustomJiebaTokenizer
PureWordsTokenizer
ChineseCharTokenizer
PureChineseCharTokenizer
NltkTokenizer
NltkCustomJiebaTokenizer
Normalizers can be covered by #35
We should avoid large ops as much as possible. They are hard to maintain, optimize, or even keep backward-compatible, so we should deprecate the Normalizer op.
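Deprecating the large Normalizer op presumably means composing small ops instead; a minimal hypothetical sketch of such composition, where transform chains the ops and inverse_transform runs them in reverse (none of these names are from the repo):

```python
class Pipeline:
    """Compose small ops; inverse_transform runs in reverse order."""

    def __init__(self, ops):
        self.ops = ops

    def fit(self, data):
        # fit each op on the output of the previous one
        for op in self.ops:
            op.fit(data)
            data = [op.transform(x) for x in data]

    def transform(self, x):
        for op in self.ops:
            x = op.transform(x)
        return x

    def inverse_transform(self, x):
        for op in reversed(self.ops):
            x = op.inverse_transform(x)
        return x
```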
NOTE: changed the title since it's a public repo.
https://github.com/Yoctol/babble/blob/c79647d0150604149abdcd33d3302ba9011b4688/data_loader/text_mapper/text_mapper.py