Ops

normalizeOp

init: get normalizer by id
fit: pass
transform: call norm.normalize
inverse_transform: call norm.denormalize
state: stateless
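
A minimal sketch of how normalizeOp could look, assuming the normalizer is looked up in a plain dict. The `registry` argument and the toy `LowercaseNormalizer` are hypothetical and only there for illustration; the only names taken from the description above are `normalize` and `denormalize`.

```python
class LowercaseNormalizer:
    """Toy normalizer for illustration only (hypothetical, not from strpipe)."""

    def normalize(self, sentence):
        return sentence.lower()

    def denormalize(self, sentence):
        # Lowercasing is lossy, so the best we can do is return the input.
        return sentence


class NormalizeOp:
    """Stateless op that delegates to a normalizer looked up by id."""

    def __init__(self, normalizer_id, registry):
        # `registry` is an assumed mapping from id to normalizer object.
        self.norm = registry[normalizer_id]

    def fit(self, sentences):
        # Nothing to learn: the op keeps no state.
        return None

    def transform(self, sentences):
        return [self.norm.normalize(s) for s in sentences]

    def inverse_transform(self, sentences):
        return [self.norm.denormalize(s) for s in sentences]
```

Usage would be along the lines of `NormalizeOp("lower", {"lower": LowercaseNormalizer()}).transform(["Hello"])`.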

tokenizeOp

init: take the words to be tokenized and get the tokenizer by id
fit: pass
transform: tokenizer.lcut
inverse_transform: join the list of strings back into a single string
state: stateless
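
A rough sketch of tokenizeOp under a few assumptions: the tokenizer is any object exposing an `lcut(sentence)` method that returns a list of strings, it is passed in directly rather than looked up by id, and the `joiner` parameter is hypothetical. The toy `WhitespaceTokenizer` stands in for a real tokenizer.

```python
class WhitespaceTokenizer:
    """Toy tokenizer exposing an `lcut` method (hypothetical stand-in)."""

    def lcut(self, sentence):
        return sentence.split()


class TokenizeOp:
    """Stateless op: string -> list of tokens, and back via join."""

    def __init__(self, tokenizer, joiner=""):
        self.tokenizer = tokenizer
        # Assumed parameter: "" suits Chinese text, " " suits English.
        self.joiner = joiner

    def fit(self, sentences):
        # Nothing to learn: the op keeps no state.
        return None

    def transform(self, sentences):
        return [self.tokenizer.lcut(s) for s in sentences]

    def inverse_transform(self, tokenized_sentences):
        return [self.joiner.join(tokens) for tokens in tokenized_sentences]
```

Note that the inverse only round-trips exactly when the tokenizer does not drop characters such as whitespace.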

padOp

init: sos, eos, and pad tokens, plus maxlen
fit:
- (1) check that the default tokens do not appear in the sentences
- (2) compute the maxlen of the sentences
state: maxlen + 2 if both sos and eos exist.
transform:
- (1) add the sos and eos tokens
- (2) pad to a fixed length
inverse_transform:
- (1) unpad the sentences
- (2) remove sos and eos
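
The padOp steps above could be sketched as follows. This is only one reading of the description: the default token strings, the `padded_len` attribute, the rule that an explicit `maxlen` overrides the fitted one, and the truncation of over-long sentences are all assumptions.

```python
class PadOp:
    """Adds sos/eos and pads token lists to a fixed length."""

    def __init__(self, sos_token="<sos>", eos_token="<eos>",
                 pad_token="<pad>", maxlen=None):
        self.sos_token = sos_token
        self.eos_token = eos_token
        self.pad_token = pad_token
        self.maxlen = maxlen      # body length, excluding sos/eos
        self.padded_len = None    # state: set by fit

    def fit(self, sentences):
        reserved = {t for t in (self.sos_token, self.eos_token, self.pad_token)
                    if t is not None}
        # (1) the special tokens must not already appear in the data
        for sent in sentences:
            clash = reserved.intersection(sent)
            if clash:
                raise ValueError(f"reserved tokens found in input: {clash}")
        # (2) compute maxlen unless it was given explicitly (assumption)
        if self.maxlen is None:
            self.maxlen = max((len(s) for s in sentences), default=0)
        n_special = (self.sos_token is not None) + (self.eos_token is not None)
        self.padded_len = self.maxlen + n_special

    def transform(self, sentences):
        out = []
        for sent in sentences:
            body = list(sent[:self.maxlen])  # truncation is an assumption
            if self.sos_token is not None:
                body = [self.sos_token] + body
            if self.eos_token is not None:
                body = body + [self.eos_token]
            body += [self.pad_token] * (self.padded_len - len(body))
            out.append(body)
        return out

    def inverse_transform(self, sentences):
        out = []
        for sent in sentences:
            body = list(sent)
            # (1) unpad: strip trailing pad tokens
            while body and body[-1] == self.pad_token:
                body.pop()
            # (2) remove sos and eos if present
            if body and body[0] == self.sos_token:
                body = body[1:]
            if body and body[-1] == self.eos_token:
                body = body[:-1]
            out.append(body)
        return out
```

With both sos and eos set, the fitted `padded_len` is `maxlen + 2`, matching the state described above.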

token to index (with unk token)

init: vocab_size, vocabulary, unk_token, important_tokens
fit:
- (1) check that the unk token does not appear in the sentences
- (2) build a size-limited vocabulary from the sentences, always including the unk token and the important_tokens
transform: map token to index based on vocabulary
inverse_transform: index to token
state: vocabulary
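
A possible shape for the token to index op, written as a sketch rather than a definitive design. The class name, the default unk string, and the choice to fill the remaining slots by token frequency are assumptions; the optional prebuilt `vocabulary` argument from the init list is left out for brevity.

```python
from collections import Counter


class TokenToIndexOp:
    """Maps tokens to indices, sending unknown tokens to unk."""

    def __init__(self, vocab_size, unk_token="<unk>", important_tokens=()):
        self.vocab_size = vocab_size
        self.unk_token = unk_token
        self.important_tokens = tuple(important_tokens)
        self.index2token = []   # state: the vocabulary
        self.token2index = {}

    def fit(self, sentences):
        # (1) the unk token must not already appear in the data
        for sent in sentences:
            if self.unk_token in sent:
                raise ValueError("unk token found in input sentences")
        # (2) reserved tokens first, then the most frequent tokens
        #     (frequency ordering is an assumption) up to vocab_size
        vocab = [self.unk_token]
        for tok in self.important_tokens:
            if tok not in vocab:
                vocab.append(tok)
        counts = Counter(tok for sent in sentences for tok in sent)
        for tok, _ in counts.most_common():
            if len(vocab) >= self.vocab_size:
                break
            if tok not in vocab:
                vocab.append(tok)
        self.index2token = vocab
        self.token2index = {tok: i for i, tok in enumerate(vocab)}

    def transform(self, sentences):
        unk_index = self.token2index[self.unk_token]
        return [[self.token2index.get(tok, unk_index) for tok in sent]
                for sent in sentences]

    def inverse_transform(self, indexed_sentences):
        return [[self.index2token[i] for i in sent]
                for sent in indexed_sentences]
```

The inverse is lossy for out-of-vocabulary tokens, since they all come back as the unk token.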

stegben commented 5 years ago

The important_tokens param of token to index should be renamed. I suggest necessary_token.

SoluMilken commented 5 years ago

NormalizeOp -> NormOp?

SoluMilken commented 5 years ago

token to index Op (with hash)

init: vocab_size, vocabulary, necessary_tokens
fit: build a size-limited vocabulary from the sentences, always including the necessary_tokens
transform: map token to index based on vocabulary
inverse_transform: index to token
state: vocabulary
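
The comment does not spell out how the hash is used, so the following is only one possible reading: in-vocabulary tokens get their vocabulary index, and out-of-vocabulary tokens are hashed into the same index range instead of going to an unk token. The class name, the use of `zlib.crc32` (chosen because Python's built-in `hash` on strings varies between processes), and the frequency-based vocabulary are all assumptions.

```python
import zlib
from collections import Counter


class HashTokenToIndexOp:
    """Maps tokens to indices; unknown tokens are hashed into the vocab range."""

    def __init__(self, vocab_size, necessary_tokens=()):
        self.vocab_size = vocab_size
        self.necessary_tokens = tuple(necessary_tokens)
        self.index2token = []   # state: the vocabulary
        self.token2index = {}

    def fit(self, sentences):
        # necessary tokens first (deduplicated, order preserved)
        vocab = list(dict.fromkeys(self.necessary_tokens))
        counts = Counter(tok for sent in sentences for tok in sent)
        for tok, _ in counts.most_common():
            if len(vocab) >= self.vocab_size:
                break
            if tok not in vocab:
                vocab.append(tok)
        if not vocab:
            raise ValueError("cannot build an empty vocabulary")
        self.index2token = vocab
        self.token2index = {tok: i for i, tok in enumerate(vocab)}

    def _hash_index(self, token):
        # Deterministic hash, folded into the vocabulary's index range.
        return zlib.crc32(token.encode("utf-8")) % len(self.index2token)

    def transform(self, sentences):
        out = []
        for sent in sentences:
            row = []
            for tok in sent:
                if tok in self.token2index:
                    row.append(self.token2index[tok])
                else:
                    row.append(self._hash_index(tok))
            out.append(row)
        return out

    def inverse_transform(self, indexed_sentences):
        # Lossy for hashed tokens: they come back as whichever
        # vocabulary token shares their index.
        return [[self.index2token[i] for i in sent]
                for sent in indexed_sentences]
```

Under this reading, an unknown token collides with a real vocabulary entry, which is the usual trade-off of hashing without dedicated buckets.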

SoluMilken commented 5 years ago

I'll leave PadOp to @absolutelyNoWarranty to fix.

absolutelyNoWarranty commented 5 years ago

utterance augmentors - not part of strpipe

- augment_single_partition_utterances
- augment_single_partition_utterances_with_upper_bound

text normalizers - #35

- identity_text_normalizer
- whitespace_char_text_normalizer
- basic_text_normalizer_collection
- eng_basic_text_normalizer_collection
- simplified_punctuation_keeping_text_normalizer_collection
- number_with_digits_text_normalizer_collection
- number_with_digits_n_simplified_punctuation_text_normalizer_collection
- chinese_charactor_text_normalizer_collection_1
- chinese_charactor_text_normalizer_collection_3

embedders - not part of strpipe

- Word2vecEmbedder
- Seq2VecOneHot
- Seq2Vec3DEmbedder

tokenizers - #33 #46

- CustomJiebaTokenizer
- PureWordsTokenizer
- ChineseCharTokenizer
- PureChineseCharTokenizer
- NltkTokenizer
- NltkCustomJiebaTokenizer

stegben commented 5 years ago

Normalizers can be covered by #35

stegben commented 5 years ago

We should avoid large ops as much as possible. They are hard to maintain, optimize, or even keep backward-compatible, so we should deprecate the Normalizer op.

stegben commented 5 years ago

NOTE: change title name since it's a public repo.

Yoctol / strpipe

Candidate ops #27
