Yoctol / strpipe

text preprocessing pipeline

Candidate ops #27

Open SoluMilken opened 5 years ago

absolutelyNoWarranty commented 5 years ago

https://github.com/Yoctol/babble/blob/c79647d0150604149abdcd33d3302ba9011b4688/data_loader/text_mapper/text_mapper.py

absolutelyNoWarranty commented 5 years ago

Ops

normalizeOp

tokenizeOp

padOp

token to index (with unk token)
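A rough sketch of how these four ops might chain together. This is not the strpipe API: the function names, the special tokens, and the toy vocabulary are all assumptions made up for illustration.

```python
import re
from typing import Dict, List

# Hypothetical special tokens; the real names in strpipe may differ.
UNK, PAD = "<unk>", "<pad>"


def normalize(text: str) -> str:
    # Lowercase, trim, and collapse runs of whitespace into one space.
    return re.sub(r"\s+", " ", text.strip().lower())


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer, standing in for a real one.
    return text.split()


def pad(tokens: List[str], maxlen: int) -> List[str]:
    # Truncate to maxlen, then right-pad with PAD up to maxlen.
    tokens = tokens[:maxlen]
    return tokens + [PAD] * (maxlen - len(tokens))


def token_to_index(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    # Tokens missing from the vocabulary fall back to the UNK index.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


vocab = {PAD: 0, UNK: 1, "hello": 2, "world": 3}
tokens = tokenize(normalize("  Hello   strange World "))
indices = token_to_index(pad(tokens, 5), vocab)
print(indices)  # [2, 1, 3, 0, 0]
```

Here "strange" is out of vocabulary, so it maps to the unk index, and the two trailing zeros come from padding.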

stegben commented 5 years ago

The important_tokens parameter of the token-to-index op should be renamed. I suggest necessary_token.

SoluMilken commented 5 years ago

NormalizeOp -> NormOp?

SoluMilken commented 5 years ago

token to index Op (with hash)
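For reference, one way a hash-based variant could look. This is only a sketch under my own assumptions: the function name and the num_buckets parameter are hypothetical, and MD5 is used just because it is stable across runs (Python's built-in hash() is salted per process for strings).

```python
import hashlib
from typing import List


def hash_token_to_index(tokens: List[str], num_buckets: int) -> List[int]:
    # Map each token to a bucket in [0, num_buckets) without a vocabulary.
    indices = []
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an unsigned integer.
        indices.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return indices


ids = hash_token_to_index(["hello", "world", "hello"], num_buckets=1000)
```

The trade-off is that no unk token is needed, but distinct tokens can collide in the same bucket and the mapping cannot be inverted.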

SoluMilken commented 5 years ago

I'll leave fixing PadOp to @absolutelyNoWarranty.

absolutelyNoWarranty commented 5 years ago

utterance augmentors - not part of strpipe

augment_single_partition_utterances

augment_single_partition_utterances_with_upper_bound

text normalizers - #35

identity_text_normalizer

whitespace_char_text_normalizer

basic_text_normalizer_collection

eng_basic_text_normalizer_collection

simplified_punctuation_keeping_text_normalizer_collection

number_with_digits_text_normalizer_collection

number_with_digits_n_simplified_punctuation_text_normalizer_collection

chinese_charactor_text_normalizer_collection_1

chinese_charactor_text_normalizer_collection_3

embedders - not part of strpipe

Word2vecEmbedder

Seq2VecOneHot

Seq2Vec3DEmbedder

tokenizers - #33 #46

CustomJiebaTokenizer

PureWordsTokenizer

ChineseCharTokenizer

PureChineseCharTokenizer

NltkTokenizer

NltkCustomJiebaTokenizer

stegben commented 5 years ago

Normalizers can be covered by #35

stegben commented 5 years ago

We should avoid large ops as much as possible. They are hard to maintain, hard to optimize, and hard to keep backward-compatible, so we should deprecate the Normalizer op.
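To illustrate the idea, a large normalizer could be replaced by a chain of small single-purpose ops. This is a sketch only: compose, the three op names, and the _num_ placeholder are hypothetical and not taken from strpipe or babble.

```python
import re
from functools import reduce
from typing import Callable, List

TextOp = Callable[[str], str]


def compose(ops: List[TextOp]) -> TextOp:
    # Apply the ops left to right, feeding each output into the next op.
    return lambda text: reduce(lambda acc, op: op(acc), ops, text)


# Each op does exactly one thing, so it can be tested and replaced alone.
lowercase: TextOp = str.lower
collapse_whitespace: TextOp = lambda s: re.sub(r"\s+", " ", s).strip()
mask_digits: TextOp = lambda s: re.sub(r"\d+", "_num_", s)

normalizer = compose([lowercase, collapse_whitespace, mask_digits])
result = normalizer("  Call  ME at 12345 ")
print(result)  # call me at _num_
```

Each existing normalizer collection would then just be a different list of small ops passed to the same composer.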

stegben commented 5 years ago

NOTE: change title name since it's a public repo.