Closed (txdat closed 5 years ago)
Use a Cython module to wrap the C++ code.
Add `segment_general` (based on `segment_original`) in `tokenizer/tokenizer.hpp`, which:
- keeps punctuation
- preserves case (case-sensitive)
- removes empty (space/underscore) tokens
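
To illustrate the last point, removing empty (space/underscore) tokens, here is a minimal sketch of a post-filter. It is not the code from this PR: `is_empty_token` and `drop_empty_tokens` are hypothetical names, and the sketch assumes that an "empty" token is one made up only of spaces and underscores (or nothing at all). Punctuation and casing are left untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): true when the token contains nothing
// but spaces and underscores. A zero-length string also counts as empty.
bool is_empty_token(const std::string& token) {
    return std::all_of(token.begin(), token.end(),
                       [](char c) { return c == ' ' || c == '_'; });
}

// Hypothetical post-processing step (assumption): copy every non-empty token
// into the result, preserving order, punctuation, and original casing.
std::vector<std::string> drop_empty_tokens(const std::vector<std::string>& tokens) {
    std::vector<std::string> kept;
    kept.reserve(tokens.size());
    std::copy_if(tokens.begin(), tokens.end(), std::back_inserter(kept),
                 [](const std::string& t) { return !is_empty_token(t); });
    return kept;
}
```

Under these assumptions, passing `{"Hello", " ", ",", "_", "World", ""}` to `drop_empty_tokens` keeps only `"Hello"`, `","`, and `"World"`.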