Re: preserving upper/lower case in the Java wrapper

behitek commented 4 years ago

Thanks to Cốc Cốc for developing a word segmentation toolkit that is both highly accurate and very fast.

I'm trying it out and noticed that the Java build converts all text to lowercase. I looked through the Java code but could not find where this happens, and I haven't figured out how to change it.

$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "một câu văn tiếng Việt"
một     câu_văn     tiếng_việt  .

Any help would be appreciated!

anhducle98 commented 4 years ago

Please check the branch https://github.com/coccoc/coccoc-tokenizer/tree/java-case-senstitive. I've just made a quick fix there to show that it's possible, by gathering characters from the original text within each token's [originalStartPos, originalEndPos).

The branch is not mergeable yet because it still has duplicated code fragments and doesn't include generated spaces (for sticky text).
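
For readers following along, the idea described above (taking each token's characters from the original text within `[originalStartPos, originalEndPos)`) might look roughly like the sketch below. This is not the code from the branch. The `Span` class, the `restoreCase` helper, and the assumption that positions are code point offsets are all hypothetical and used only for illustration. Like the branch, it does not handle generated spaces for sticky text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CaseRestoreSketch {
    /** Hypothetical token span: code point offsets [start, end) into the original text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Rebuilds each token's surface form from the original (case-preserving) text
     * instead of using the lowercased token text. Spaces inside a multi-syllable
     * token are replaced with underscores, mirroring the tokenizer's output style.
     * Assumption: offsets count code points, not UTF-16 chars.
     */
    static List<String> restoreCase(String original, List<Span> spans) {
        int[] codePoints = original.codePoints().toArray();
        List<String> result = new ArrayList<>();
        for (Span span : spans) {
            String surface = new String(codePoints, span.start, span.end - span.start);
            result.add(surface.replace(' ', '_'));
        }
        return result;
    }

    public static void main(String[] args) {
        // "một câu văn tiếng Việt", written with escapes to avoid source-encoding issues.
        String original = "m\u1ED9t c\u00E2u v\u0103n ti\u1EBFng Vi\u1EC7t";
        // Hypothetical spans a tokenizer might report for this sentence.
        List<Span> spans = Arrays.asList(new Span(0, 3), new Span(4, 11), new Span(12, 22));
        System.out.println(String.join("\t", restoreCase(original, spans)));
    }
}
```

With the sample spans, the last token comes back as `tiếng_Việt` rather than `tiếng_việt`, because the characters are read from the untouched input rather than from the normalized token text.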

behitek commented 4 years ago

Thank you, it's working fine :100: But I think this should be the default option, and the keep_puncts parameter should be kept!

Re: preserving upper/lower case in the Java wrapper #11