WorksApplications
/
SudachiTra
Japanese tokenizer for Transformers
Apache License 2.0
77
stars
10
forks
issues
Fixes #66 - sudachitra not being compatible with transformers version newer than 4.34
#67
mingboiz
closed
8 months ago
5
sudachitra and other custom tokenizers no longer compatible with transformers later than 4.34
#66
mingboiz
closed
8 months ago
4
Can I use a user dictionary?
#65
mumumu09chi
opened
1 year ago
2
The entry of `\n` in `vocab.txt` is causing token index shifting
#64
hiroshi-matsuda-rit
opened
1 year ago
0
Introduce token-based authentication for PyPI
#63
mh-northlander
opened
1 year ago
0
setup.py install is deprecated.
#62
mh-northlander
opened
1 year ago
0
Update python-publish workflow
#61
mh-northlander
closed
1 year ago
2
Python publish workflow is not kicked on the release
#60
mh-northlander
closed
1 year ago
1
Prepare for chiTra-1.1
#59
mh-northlander
closed
1 year ago
0
Prepare for v0.1.8
#58
mh-northlander
closed
1 year ago
0
Vocabulary file handling
#57
mh-northlander
opened
1 year ago
0
Add changelog file
#56
mh-northlander
closed
1 year ago
0
Add patch file for the JGLUE evaluation
#55
mh-northlander
closed
1 year ago
0
Allow to save vocab with non-consecutive indices
#54
mh-northlander
closed
1 year ago
3
Allow empty line in the vocab file
#53
mh-northlander
closed
1 year ago
0
Evaluate model with JGLUE
#52
mh-northlander
closed
1 year ago
0
tokenizer.model_max_length is incorrect
#51
mh-northlander
opened
1 year ago
1
Feather/add normalized nouns
#50
katsutan
closed
1 year ago
0
add workflow_dispatch
#49
t-yamamura
closed
1 year ago
2
Support 接尾辞-動詞的 and 接尾辞-形容詞的
#48
KoichiYasuoka
closed
1 year ago
4
update document with the release of pretraining models
#47
t-yamamura
closed
2 years ago
0
fix README for pretraining
#46
t-yamamura
closed
2 years ago
0
Update README for pretraining
#45
t-yamamura
closed
2 years ago
0
Update README for pretraing
#44
t-yamamura
closed
2 years ago
0
Tokenizer initializations behave differently
#43
mh-northlander
opened
2 years ago
0
Add to the test for alignments of encoded tokens by `JapaneseBertWordPieceTokenizer`
#42
t-yamamura
opened
2 years ago
0
use `pathlib` instead of `os.path`
#41
t-yamamura
opened
2 years ago
0
pretraining by NVIDIA
#40
katsutan
closed
2 years ago
1
Make `split_dataset.py` support huge file input.
#39
t-yamamura
closed
2 years ago
2
Feature/use huggingface compatible pretokenizer
#38
t-yamamura
closed
2 years ago
1
Add scripts for the model evaluation
#37
mh-northlander
closed
1 year ago
2
use PosMatcher instead of `part_of_speech()`
#36
t-yamamura
closed
2 years ago
0
Feature/conjugation preserving normalize for subword
#35
t-yamamura
closed
2 years ago
0
Fix/modify merged preprocessing codes
#34
t-yamamura
closed
2 years ago
0
Use scripts for pretraining implemented by NVIDIA
#33
t-yamamura
closed
2 years ago
0
Feature/add cleaning and preprocessing
#32
t-yamamura
closed
2 years ago
0
add normalizer that leaved conjugation
#31
katsutan
closed
2 years ago
2
require sudachipy>=0.6.0
#30
t-yamamura
closed
2 years ago
0
remove slow tokenizer
#29
t-yamamura
closed
2 years ago
0
remove slow tokenizer
#28
t-yamamura
closed
2 years ago
0
add NFKC normalization
#27
t-yamamura
closed
2 years ago
0
use NFKC as preprocessing
#26
t-yamamura
closed
2 years ago
0
remove lowercase normalizer
#25
t-yamamura
closed
2 years ago
0
Remove lowercase normalizer
#24
t-yamamura
closed
2 years ago
0
Add preprocessing for cleaning up corpus
#23
t-yamamura
closed
2 years ago
0
Replace SudachiPy with sudachi.rs
#22
t-yamamura
closed
2 years ago
0
improve default configurations
#21
hiroshi-matsuda-rit
closed
3 years ago
0
fix slow tokenizer
#20
t-yamamura
closed
3 years ago
0
add slow tokenizer
#19
t-yamamura
closed
3 years ago
0
Re-register submodule
#18
t-yamamura
closed
3 years ago
0