WorksApplications SudachiTra issues

Japanese tokenizer for Transformers

Apache License 2.0

77 stars 10 forks

issues


Fixes #66 - sudachitra not being compatible with transformers version newer than 4.34

#67 mingboiz closed 8 months ago
5
sudachitra and other custom tokenizers no longer compatible with transformers later than 4.34

#66 mingboiz closed 8 months ago
4
Can I use a user dictionary?

#65 mumumu09chi opened 1 year ago
2
The entry of `\n` in `vocab.txt` is causing token index shifting

#64 hiroshi-matsuda-rit opened 1 year ago
0
Introduce token-based authentication for PyPI

#63 mh-northlander opened 1 year ago
0
setup.py install is deprecated.

#62 mh-northlander opened 1 year ago
0
Update python-publish workflow

#61 mh-northlander closed 1 year ago
2
Python publish workflow is not kicked on the release

#60 mh-northlander closed 1 year ago
1
Prepare for chiTra-1.1

#59 mh-northlander closed 1 year ago
0
Prepare for v0.1.8

#58 mh-northlander closed 1 year ago
0
Vocabulary file handling

#57 mh-northlander opened 1 year ago
0
Add changelog file

#56 mh-northlander closed 1 year ago
0
Add patch file for the JGLUE evaluation

#55 mh-northlander closed 1 year ago
0
Allow to save vocab with non-consecutive indices

#54 mh-northlander closed 1 year ago
3
Allow empty line in the vocab file

#53 mh-northlander closed 1 year ago
0
Evaluate model with JGLUE

#52 mh-northlander closed 1 year ago
0
tokenizer.model_max_length is incorrect

#51 mh-northlander opened 1 year ago
1
Feather/add normalized nouns

#50 katsutan closed 1 year ago
0
add workflow_dispatch

#49 t-yamamura closed 1 year ago
2
Support 接尾辞-動詞的 and 接尾辞-形容詞的

#48 KoichiYasuoka closed 1 year ago
4
update document with the release of pretraining models

#47 t-yamamura closed 2 years ago
0
fix README for pretraining

#46 t-yamamura closed 2 years ago
0
Update README for pretraining

#45 t-yamamura closed 2 years ago
0
Update README for pretraing

#44 t-yamamura closed 2 years ago
0
Tokenizer initializations behave differently

#43 mh-northlander opened 2 years ago
0
Add to the test for alignments of encoded tokens by `JapaneseBertWordPieceTokenizer`

#42 t-yamamura opened 2 years ago
0
use `pathlib` instead of `os.path`

#41 t-yamamura opened 2 years ago
0
pretraining by NVIDIA

#40 katsutan closed 2 years ago
1
Make `split_dataset.py` support huge file input.

#39 t-yamamura closed 2 years ago
2
Feature/use huggingface compatible pretokenizer

#38 t-yamamura closed 2 years ago
1
Add scripts for the model evaluation

#37 mh-northlander closed 1 year ago
2
use PosMatcher instead of `part_of_speech()`

#36 t-yamamura closed 2 years ago
0
Feature/conjugation preserving normalize for subword

#35 t-yamamura closed 2 years ago
0
Fix/modify merged preprocessing codes

#34 t-yamamura closed 2 years ago
0
Use scripts for pretraining implemented by NVIDIA

#33 t-yamamura closed 2 years ago
0
Feature/add cleaning and preprocessing

#32 t-yamamura closed 2 years ago
0
add normalizer that leaved conjugation

#31 katsutan closed 2 years ago
2
require sudachipy>=0.6.0

#30 t-yamamura closed 2 years ago
0
remove slow tokenizer

#29 t-yamamura closed 2 years ago
0
remove slow tokenizer

#28 t-yamamura closed 2 years ago
0
add NFKC normalization

#27 t-yamamura closed 2 years ago
0
use NFKC as preprocessing

#26 t-yamamura closed 2 years ago
0
remove lowercase normalizer

#25 t-yamamura closed 2 years ago
0
Remove lowercase normalizer

#24 t-yamamura closed 2 years ago
0
Add preprocessing for cleaning up corpus

#23 t-yamamura closed 2 years ago
0
Replace SudachiPy with sudachi.rs

#22 t-yamamura closed 2 years ago
0
improve default configurations

#21 hiroshi-matsuda-rit closed 3 years ago
0
fix slow tokenizer

#20 t-yamamura closed 3 years ago
0
add slow tokenizer

#19 t-yamamura closed 3 years ago
0
Re-register submodule

#18 t-yamamura closed 3 years ago
0