Hk669 / bpetokenizer
(Python package) Train your own tokenizer based on the BPE algorithm, for LLMs (supports regex patterns and special tokens).
https://pypi.org/project/bpetokenizer/
2 stars · 1 fork
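For orientation, here is a minimal training sketch. The `BPETokenizer` class and the `pattern`/`special_tokens`/`train`/`save` names are assumptions inferred from the project description and the issue titles below, not a confirmed API.

```python
# Minimal sketch, assuming the package exposes a BPETokenizer class
# whose constructor takes a regex split pattern and a special-tokens map
# (names inferred from the project description; not a verified API).
from bpetokenizer import BPETokenizer

# GPT-4-style split pattern and the token ids here are illustrative.
PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
SPECIALS = {"<|startoftext|>": 1001, "<|endoftext|>": 1002}

tokenizer = BPETokenizer(pattern=PATTERN, special_tokens=SPECIALS)
tokenizer.train(open("corpus.txt", encoding="utf-8").read(),
                vocab_size=1000, verbose=True)  # verbose: see issues #9/#10
tokenizer.save("my_tokenizer")  # persists merges/vocab to disk
```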
Issues
Use Go for multithreading to increase performance
#16 · Hk669 · opened 3 weeks ago · 2 comments
deprecate: file mode in load and save
#15 · Hk669 · closed 3 weeks ago · 0 comments
fix: special_tokens in the encode method to support special tokens in the vocab
#13 · Hk669 · closed 3 weeks ago · 0 comments
`special_tokens` in the encode method doesn't work for the BPETokenizer
#12 · Hk669 · closed 3 weeks ago · 0 comments
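Since #12 and #13 concern how `encode` should handle special tokens, here is a hedged sketch of the intended post-fix behavior; the `special_tokens="all"` argument is an assumption drawn from the titles, not a documented signature.

```python
# Sketch of the behavior #13 fixes: special tokens present in the vocab
# should survive an encode/decode round trip. The special_tokens="all"
# argument name is an assumption, not a documented signature.
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer(special_tokens={"<|endoftext|>": 1001})
tokenizer.train("hello world " * 100, vocab_size=280)

text = "hello world<|endoftext|>"
ids = tokenizer.encode(text, special_tokens="all")  # include specials in output ids
assert tokenizer.decode(ids) == text                # round-trips intact
```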
Updates for the pretrained tokenizers
#11 · Hk669 · closed 3 weeks ago · 0 comments
feat: start/end time and throughput added to verbose output
#10 · Hk669 · closed 3 weeks ago · 0 comments
Start/end time and throughput are needed to estimate how long training the tokenizer will take
#9 · Hk669 · closed 3 weeks ago · 1 comment
deprecated the `__version__` check when loading
#8 · Hk669 · closed 3 weeks ago · 0 comments
Deprecate the `__version__` check when loading the tokenizer
#7 · Hk669 · closed 3 weeks ago · 1 comment
feat: `from_pretrained` enabled with `wi17k_base`
#6 · Hk669 · closed 3 weeks ago · 0 comments
Make the `from_pretrained` method available to load pretrained tokenizers
#5 · Hk669 · closed 3 weeks ago · 1 comment
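Issues #5 and #6 add a `from_pretrained` entry point shipping a `wi17k_base` tokenizer; a sketch of how loading it would presumably look (the classmethod form is assumed from the titles):

```python
# Sketch of loading the bundled pretrained tokenizer from #5/#6.
# The classmethod form is assumed; "wi17k_base" comes from the titles.
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("wi17k_base")
print(tokenizer.encode("hello world"))
```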
Deprecate the save/load `mode="file"` for the tokenizer
#4 · Hk669 · closed 3 weeks ago · 0 comments
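Per #15 and #4, the `mode="file"` save/load path is being retired; a before/after sketch, assuming a JSON artifact becomes the default (the default format and the `load` signature are assumptions):

```python
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()  # trained elsewhere; construction shown for completeness
# Before the deprecation (per #15/#4): tokenizer.save("name", mode="file")
# After, assuming a JSON artifact becomes the default (an assumption):
tokenizer.save("my_tokenizer")       # writes e.g. my_tokenizer.json
tokenizer.load("my_tokenizer.json")  # restore merges/vocab from disk
```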
Add batch preprocessing if the training dataset is huge
#1 · Hk669 · closed 3 weeks ago · 8 comments
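#1 asks for batch preprocessing when the training corpus is huge; here is a generic chunked-reading sketch of the idea, illustrative rather than the project's actual implementation:

```python
# Generic chunked reading for huge corpora (illustrative only; not the
# project's implementation). Bounds memory by yielding fixed-size batches.
from typing import Iterator

def iter_batches(path: str, batch_size: int = 10_000) -> Iterator[str]:
    """Yield the training file in batches of lines."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield "".join(batch)
                batch = []
    if batch:
        yield "".join(batch)

# Hypothetical use: accumulate byte-pair counts per batch rather than
# loading the entire corpus into memory at once.
for chunk in iter_batches("corpus.txt"):
    pass  # e.g., update pair-frequency counts with this chunk
```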