Hk669 / bpetokenizer
(Python package) Train your own tokenizer based on the BPE algorithm, for LLMs (supports regex patterns and special tokens).
https://pypi.org/project/bpetokenizer/
2 stars · 1 fork
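For orientation, here is a minimal training sketch. The `BPETokenizer` class and the `pattern`/`special_tokens`/`train`/`save` names are assumptions inferred from the project description and the issue titles below, not a confirmed API.

```python
# Minimal sketch, assuming the package exposes a BPETokenizer class
# whose constructor takes a regex split pattern and a special-tokens map
# (names inferred from the project description; not a verified API).
from bpetokenizer import BPETokenizer

# GPT-4-style split pattern and the token ids here are illustrative.
PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
SPECIALS = {"<|startoftext|>": 1001, "<|endoftext|>": 1002}

tokenizer = BPETokenizer(pattern=PATTERN, special_tokens=SPECIALS)
tokenizer.train(open("corpus.txt", encoding="utf-8").read(),
                vocab_size=1000, verbose=True)  # verbose: see issues #9/#10
tokenizer.save("my_tokenizer")  # persists merges/vocab to disk
```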
Issues
Use Go for multithreading to increase performance
#16 · Hk669 · opened 3 weeks ago · 2 comments
deprecate: file mode in load and save
#15 · Hk669 · closed 3 weeks ago · 0 comments
fix: special_tokens in the encode method to support special tokens in the vocab
#13 · Hk669 · closed 3 weeks ago · 0 comments
`special_tokens` in the encode method doesn't work for the BPETokenizer
#12 · Hk669 · closed 3 weeks ago · 0 comments
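Since #12 and #13 concern how `encode` should handle special tokens, here is a hedged sketch of the intended post-fix behavior; the `special_tokens="all"` argument is an assumption drawn from the titles, not a documented signature.

```python
# Sketch of the behavior #13 fixes: special tokens present in the vocab
# should survive an encode/decode round trip. The special_tokens="all"
# argument name is an assumption, not a documented signature.
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer(special_tokens={"<|endoftext|>": 1001})
tokenizer.train("hello world " * 100, vocab_size=280)

text = "hello world<|endoftext|>"
ids = tokenizer.encode(text, special_tokens="all")  # include specials in output ids
assert tokenizer.decode(ids) == text                # round-trips intact
```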
Updates for the pretrained tokenizers
#11 · Hk669 · closed 3 weeks ago · 0 comments
feat: start/end time and throughput added to verbose output
#10 · Hk669 · closed 3 weeks ago · 0 comments
Start/end time and throughput are needed to estimate how long training the tokenizer will take
#9 · Hk669 · closed 3 weeks ago · 1 comment
deprecated the `__version__` check when loading
#8 · Hk669 · closed 3 weeks ago · 0 comments
Deprecate the `__version__` check when loading the tokenizer
#7 · Hk669 · closed 3 weeks ago · 1 comment
feat: `from_pretrained` enabled with `wi17k_base`
#6 · Hk669 · closed 3 weeks ago · 0 comments
Make the `from_pretrained` method available to load pretrained tokenizers
#5 · Hk669 · closed 3 weeks ago · 1 comment
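Issues #5 and #6 add a `from_pretrained` entry point shipping a `wi17k_base` tokenizer; a sketch of how loading it would presumably look (the classmethod form is assumed from the titles):

```python
# Sketch of loading the bundled pretrained tokenizer from #5/#6.
# The classmethod form is assumed; "wi17k_base" comes from the titles.
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("wi17k_base")
print(tokenizer.encode("hello world"))
```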
Deprecate the save/load `mode="file"` for the tokenizer
#4 · Hk669 · closed 3 weeks ago · 0 comments
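Per #15 and #4, the `mode="file"` save/load path is being retired; a before/after sketch, assuming a JSON artifact becomes the default (the default format and the `load` signature are assumptions):

```python
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()  # trained elsewhere; construction shown for completeness
# Before the deprecation (per #15/#4): tokenizer.save("name", mode="file")
# After, assuming a JSON artifact becomes the default (an assumption):
tokenizer.save("my_tokenizer")       # writes e.g. my_tokenizer.json
tokenizer.load("my_tokenizer.json")  # restore merges/vocab from disk
```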
Add batch preprocessing if the training dataset is huge
#1 · Hk669 · closed 3 weeks ago · 8 comments
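#1 asks for batch preprocessing when the training corpus is huge; here is a generic chunked-reading sketch of the idea, illustrative rather than the project's actual implementation:

```python
# Generic chunked reading for huge corpora (illustrative only; not the
# project's implementation). Bounds memory by yielding fixed-size batches.
from typing import Iterator

def iter_batches(path: str, batch_size: int = 10_000) -> Iterator[str]:
    """Yield the training file in batches of lines."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield "".join(batch)
                batch = []
    if batch:
        yield "".join(batch)

# Hypothetical use: accumulate byte-pair counts per batch rather than
# loading the entire corpus into memory at once.
for chunk in iter_batches("corpus.txt"):
    pass  # e.g., update pair-frequency counts with this chunk
```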