ankane / tokenizers-ruby

Fast state-of-the-art tokenizers for Ruby
Apache License 2.0

Add ByteFallback, Fuse, Replace, and Strip decoders. Added Prepend normalizer. Also added byte_fallback config option to BPE tokenizer. #29

Closed: petergoldstein closed this 1 year ago

petergoldstein commented 1 year ago

This includes a general cargo update to pick up the tokenizers update, which also updated a number of other dependencies.

There is one odd thing in the Python implementation that I chose to do a little differently. In Python, the named parameters on the Strip decoder constructor are left and right, but the getters/setters are for start and stop. Originally the getters/setters were also left and right; they were later renamed, while the constructor parameters were left unchanged. I chose to make them consistent in the Ruby binding and use start and stop throughout.
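To illustrate the naming choice, here is a minimal pure-Ruby sketch. The class name StripSketch is hypothetical and is not the gem's Rust-backed class. It assumes the decoder removes up to start leading and up to stop trailing copies of a content character from each token, and it only shows the constructor keywords matching the accessors.

```ruby
# Hypothetical stand-in for illustration only; not the binding's implementation.
class StripSketch
  # Accessor names match the constructor keywords (start/stop, not left/right).
  attr_accessor :content, :start, :stop

  def initialize(content: " ", start: 0, stop: 0)
    @content = content
    @start = start
    @stop = stop
  end

  # Remove up to `start` leading and up to `stop` trailing copies of
  # `content` from each token (assumed behavior for this sketch).
  def decode_chain(tokens)
    tokens.map do |token|
      chars = token.chars
      lead = chars.first(start).take_while { |c| c == content }.length
      rest = chars.drop(lead)
      trail = rest.last(stop).reverse.take_while { |c| c == content }.length
      rest.first(rest.length - trail).join
    end
  end
end

decoder = StripSketch.new(content: "_", start: 1, stop: 1)
decoder.stop = 0 # the setter uses the same name as the constructor keyword
```

With a single pair of names, a value passed as start: or stop: at construction can be read back or changed through an accessor of the same name.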

Runs green on my fork.

petergoldstein commented 1 year ago

Also added the Prepend normalizer.

ankane commented 1 year ago

Looks great, thanks @petergoldstein

petergoldstein commented 1 year ago

I think that's the second time I've gotten that character wrong...

ankane commented 1 year ago

Ha, all good