ankane / tokenizers-ruby

Fast state-of-the-art tokenizers for Ruby
Apache License 2.0

Adds a number of Python library equivalent functions to the Ruby interface #9

Closed: petergoldstein closed this 1 year ago

petergoldstein commented 1 year ago

This PR adds the following methods to Tokenizer:

and the following methods to Encoding:

The Python Tokenizer and Encoding bindings were used as a reference.

There are some additional updates that I'd like to make as a follow-up. Most notably:

  1. Updating the encode method to better match the complete signature in the Python bindings, including support for pair and pretokenized input
  2. Adding some version of encode_batch, as the sequence methods don't really serve a purpose without it. We can't easily get the parallelism benefit, though.

But this batch seemed pretty straightforward on its own and of reasonable benefit.
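
To illustrate the second follow-up item, here is a minimal sketch of what a sequential encode_batch could look like. It is not the gem's implementation: `StubTokenizer`, `StubEncoding`, and the free-standing `encode_batch` helper are hypothetical stand-ins, and the only assumption about a real tokenizer is that it responds to `encode(text)`. Inputs are processed one at a time, so it offers none of the parallelism benefit mentioned above.

```ruby
# Hypothetical stand-in for an encoding result; not the gem's Encoding class.
StubEncoding = Struct.new(:tokens, :ids)

# Hypothetical stand-in for a tokenizer; only #encode is assumed.
class StubTokenizer
  def initialize
    @vocab = {}
  end

  # Splits on whitespace and assigns ids in order of first appearance.
  # This is a placeholder, not a real tokenization algorithm.
  def encode(text)
    tokens = text.split
    ids = tokens.map { |t| @vocab[t] ||= @vocab.size }
    StubEncoding.new(tokens, ids)
  end
end

# Sequential batch encoding: one encode call per input, in order.
def encode_batch(tokenizer, inputs)
  inputs.map { |text| tokenizer.encode(text) }
end

encodings = encode_batch(StubTokenizer.new, ["hello world", "hello again"])
encodings.each { |e| puts "#{e.tokens.inspect} #{e.ids.inspect}" }
```

A helper like this keeps the calling code uniform for single and multiple inputs, which is what would make the sequence-related methods useful; a native batch call could later replace the `map` without changing callers.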

ankane commented 1 year ago

Awesome, thanks again @petergoldstein! The follow-up changes sound good as well.