JoeZijunZhou closed this issue 4 months ago
Just chiming in here that the customer quoted is us :). The main challenge is that we have clients in multiple languages that don't always have tokenizer implementations readily available. Every other prominent model server performs detokenization server-side, hence the request.
Doesn’t hurt that there are so many CPU cores on the TPU VMs that are mostly idle during inference anyway.
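For anyone skimming the thread, server-side detokenization just means the server maps generated token IDs back into text before returning the response, so clients never need a tokenizer of their own. A minimal toy sketch of the idea (the vocabulary and function here are hypothetical illustrations, not the server's actual implementation, which would use the model's real tokenizer):

```python
# Toy illustration of server-side detokenization: map generated token IDs
# back to text on the server so clients receive plain strings.
# TOY_VOCAB is a made-up stand-in for a real tokenizer vocabulary.

TOY_VOCAB = {0: "Hello", 1: ",", 2: " world", 3: "!"}

def detokenize(token_ids):
    """Join the token strings for a sequence of token IDs."""
    return "".join(TOY_VOCAB[i] for i in token_ids)

# The server would call this on the sampled IDs before responding:
print(detokenize([0, 1, 2, 3]))  # -> Hello, world!
```

A real implementation is more involved (byte-level BPE merges, handling partial UTF-8 sequences during streaming), but the client-facing benefit is the same: responses arrive as text.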
Thanks @JoeZijunZhou !
Resolved this issue in #78. It's available in main.