eliben / go-sentencepiece

Go implementation of the SentencePiece tokenizer
Apache License 2.0

how do i decode? #2

Closed josharian closed 2 months ago

josharian commented 2 months ago

I see an Encoder, but to use the output of a gemma model I need to Decode too.

It might be better to rename Encoder to Model or the like, so it can do decoding as well.

josharian commented 2 months ago

In case it is helpful, here's my current implementation using cgo.

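// Decode converts token IDs back into text by calling the C++ SentencePiece
// processor through cgo.
func (m *Model) Decode(tokens []uint32) string {
    if len(tokens) == 0 {
        return ""
    }
    // Pass a pointer to the first element; the C side only reads it during the call.
    cres := C.decode(m.proc, (*C.uint32_t)(&tokens[0]), (C.size_t)(len(tokens)))
    if cres == nil {
        panic("failed to decode")
    }
    // The C side allocates the result with strdup, so copy it and then free it.
    out := C.GoString(cres)
    C.free(unsafe.Pointer(cres))
    return out
}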
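
// Returns a malloc'd C string that the caller must free, or NULL on failure.
char *decode(struct cspm_processor *proc, const uint32_t *input, size_t length) {
  std::vector<std::string> pieces;
  pieces.reserve(length);
  for (size_t i = 0; i < length; i++) {
    pieces.push_back(proc->processor.IdToPiece(input[i]));
  }
  std::string result;
  // Report a failed decode as NULL so the Go caller's nil check can fire.
  if (!proc->processor.Decode(pieces, &result).ok()) {
    return nullptr;
  }
  return strdup(result.c_str());
}
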
eliben commented 2 months ago

Understood. Decoding wasn't really in scope when I set to work on this; my goal was to implement the encoder. Decoding would definitely require some additional work.

eliben commented 2 months ago

Added a decoder in v0.4.0 and renamed Encoder to Processor, matching the name used in the C++ and Python implementations.

https://pkg.go.dev/github.com/eliben/go-sentencepiece#Processor.Decode

Please reopen if needed.