eliben / go-sentencepiece

Go implementation of the SentencePiece tokenizer
Apache License 2.0

add api to get vocab size #3

Closed josharian closed 2 months ago

josharian commented 2 months ago

For me to use this, I need a way to get the vocab size.

josharian commented 2 months ago

My cgo implementation just calls:

    int vocab_size(struct cspm_processor *proc) {
      return proc->processor.GetPieceSize();
    }

eliben commented 2 months ago

You don't really need this package for that, though? You can just unmarshal the model, e.g. this part of the constructor:

    b, err := os.ReadFile(protoFile)
    if err != nil {
        return nil, fmt.Errorf("unable to read %q: %v", protoFile, err)
    }

    var mp ModelProto
    err = proto.Unmarshal(b, &mp)
    if err != nil {
        return nil, fmt.Errorf("unable to unmarshal %q: %v", protoFile, err)
    }

Then access whatever fields of the proto you need.
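
For illustration only, here is a minimal sketch of the same idea that avoids generated protobuf code entirely and uses just the standard library. It assumes that the vocabulary is stored as repeated field number 1 (`pieces`) of ModelProto, as in the upstream sentencepiece_model.proto, and that the number of those entries is what GetPieceSize reports. `countPieces` is a hypothetical helper written for this example, not part of go-sentencepiece.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// countPieces walks the top-level fields of a serialized ModelProto and
// counts the length-delimited entries with field number 1.
// Assumption: field 1 is `repeated SentencePiece pieces`.
func countPieces(b []byte) (int, error) {
	count := 0
	for len(b) > 0 {
		// Each field starts with a varint tag: (field number << 3) | wire type.
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return 0, errors.New("malformed field tag")
		}
		b = b[n:]
		field, wireType := tag>>3, tag&7

		switch wireType {
		case 0: // varint: skip the value
			_, n := binary.Uvarint(b)
			if n <= 0 {
				return 0, errors.New("malformed varint")
			}
			b = b[n:]
		case 1: // 64-bit fixed-width value
			if len(b) < 8 {
				return 0, errors.New("truncated 64-bit field")
			}
			b = b[8:]
		case 2: // length-delimited: nested message, string, or bytes
			size, n := binary.Uvarint(b)
			if n <= 0 || size > uint64(len(b)-n) {
				return 0, errors.New("malformed length-delimited field")
			}
			if field == 1 {
				count++
			}
			// Skip the payload without descending into it.
			b = b[n+int(size):]
		case 5: // 32-bit fixed-width value
			if len(b) < 4 {
				return 0, errors.New("truncated 32-bit field")
			}
			b = b[4:]
		default:
			return 0, fmt.Errorf("unsupported wire type %d", wireType)
		}
	}
	return count, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: vocabsize <model-file>")
		os.Exit(2)
	}
	b, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintf(os.Stderr, "unable to read %q: %v\n", os.Args[1], err)
		os.Exit(1)
	}
	n, err := countPieces(b)
	if err != nil {
		fmt.Fprintf(os.Stderr, "unable to parse %q: %v\n", os.Args[1], err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```

With generated bindings for the proto, the equivalent would be the length of the pieces slice on the unmarshaled message; the hand-rolled walk above only shows that nothing beyond the file itself is needed.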

josharian commented 2 months ago

Right. But per #4, I was hoping to massage this into a cleaner top-level API.

eliben commented 2 months ago

I have to think some more about this.

What you call "vocabulary size" is just one configuration option that can be obtained from the proto. There are a bunch of other metrics that could be exposed, but once you go down that path, why not expose the whole ModelProto? However, that feels silly, because one doesn't need this package (go-sentencepiece) just to grab a ModelProto from a protobuf file one already has access to; one can read the file with the protobuf package itself.

eliben commented 2 months ago

The just-released v0.4.0 has a VocabularySize method: https://pkg.go.dev/github.com/eliben/go-sentencepiece#Processor.VocabularySize

Please reopen if needed.

eliben commented 2 months ago

Please note that in v0.5.0 there's been a slight API change and this information is now available in https://pkg.go.dev/github.com/eliben/go-sentencepiece@v0.5.0#Processor.ModelInfo