josharian closed this 3 months ago
My cgo implementation just calls:
    int vocab_size(struct cspm_processor *proc) {
        return proc->processor.GetPieceSize();
    }
You don't really need this package for that, though? You can just unmarshal the model, e.g. this part of the constructor:
    b, err := os.ReadFile(protoFile)
    if err != nil {
        return nil, fmt.Errorf("unable to read %q: %v", protoFile, err)
    }
    var mp ModelProto
    err = proto.Unmarshal(b, &mp)
    if err != nil {
        return nil, fmt.Errorf("unable to unmarshal %q: %v", protoFile, err)
    }
And then access whatever fields of the proto you need.
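For the vocabulary size specifically, here is a rough sketch of the same idea that avoids generated proto code altogether. It assumes that `pieces` is field number 1 of `ModelProto`, as in the upstream `sentencepiece_model.proto`, and that the vocabulary size equals the number of `pieces` entries. `countPieces` is a hypothetical helper, not part of go-sentencepiece, and the sample bytes in `main` are made up for illustration; a real caller would pass the contents of the model file instead.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// countPieces walks the top-level fields of a serialized ModelProto and
// counts length-delimited occurrences of field number 1.
// Assumption: field 1 is `repeated SentencePiece pieces`.
func countPieces(b []byte) (int, error) {
	n := 0
	for len(b) > 0 {
		// Each field starts with a varint key: (field number << 3) | wire type.
		key, k := binary.Uvarint(b)
		if k <= 0 {
			return 0, errors.New("malformed field key")
		}
		b = b[k:]
		field, wire := key>>3, key&7

		switch wire {
		case 0: // varint payload
			_, k := binary.Uvarint(b)
			if k <= 0 {
				return 0, errors.New("malformed varint")
			}
			b = b[k:]
		case 1: // fixed 64-bit payload
			if len(b) < 8 {
				return 0, errors.New("truncated fixed64")
			}
			b = b[8:]
		case 2: // length-delimited payload (strings, bytes, nested messages)
			l, k := binary.Uvarint(b)
			if k <= 0 || uint64(len(b)-k) < l {
				return 0, errors.New("truncated length-delimited field")
			}
			if field == 1 {
				n++
			}
			b = b[k+int(l):]
		case 5: // fixed 32-bit payload
			if len(b) < 4 {
				return 0, errors.New("truncated fixed32")
			}
			b = b[4:]
		default:
			return 0, fmt.Errorf("unsupported wire type %d", wire)
		}
	}
	return n, nil
}

func main() {
	// Made-up sample: two entries in field 1 followed by an empty field 2.
	sample := []byte{0x0a, 0x01, 'a', 0x0a, 0x01, 'b', 0x12, 0x00}
	n, err := countPieces(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pieces:", n)
}
```

Unmarshaling into the generated `ModelProto` type and taking the length of its pieces slice is the simpler route when the generated code is importable; this variant only shows that the count can be read from the file directly.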
Right. But per #4, I was hoping to massage this into a cleaner top-level API.
I have to think some more about this.
What you call "vocabulary size" is just one configuration option that can be obtained from the proto. There's a bunch of other metrics that could be exposed, but once you go down that path, why not expose the whole ModelProto? However, this feels silly, because one doesn't need this package (go-sentencepiece) just to grab a ModelProto from a protobuf file you have access to anyway; one can read the protobuf using the protobuf package itself.
The just-released v0.4.0 has a VocabularySize method: https://pkg.go.dev/github.com/eliben/go-sentencepiece#Processor.VocabularySize
Please reopen if needed.
Please note that in v0.5.0 there's been a slight API change and this information is now available in https://pkg.go.dev/github.com/eliben/go-sentencepiece@v0.5.0#Processor.ModelInfo
For me to use this, I need a way to get the vocab size.