bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!
Apache License 2.0

Can/How StarCoder model can be used for encoding? #55

Open Symbolk opened 1 year ago

Symbolk commented 1 year ago

Besides the well-known ChatGPT, more and more startups and researchers are noting the great value and potential of the OpenAI embedding API (https://platform.openai.com/docs/guides/embeddings). It enables many domain-specific adaptations and applications, such as LlamaIndex, soft prompting, retrieval-augmented generation, etc.

Therefore, I wonder whether StarCoder can be used for encoding. If the answer is yes, how should we make it usable: by modifying the network layers, or solely the inference code?

I know there is StarEncoder (~125M); is it already OK for encoding?

xpl commented 1 year ago

I believe as it's a decoder-only architecture, you can't encode with it?

But correct me if I'm wrong.

realfenston commented 1 year ago

Any luck with this?

ramsey-coding commented 1 year ago

@dpfried @lvwerra can you please help?

lvwerra commented 1 year ago

You can always get the hidden states of the model and use those as embeddings. We have never benchmarked how good they are for the decoder but @joaomonteirof has benchmarked the encoder models a bit!
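
For anyone who wants to try this, here is a minimal sketch of pulling hidden states out of the model with `transformers` and pooling them into one vector per input. Several things here are assumptions rather than anything confirmed in this thread: the checkpoint name `bigcode/starcoder` (which may require accepting the model license on the Hub), the choice of masked mean pooling over the last layer, reusing the EOS token for padding, and the helper names `embed` and `cosine_similarity`. As noted above, the quality of these embeddings has not been benchmarked.

```python
import math


def embed(texts, checkpoint="bigcode/starcoder"):
    """Return one embedding (a list of floats) per input string.

    Assumption: masked mean pooling over the last layer's hidden states.
    Other pooling choices (e.g. last token) are equally plausible.
    """
    # Imported lazily so cosine_similarity below works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    if tokenizer.pad_token is None:
        # Assumption: reuse EOS for padding when no pad token is defined.
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()

    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**batch)

    hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
    # Zero out padded positions before averaging over the sequence.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled.tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

To compare two snippets, you could call `vecs = embed([code_a, code_b])` and then `cosine_similarity(vecs[0], vecs[1])`. Loading the full model needs a lot of memory, so a smaller checkpoint may be more practical for a first experiment.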

joaomonteirof commented 1 year ago

I think StarCoder's top layer hidden states could work well. For StarEncoder, we did some code-to-code retrieval evaluations after pre-training and results were quite promising. Relevant discussions on how to get chunk-level embeddings here: