CodingTrain / Bizarro-Devin


Incorporate Llama Model #25

Closed: shiffman closed this 5 months ago

shiffman commented 5 months ago

I am going to first try running the agent with a local version of llama running via https://ollama.com/.

However, an ideal solution might be to use transformers.js and run the model directly in node.js. One reason to use llama is that I know how to fine-tune the model, which I may ultimately want to do.
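
For reference, a minimal sketch of what querying a local ollama server from node.js might look like. This assumes ollama's default HTTP API on port 11434 and a model that has already been pulled; the model name and prompt are placeholders:

```js
// Minimal sketch: one-shot (non-streaming) request to a local ollama server.
// Assumes ollama is running on its default port with a model pulled locally.
const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama2', // placeholder model name
    prompt: 'Write a p5.js sketch that draws a bouncing circle.',
    stream: false, // request one complete reply instead of a token stream
  }),
});
const data = await res.json();
console.log(data.response); // the generated text
```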

shiffman commented 5 months ago

This is ready for an initial review. It works! However, there is a lot to determine:

  1. Right now I am streaming from the model. The stream is actually faster than the "realistic typing" delay, so I may want to slow it down (see the throttling sketch after this list).
  2. I should probably ask the model to tell me whether it's outputting code or narration, maybe by asking it to output JSON. Another option would be to use two different models, for example code-llama for the code.
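
For the pacing question in point 1, one simple option is to throttle on the consumer side. A rough sketch, assuming Node 18+ (global fetch) and ollama's newline-delimited JSON streaming format; the delay constant and prompt are made up, and for simplicity it assumes each chunk arrives as whole lines:

```js
// Rough sketch: pace a streamed ollama reply down to "typing" speed.
const TYPING_DELAY_MS = 50; // made-up value, tune to the typing effect
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama2', // placeholder model name
    prompt: 'Explain what setup() does in p5.js.',
    stream: true, // newline-delimited JSON objects, one per token batch
  }),
});

const decoder = new TextDecoder();
for await (const chunk of res.body) {
  for (const line of decoder.decode(chunk).split('\n').filter(Boolean)) {
    const { response: token } = JSON.parse(line);
    if (!token) continue; // final object has done: true and no text
    for (const char of token) {
      process.stdout.write(char); // stand-in for the typing animation
      await sleep(TYPING_DELAY_MS);
    }
  }
}
```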

Would love any initial thoughts or advice on what I have so far!

@dipamsen are you able to run ollama, or not because you are on a PC?

dipamsen commented 5 months ago

@shiffman I can run ollama (there's a preview version for windows), but it runs very slowly...

dipamsen commented 5 months ago

Regarding distinguishing between code and narration, this needs to be done by prompt engineering, such that the model uses specific markers (e.g. [SPEECH] and [CODE]) to separate them. Alternatively, responding with JSON could also work. Using two models probably won't be ideal, as it may cause incoherence between the speech and the code.
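
To make that concrete, the prompt might look something like this (purely illustrative wording; only the marker names come from the suggestion above):

```js
// Illustrative system prompt asking the model to tag everything it emits.
const systemPrompt = `You are narrating while live-coding a p5.js sketch.
Prefix every spoken sentence with [SPEECH] and every block of code with [CODE].
For example:
[SPEECH] First, let's set up the canvas.
[CODE] function setup() { createCanvas(400, 400); }`;
```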

shiffman commented 5 months ago

> Regarding distinguishing between code and narration, this needs to be done by prompt engineering, such that the model uses specific markers (e.g. [SPEECH] and [CODE]) to separate them. Alternatively, responding with JSON could also work. Using two models probably won't be ideal, as it may cause incoherence between the speech and the code.

Agreed! I've had success with this kind of prompt engineering before (ShiffBot does similar kinds of things), but one thing I'm unsure how to approach is parsing the results while "streaming" the response from the model. (Without streaming there is a lot of latency before a reply comes in.)

supercrafter100 commented 5 months ago

If the response always starts with a marker for its type (e.g. [SPEECH]), you can keep a buffer of, say, the last 10 characters. On every character, check whether the buffer contains one of the markers, and switch what you do with the following output accordingly.
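
Something like the following sketch, perhaps. It holds back characters that might be the start of a marker so the marker text itself is never emitted; the marker names follow the earlier suggestion, and `emit` is a hypothetical stand-in for routing output to speech vs. the editor:

```js
// Sketch of the rolling-buffer idea: watch the character stream for markers
// and switch modes when one appears.
const MARKERS = { '[SPEECH]': 'speech', '[CODE]': 'code' };

let pending = ''; // characters held back as a possible marker prefix
let mode = 'speech';

const couldBeMarker = (text) =>
  Object.keys(MARKERS).some((m) => m.startsWith(text));

function handleCharacter(char) {
  if (pending.length > 0) {
    const candidate = pending + char;
    if (MARKERS[candidate]) {
      mode = MARKERS[candidate]; // full marker seen: switch mode, swallow it
      pending = '';
      return;
    }
    if (couldBeMarker(candidate)) {
      pending = candidate; // still a plausible marker prefix: keep holding
      return;
    }
    emit(mode, pending); // false alarm: release the held characters
    pending = '';
  }
  if (char === '[') {
    pending = char; // possible start of a marker
    return;
  }
  emit(mode, char);
}

function emit(mode, text) {
  // Hypothetical stand-in: route to speech synthesis or the editor by mode.
  process.stdout.write(text);
}
```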