All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More
https://all-hands.dev
MIT License

WebSocket API #44

Closed yimothysu closed 6 months ago

yimothysu commented 6 months ago

It seems to me that the frontend primarily shows the user what OpenDevin is doing, for visibility. The actual agent is implemented on the backend.

We'll therefore want to stream a lot of information from the backend to the frontend via WebSockets and/or Server-Sent Events. Each module of OpenDevin should receive its own events.

Below is a draft of what the events for such a WebSocket API might look like.

Terminal

The terminal event writes to the terminal. terminal.write(...) is a function in xterm.js, so we can forward terminal escape sequences directly from the backend to the frontend. The payload might look like

{
    "content": "\u001b[1;3;31mOpenDevin\u001b[0m $"
}
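
As a rough sketch of how a backend might build this payload (the function name `terminal_event` is hypothetical, not part of OpenDevin): JSON has no `\x` escape, so a standard encoder emits the escape character as `\u001b`, and the frontend gets the original byte back after parsing.

```python
import json

ESC = "\x1b"  # ANSI escape character that starts each terminal sequence


def terminal_event(content: str) -> str:
    """Serialize a terminal payload (hypothetical helper, for illustration only)."""
    # json.dumps escapes control characters, so ESC is emitted as \u001b.
    return json.dumps({"content": content})


# Bold, italic, red "OpenDevin", then reset attributes and print a prompt.
message = terminal_event(f"{ESC}[1;3;31mOpenDevin{ESC}[0m $ ")
print(message)
```

After the frontend parses the message, the decoded string can be handed to terminal.write(...) unchanged.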

Planner

The planner event writes to the planner in Markdown format, which the frontend renders. We could reuse the same payload as the code endpoint below, since the planner state can be represented as a single .md file.

Code

The code event streams code, which the frontend renders syntax-highlighted in a code editor. The code may be stored in a string array, where each element is a line of code. The payload might look like

{
    "line": 109,
    "change": "INSERT",
    "content": [
        "with open(\"tmp.txt\") as f:",
        "\tcontent = f.read()"
    ]
}
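
A minimal sketch of how a receiver could apply such a change to the string array. The semantics are assumptions, since the draft does not pin them down: "line" is treated as 1-indexed, INSERT places the content before that line, and DELETE uses a hypothetical "count" field for the number of lines to remove.

```python
from typing import Any


def apply_change(lines: list[str], change: dict[str, Any]) -> list[str]:
    """Return a new list of lines with one change applied.

    Assumptions (not specified in the draft): "line" is 1-indexed, INSERT
    places content before that line, and DELETE removes "count" lines
    starting at that line ("count" is a hypothetical field).
    """
    index = change["line"] - 1
    if not 0 <= index <= len(lines):
        raise ValueError(f"line {change['line']} is out of range")
    if change["change"] == "INSERT":
        return lines[:index] + list(change["content"]) + lines[index:]
    if change["change"] == "DELETE":
        return lines[:index] + lines[index + change["count"]:]
    raise ValueError(f"unknown change type: {change['change']}")


source = ["import os", "print(content)"]
edit = {
    "line": 2,
    "change": "INSERT",
    "content": ['with open("tmp.txt") as f:', "\tcontent = f.read()"],
}
updated = apply_change(source, edit)
print("\n".join(updated))
```

Returning a new list instead of mutating in place keeps the previous state available, which may be convenient if the frontend wants to animate or undo edits.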

Browser

The navigate event navigates to a URL and sends a screenshot every second (or on every page change). The frontend displays this URL and screenshot.

It's possible to render an <iframe />, but 1) this seems unnecessary because the backend already needs to access pages via Selenium, and 2) it can have security/reliability issues (such as CORS).

The payload might look like

{
    "url": "https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html",
    "screenshot": "data:image/png;base64, ..."
}
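
As an illustration of the screenshot field, here is a small sketch that wraps raw PNG bytes in a data URL. The helper name `browser_event` is hypothetical, and the eight-byte PNG signature stands in for a real screenshot.

```python
import base64
import json


def browser_event(url: str, png_bytes: bytes) -> str:
    """Serialize a browser payload with the screenshot as a base64 data URL."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return json.dumps({"url": url, "screenshot": f"data:image/png;base64,{encoded}"})


# Placeholder bytes: only the PNG file signature, not a complete image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
event = json.loads(browser_event("https://pytorch.org/docs/stable/", PNG_SIGNATURE))
print(event["screenshot"])
```

The frontend can assign the resulting string directly to an image element's src attribute.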

enyst commented 6 months ago

What do you think about using the diff patch format for code changes? Since SWE-bench seems to require results in diff format, it would allow us to reuse it. On the other hand, our frontend may handle the format you suggest more easily.

yimothysu commented 6 months ago

Using the diff patch format is possible, but requires more preprocessing. We can also run SWE-bench headless and use git to generate diffs.
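
To give a sense of the preprocessing involved, here is a sketch that produces a unified diff from before and after line arrays. It uses Python's standard difflib module rather than git, purely for illustration; the file name and the a/ and b/ prefixes are assumptions that mimic git's output.

```python
import difflib


def to_unified_diff(before: list[str], after: list[str], path: str) -> str:
    """Build a unified diff between two versions of a file's lines."""
    diff = difflib.unified_diff(
        before,
        after,
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",  # the input lines carry no trailing newlines
    )
    return "\n".join(diff)


before = ["import os", "print(content)"]
after = ["import os", 'with open("tmp.txt") as f:', "\tcontent = f.read()", "print(content)"]
patch = to_unified_diff(before, after, "example.py")
print(patch)
```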

rbren commented 6 months ago

I like the idea of using code to do the markdown plan. The agent tries to write markdown sometimes anyways--if we can just tell it to always use DevinPlan.md, that will kill two birds with one stone.

rbren commented 6 months ago

I also like the idea of file read/write going over the wire, instead of the agent editing files directly (which is currently what my agent implementation does).

For browse, IMO we'll get a lot more mileage by sending HTML instead of screenshots. 1 screenshot per second would be a lot to process.

rbren commented 6 months ago

For code edits, we'll probably also want to be able to replace a range of lines. I.e. "replace lines 60-100 with this new code"

rbren commented 6 months ago

FYI: I have an implementation of the websocket handshake here (but with zero of the operations above): https://github.com/OpenDevin/OpenDevin/pull/57

yimothysu commented 6 months ago

@rbren

> For browse, IMO we'll get a lot more mileage by sending HTML instead of screenshots. 1 screenshot per second would be a lot to process.

Totally, I should have specified this is for server-to-frontend communication. The server (or perhaps agent) should spin up a Selenium instance. The HTML is sent from Selenium to the agent, while a screenshot of the current webpage in Selenium should be sent to the frontend on each page change.


> For code edits, we'll probably also want to be able to replace a range of lines. I.e. "replace lines 60-100 with this new code"

This is equivalent to a 40-line DELETE followed by an INSERT. Do you think we should have an explicit REPLACE change type?

rbren commented 6 months ago

> Totally, I should have specified this is for server-to-frontend communication. The server (or perhaps agent) should spin up a Selenium instance. The HTML is sent from Selenium to the agent, while a screenshot of the current webpage in Selenium should be sent to the frontend on each page change.

Awesome, agree. pyppeteer could be worth investigating.

> This is equivalent to a 40-line DELETE followed by an INSERT. Do you think we should have an explicit REPLACE change type?

As far as what the LLM will do, it'll be much easier for it to REPLACE than do a DELETE followed by an INSERT--at least if we're limiting it to 1 action per prompt (currently the case, but up for discussion)

Though we could always "translate" an LLM replace command into a delete+insert
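
A sketch of what that translation might look like. All field names here ("start", "end", "count") are hypothetical, the range is assumed to be 1-indexed and inclusive, and DELETE is assumed to take a line count.

```python
from typing import Any


def translate_replace(replace: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand a REPLACE of lines start..end (inclusive) into DELETE + INSERT."""
    start, end = replace["start"], replace["end"]
    if start > end:
        raise ValueError("start must not exceed end")
    return [
        # Remove the old range first, so the INSERT lands at the same line.
        {"line": start, "change": "DELETE", "count": end - start + 1},
        {"line": start, "change": "INSERT", "content": list(replace["content"])},
    ]


steps = translate_replace(
    {"change": "REPLACE", "start": 2, "end": 3, "content": ["x = 1"]}
)
for step in steps:
    print(step)
```

With this approach the LLM still issues a single action per prompt, while the wire format only needs the two primitive change types.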

yimothysu commented 6 months ago

Makes sense, either way seems reasonable to me!

rbren commented 6 months ago

I have a first pass at a websocket API here: https://github.com/OpenDevin/OpenDevin/pull/97

The client opens a websocket, and then the client and server pass JSON messages back and forth. Both client and server messages have the same format, shown in the examples below.

A typical flow would look like this:

User:

{"action": "start", "args": {"task": "write a bash script that prints hello"}}

Server:

{"message":"Starting new agent..."}
{"action":"run","message":"Running command: ls","args":{"command":"ls"}}
{"action":"output","message":"Got output.","args":{"output":"LICENSE\nOpenDevinLogo.jpg\nREADME.md\nagenthub\nenv_name\nevaluation\nfrontend\nhello.sh\nopendevin\nrequirements.txt\nserver\nworkspace\n"}}
{"action":"read","message":"Reading file: hello.sh","args":{"path":"hello.sh"}}
{"action":"output","message":"Got output.","args":{"output":"#!/bin/bash\necho \"hello\""}}
{"action":"run","message":"Running command: bash hello.sh","args":{"command":"bash hello.sh"}}
{"action":"output","message":"Got output.","args":{"output":"hello\n"}}
{"action":"think","message":"I've successfully executed the bash script hello.sh which printed 'hello'. My primary task is complete. It's time to finalize my work.","args":{"thought":"I've successfully executed the bash script hello.sh which printed 'hello'. My primary task is complete. It's time to finalize my work."}}
{"action":"finish","message":"Finished!","args":{}}

The user can also issue actions like

{"action":"run", "args":{"command":"git commit -a -m 'save work'"}}
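
For illustration, a receiver on either side could normalize these messages as sketched below. The field names are taken from the examples in this comment; treating all three fields as optional is an assumption, and `parse_message` is a hypothetical helper rather than code from the PR.

```python
import json
from typing import Any


def parse_message(raw: str) -> dict[str, Any]:
    """Decode one websocket message into action, message, and args.

    Treating every field as optional is an assumption based on the
    examples, where the first server message carries only "message".
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("message must be a JSON object")
    args = data.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    return {
        "action": data.get("action"),
        "message": data.get("message", ""),
        "args": args,
    }


user_message = parse_message(
    '{"action": "start", "args": {"task": "write a bash script that prints hello"}}'
)
server_message = parse_message('{"message": "Starting new agent..."}')
print(user_message["action"], user_message["args"]["task"])
print(server_message["action"], server_message["message"])
```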

rbren commented 6 months ago

I think we can close this one now