ajbouh opened 6 months ago
Although the assistant commands depend on #65 (and indirectly #64), the other parts can be started immediately, and I would suggest prioritizing my time towards this.
Today I got started on this. The first thing was getting chromestage's debug service exposed and usable outside the container. My first attempt was to just expose the debug port in the Docker container, but the service would accept a connection and then immediately reset it. I thought maybe it wasn't listening on 0.0.0.0, but changing that made no difference, and we'd see connection refused if that were the case. The Go process inside the container was able to connect fine, so I just made it proxy connections, and that worked. Weird, but we'd need a proxy anyway for VNC.
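The in-container proxy here is a Go program, but the pattern is generic enough to sketch: accept a connection, dial the real service, and splice bytes both ways. Here's a minimal TypeScript (node) sketch of the same idea, with an echo server standing in for the debug service (ports, names, and structure are all invented for illustration):

```typescript
import * as net from "node:net";

// Echo server standing in for the in-container debug service that
// only behaves when dialed from inside the container.
const backend = net.createServer((sock) => sock.pipe(sock));

// The proxy: for each client connection, dial the backend and
// splice bytes in both directions.
function startProxy(targetPort: number): net.Server {
  return net.createServer((client) => {
    const upstream = net.connect(targetPort, "127.0.0.1");
    client.pipe(upstream);
    upstream.pipe(client);
    client.on("error", () => upstream.destroy());
    upstream.on("error", () => client.destroy());
  });
}

// Demo: send a line through the proxy and read the echo back.
const received: string = await new Promise((resolve) => {
  backend.listen(0, "127.0.0.1", () => {
    const backendPort = (backend.address() as { port: number }).port;
    const proxy = startProxy(backendPort);
    proxy.listen(0, "127.0.0.1", () => {
      const proxyPort = (proxy.address() as { port: number }).port;
      const client = net.connect(proxyPort, "127.0.0.1");
      client.on("data", (buf) => {
        resolve(buf.toString());
        client.end();
        proxy.close();
        backend.close();
      });
      client.write("ping");
    });
  });
});
console.log("round trip:", received);
```

Presumably this works because the dial to the real port happens from the same place as the service itself, sidestepping whatever was resetting connections from outside.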
I wanted to use VNC for the display in bridge, so I made sure noVNC worked with the in-container VNC service. It worked fine using its websocket proxy. I'm pretty sure this is literally tunneling TCP over WebSocket, so it'll be easy to roll into the chromestage service. With that, both VNC and chromedp can be used over HTTP, which sets us up to run on substrate behind the gateway.
I asked about how we set up "stateful" instances of containers and it seems I have a path forward on that, so Monday I can start getting this into the Substrate repo. I'll just make bridge start up with a chromestage instance with an env var set, which will spawn the container and show the VNC display on the page. I'll probably add some dumb voice-activated commands (not using the LLM chooser) to control the browser, and then we should have a working, integrated proof of concept to mess around with, ready to eventually be controlled by assistant commands with #65 ...
The basic idea (still finishing the commit) is to declare that your service uses a "space":

```cue
package defs

enable: "gotty": true

"lenses": "gotty": {
	spawn: {
		...
		parameters: {
			data: {
				type: "space"
			}
		}
	}
}
```
In this case, when you visit `/gw/gotty`, it will see that `data` is expected but unassigned, so it will allocate a fresh directory and then redirect you to (for example) `/gw/gotty[data=sp-01HP5MSFQFXHH838EVBWV5P47R]/`.
Inside of the container, we mount that space at `/spaces/data`, and you can write to `/spaces/data/tree`.
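The allocate-and-redirect step can be pictured as a small routing function. This is a hypothetical sketch of the behavior described above, not the actual substrate gateway code (the id format is also simplified):

```typescript
import { randomBytes } from "node:crypto";

// Parameter spec, shaped like the lens definition above: a spawn
// parameter of type "space" that the gateway must fill in.
type ParamSpec = Record<string, { type: string }>;

// Given a request path like /gw/gotty and the lens's declared
// parameters, allocate ids for any unassigned "space" parameters and
// return the path to redirect to.
function resolveSpawnPath(path: string, params: ParamSpec): string {
  const assignments: string[] = [];
  for (const [name, spec] of Object.entries(params)) {
    if (spec.type === "space") {
      // Stand-in for allocating a fresh directory and minting an id.
      const id = "sp-" + randomBytes(8).toString("hex").toUpperCase();
      assignments.push(`${name}=${id}`);
    }
  }
  if (assignments.length === 0) return path;
  return `${path}[${assignments.join(",")}]/`;
}

const redirect = resolveSpawnPath("/gw/gotty", { data: { type: "space" } });
console.log(redirect);
```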
So for chromestage, we would expect to keep the profile directory in a space, which would make it possible to have bookmarks and tabs that we can come back to.
There's also a concept of spawn parameters that are environment variables, so some creativity is possible with that approach.
@ajbouh https://github.com/ajbouh/substrate/pull/139
besides using substrate to launch, which i might not be able to test until i get back in about a week, there are some minor things to fix up here. i can just make it pluggable so it can launch a subprocess for now and let a substrate launcher replace that. this way i can get it to be per session and only activate when triggered by a command. the commands are also all dumb text processing on transcriptions, but to take this to the next level with proper API exposed to assistants, we need to get #65 in.
there are also a few quirks where it doesn't always load pages as i mentioned in chat. these are mostly all docker or container configuration things.
in the meantime, it could be worth making sure we can stream webrtc to appletv/chromecast and making sure we have the setup workflow for those working.
Now that we have a version of chromestage that can run within substrate, is viewable in browser, and has a working chromedp interface, we need to make some decisions about how bridge can productively interact with things via chromestage.
There are a few naive options:
Neither of these is great, for the same underlying reason: they tightly couple the behavior of the service to bridge's internal logic.
Instead, there may be a useful tradeoff somewhere between the two.
`window.substrate`

We can introduce a top-level global inside every page called `window.substrate`. This global will be set before page load is complete. Bridge can interact with this top-level global to learn about the page and take page-specific commands.
Here's what I'm thinking for the "interface" of this top-level global:
```typescript
interface SubstrateCommandField {
  description: string
  type: 'string' | 'number' // for now we only accept strings or numbers
}

interface SubstrateCommand {
  description: string
  parameters: Record<string, SubstrateCommandField>
  returns: Record<string, SubstrateCommandField>
}

interface SubstrateGlobal {
  // Wrap everything in a top-level r0 object, so we can experiment with
  // multiple interfaces at once without breaking everything.
  r0: {
    command: {
      // This object is intentionally JSON-able.
      index: Record<string, SubstrateCommand>
      run: (commandName: string, parameters: Record<string, SubstrateCommandField>) => Record<string, SubstrateCommandField>
    }
  }
}
```
To use this, bridge could interact with any application within substrate or on the greater web by reading `window.substrate.r0.command.index` and calling `window.substrate.r0.command.run(...)`.
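As a concrete sketch of how a page might satisfy this interface, here's a hypothetical notes page exporting a single `set_text` command. Everything beyond the interface shape itself (the command name, handler, and state) is invented, and `run` is handed plain values rather than field descriptors for simplicity:

```typescript
// Shapes from the interface sketch above.
interface SubstrateCommandField {
  description: string;
  type: "string" | "number";
}
interface SubstrateCommand {
  description: string;
  parameters: Record<string, SubstrateCommandField>;
  returns: Record<string, SubstrateCommandField>;
}

// Stand-in for the page's own state (in a real page, a DOM node).
let noteText = "";

// JSON-able command index that bridge can discover.
const index: Record<string, SubstrateCommand> = {
  set_text: {
    description: "Replace the contents of the note",
    parameters: { text: { description: "New contents", type: "string" } },
    returns: { previous: { description: "Prior contents", type: "string" } },
  },
};

// Handlers behind run(); each takes and returns a plain object.
const handlers: Record<string, (p: Record<string, unknown>) => Record<string, unknown>> = {
  set_text: (p) => {
    const previous = noteText;
    noteText = String(p.text);
    return { previous };
  },
};

// Install the global before page load completes; r0 versions the interface.
(globalThis as any).substrate = {
  r0: {
    command: {
      index,
      run: (name: string, parameters: Record<string, unknown>) =>
        handlers[name](parameters),
    },
  },
};

// Bridge-side usage (over chromedp this would be an Evaluate call).
const out = (globalThis as any).substrate.r0.command.run("set_text", { text: "hello" });
console.log(out, noteText);
```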
With the approach outlined above we can quickly write interfaces that are usable by people and by agents. We can keep loose coupling between the commands we "export" and how the DOM itself is structured.
This approach has some limitations that are fine for now. It does not handle situations where commands might change, as would be common in a SPA. It also does not support commands whose return values might be a "stream" of events, like we might have when "collaborating" with an agent while jointly writing a bit of text.
These sorts of event-driven interactions might be achievable by using `console.log` along with something like https://pkg.go.dev/github.com/chromedp/chromedp#ListenTarget, but they are out of scope for now.
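The page-side half of that idea could be as simple as logging a recognizable JSON envelope for the chromedp listener to filter on; the marker and envelope shape below are an invented convention, not an existing protocol:

```typescript
// Hypothetical envelope for streaming events out of a page via the
// console. A chromedp ListenTarget handler watching console events
// would match on the marker and JSON-decode the rest.
const MARKER = "substrate-r0-event";

function emitEvent(name: string, payload: unknown): string {
  const envelope = JSON.stringify({ marker: MARKER, name, payload });
  console.log(envelope);
  return envelope;
}

const line = emitEvent("text-changed", { length: 42 });
```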
@progrium ^
> - bridge has hardcoded DOM-based operations it will do on specific pages (e.g. it knows the node ID of the textarea to "type" in, the ID of the button to "click", etc.)
A similar option that would be less coupled to bridge: use some of the ARIA accessibility roles to annotate forms and buttons on the page instead of looking for special IDs. Bridge issuing voice commands seems related to what screen readers do for voice input, so we may be able to share some of those conventions. It doesn't have to be comprehensive; maybe we can support a small, specific subset: https://www.w3.org/WAI/ARIA/apg/patterns/landmarks/examples/form.html
Having the voice commands visible as a form on the page would give the reader some visual documentation of the structure of the inputs, as well as enabling them to input directly via the keyboard when that's more convenient.
This may not fully replace the need for the `window.substrate` option to directly call JS, but it could be a useful way to present commands that can be called via voice while still working as a regular form in any browser.
Understanding the ARIA stuff could also lead to some interesting future possibilities in improving access to other sites for users that have limitations using a keyboard & mouse.
Yes, ARIA could be good, though it would likely require some level of multi-step planning to really be used properly.
Today I think most models that actually work are essentially limited to "single step" execution of plans.
That said if we can make our stuff ARIA-compatible that would be good in general and if it means more productive constraints to embrace, all the better.
I've looked at #65 a bit, but is there anything else describing the broader scope of what this should be capable of handling? Maybe I'm reading too much into it, but it seems like I could use this to write an arbitrary SPA and make it "bridge compatible" by exposing metadata about various voice commands.
Part of what I'm trying to understand is where the responsibility lies for handling responses. The interface above includes a schema for the return, though is it up to bridge to decide how to handle the return structure? Or is the SPA responsible for implementing `run` such that it performs the necessary updates?
E.g. if I built a calendar app and wanted a voice command to schedule a meeting, it seems like it would be up to the app to handle the response and update the UI rather than bridge being responsible for displaying the response.
At first I was thinking about it from the point of view of mapping between the voice commands and something like a form with a button and text inputs. Though as I was writing this up, maybe it's more equivalent to the VSCode "Command Palette" style interface?
Maybe if the command palette seems like an appropriate HTML analogue we could prototype this in a way that the app can provide both interfaces from the same list of commands.
Yes, there are strong analogs to a command palette.
However, I haven't seen any open source command palettes that are both good and take parameters.
I have always sorta assumed that someday we'd just make a thing that looks like a legit terminal.
The substrate /ui service does use ninjakeys, though it's not currently being properly populated. You can click the top right command icon to see it pop up. It will be empty though.
Many of the use cases we have on the near-term roadmap are either parameterized (like SetText) or return something to be used by other things (like GetText).
It's very possible that we'll expect bridge to be able to do special things with single field return values. For example we might say: "reformat this text".
The other thing that we care about is less the concept of "voice commands" and more "natural-language-driven commands". Once we get things going we might expect a particular assistant to be in a loop interacting with a page in real-time.
For example we might tell bridge to "take notes on the meeting and write them here." Or to "make note of questions people ask and whether or not they are answered."
Other more advanced possibilities might include a codegen model writing a tool for another assistant to make use of.
There are a lot of possibilities for how this might evolve but we won't really know what they look like until we start playing with it.
@ajbouh is there a reason the arguments and return value are modeled as a key-value mapping as opposed to something more aligned with the intrinsic structure of functions?
Yes, a few main reasons:
- Function calling datasets usually model function parameters as key/value pairs.
- We need a ddl of some kind to describe the types and purpose of function parameters and return values. Rather than inventing something with a lot of surface area that is very specific, it seemed simpler to describe them with the same approach.
- Tools like CUE provide much richer tools for unifying objects than they provide for lists.
- Many of my previous experiments with browser DOM FFIs have a lot of similar functions that differ only by "a field on a return value". For example, consider a function that is about finding and extracting text from a DOM node. We might want the innerText or the innerHTML. While the current spec does not indicate that you must request specific return fields, a future extension that does so would be straightforward.
Thank you that's helpful. Let me share some thoughts as to why I'd change it.
> - Function calling datasets usually model function parameters as key/value pairs.
This is a big mistake on their part for assuming kwargs as a universal. There is a reason most IDL/schemas modeling functions have some kind of ordering information. BUT it's ok that these datasets don't have it as long as we have it somewhere, either in the schema we expose or through reflection (which is possible but not ideal with JS). I just prefer to model what are effectively function signatures where possible with that information.
> - We need a ddl of some kind to describe the types and purpose of function parameters and return values. Rather than inventing something with a lot of surface area that is very specific, it seemed simpler to describe them with the same approach
IMO this isn't a DDL, it's an IDL, and if you're modeling functions it's a Good Idea to have ordering info in arguments.
> - Tools like CUE provide much richer tools for unifying objects than they provide for lists
You can still have an intermediate representation that is an object and work with systems without kwargs as long as something describes the order.
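One way to keep the object encoding while recording ordering info: annotate each field with an explicit position, so positional calls can be mechanically bound onto the key/value form. A hypothetical sketch (not part of the proposed interface):

```typescript
// Field descriptor extended with an explicit position (invented).
interface OrderedField {
  description: string;
  type: "string" | "number";
  position: number;
}

// Map positional arguments onto the named key/value form the
// command expects, using the declared positions as the ordering.
function bindPositional(
  params: Record<string, OrderedField>,
  args: (string | number)[],
): Record<string, string | number> {
  const names = Object.keys(params).sort(
    (a, b) => params[a].position - params[b].position,
  );
  const bound: Record<string, string | number> = {};
  names.forEach((name, i) => {
    if (i < args.length) bound[name] = args[i];
  });
  return bound;
}

const spec: Record<string, OrderedField> = {
  text: { description: "text to set", type: "string", position: 0 },
  cursor: { description: "cursor offset", type: "number", position: 1 },
};
const bound = bindPositional(spec, ["hello", 5]);
console.log(bound); // { text: 'hello', cursor: 5 }
```

Trailing optional arguments can simply be omitted from the positional call, and adding a new field at a later position doesn't disturb existing callers.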
> - Many of my previous experiments with browser DOM FFIs have a lot of similar functions that differ only by "a field on a return value". For example, consider a function that is about finding and extracting text from a DOM node. We might want the innerText or the innerHTML. While the current spec does not indicate that you must request specific return fields, a future extension that does so would be straightforward.
I would hate for this to become GraphQL.
Maybe there is no specific reason yet to treat commands as functions other than making things simpler (any function can become a command), but in my experience it's a good idea. Anyway, I wanted to understand why this wasn't obvious or whether there were other good reasons for avoiding it, but it sounds like there shouldn't be a problem with making some minor changes to this API to align it with a more "universal" concept of command/function.
I agree about the IDL vs DDL bit.
In my mind these aren't functions; they are messages sent between processes implemented in separate codebases. And IMHO, thinking in terms of object in, object out makes more sense for a free-standing message sent between systems. This is partly because the thing that defines the command and the thing that discovers/uses the command can only really afford fairly loose coupling.
Ideally we won't have cascading refactorings when we decide to tweak the arguments these commands accept. By ignoring ordering we can avoid reverberations from adding arguments, reordering them, or making them optional.
All that said...
If there's something that feels a bit better or easier to implement, that's fine ... whatever we do is going to need iteration and will probably be wrong in some unanticipated way.
Broadly speaking I think it's an open question how we should be treating these boundaries between systems. I think we'll learn tons more as we wire things up and try to evolve everything.
**plumbing to cast a stage to nearby Chromecast or Apple TV**
This requires a URL showing a WebRTC stream. We need to confirm this works for Apple TV and Chromecast, and we need to wrap some discovery/setup/auth. It also needs mDNS on the deployed network.
**assistant api for stages**

This API would be exposed to the assistant LLM via reflection:
An initial version can act on a single stage, so stage-id management isn't necessary. Initially we'll just hit the Docker API directly to spawn a chromestage instance for each stage and keep a chromedp connection to it.
**show stage in UI frontend**

When stages are used, we want them to show up in the web UI, which we'll do by either running a VNC client against chromestage or having chromestage push WebRTC to bridge.