ajbouh / substrate

bridge: stage prototype #10

Open ajbouh opened 6 months ago

ajbouh commented 6 months ago

plumbing to cast a stage to nearby Chromecast or Apple TV

This requires a URL showing a WebRTC stream. Need to confirm this works for AppleTV+Chromecast. Also need to wrap some discovery/setup/auth. Needs mdns on the deployed network.

assistant api for stages

This API would be exposed to assistant LLM via reflection:

type StageAPI interface {
    CreateStage() (int, error) // create new chromestage instance
    CloseStage(id int) error // close chromestage instance

    OpenURL(id int, url string) error
    OpenBlank(id int) error

    // high level raw dom operations, presumably on blank page
    AppendParagraph(id int, text string) error
    AppendImage(id int, url string) error

    CastStage(id int, dest string) error // requires setup, via functions? config?
    ...
}

An initial version can act on a single stage, so stage id management isn't necessary. Initially we'll just hit the Docker API directly to spawn a chromestage instance for each stage and keep a chromedp connection to it.
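For the single-stage case this might look roughly like the sketch below: spawn one chromestage container (shown here with a plain docker run for brevity, rather than the Docker API) and keep a chromedp connection to its debug endpoint. The image name, port, and wait logic are placeholders.

// Sketch only: spawn a chromestage container and attach chromedp to it.
package main

import (
	"context"
	"log"
	"os/exec"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Spawn a chromestage instance; the real version would use the Docker API.
	docker := exec.Command("docker", "run", "--rm", "-d", "-p", "9222:9222", "chromestage")
	if err := docker.Run(); err != nil {
		log.Fatal(err)
	}
	time.Sleep(2 * time.Second) // crude wait for the debug service to come up

	// Keep a chromedp connection to the container's DevTools endpoint.
	allocCtx, cancel := chromedp.NewRemoteAllocator(context.Background(), "ws://127.0.0.1:9222")
	defer cancel()
	ctx, cancel := chromedp.NewContext(allocCtx)
	defer cancel()

	// OpenURL on the single stage boils down to a Navigate action.
	if err := chromedp.Run(ctx, chromedp.Navigate("https://example.com")); err != nil {
		log.Fatal(err)
	}
}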

show stage in UI frontend

When stages are used we want them to show up in the web UI, which we'll do either by running a VNC client against chromestage or by having chromestage push WebRTC to bridge.

progrium commented 5 months ago

Although the assistant commands depend on #65 (and indirectly #64), the other parts can be started immediately, and I would suggest prioritizing my time towards this.

progrium commented 4 months ago

Today I got started on this. The first thing was getting chromestage's debug service exposed and usable outside the container. My first attempt was to just expose the debug port on the Docker container, but connections to it would be accepted and then immediately reset. I thought maybe the service wasn't listening on 0.0.0.0, but changing that made no difference, and it would have said connection refused if that were the case. The Go process inside the container was able to connect fine, so I just made it proxy connections and that worked. Weird, but we'd need a proxy anyway for VNC.
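For reference, the in-container proxy doesn't need to be more than a raw TCP copy loop. A sketch of the idea (ports are placeholders, not the real chromestage configuration):

// Minimal TCP proxy sketch: accept connections on an exposed port and copy
// bytes to and from the local debug service.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "0.0.0.0:9223") // port exposed from the container
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			upstream, err := net.Dial("tcp", "127.0.0.1:9222") // local debug service
			if err != nil {
				return
			}
			defer upstream.Close()
			go io.Copy(upstream, c) // client -> debug service
			io.Copy(c, upstream)    // debug service -> client
		}(conn)
	}
}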

I wanted to use VNC for the display in bridge, so I made sure noVNC worked with the in-container VNC service. It worked fine using its websocket proxy. I'm pretty sure this is literally tunneling TCP over WebSocket, so it'll be easy to roll this into the chromestage service. With that, both VNC and chromedp can be used over HTTP, which sets them up to be used on substrate via the gateway.

I asked about how we set up "stateful" instances of containers and it seems I have a path forward on that, so Monday I can start getting this into the Substrate repo. I'll just make bridge start up with a chromestage instance when an env var is set, which will spawn the container and show the VNC display on the page. I'll probably make some dumb voice-activated commands (not using the LLM chooser) to control the browser, and then we should have a working, integrated proof of concept to mess around with, ready to eventually be controlled by assistant commands with #65 ...

progrium commented 4 months ago

notes from discord on using substrate for stateful services:

The basic idea (still finishing the commit) is to declare that your service uses a "space"

package defs

enable: "gotty": true

"lenses": "gotty": {
  spawn: {
    ...
    parameters: {
      data: {
        type: "space"
      }
    }
  }
}

In this case, when you visit /gw/gotty it will see that "data" is expected but unassigned, so it will allocate a fresh directory and then redirect you to (for example) /gw/gotty[data=sp-01HP5MSFQFXHH838EVBWV5P47R]/

Inside of the container, we mount that space at /spaces/data and you can write to /spaces/data/tree

So for chromestage we would expect to keep the profile directory in a space that would make it possible to have bookmarks and tabs that we can come back to

There's also a concept of spawn parameters that are environment variables, so some creativity is possible with that approach

progrium commented 4 months ago

@ajbouh https://github.com/ajbouh/substrate/pull/139

besides using substrate to launch, which i might not be able to test until i get back in about a week, there are some minor things to fix up here. i can just make it pluggable so it can launch a subprocess for now and let a substrate launcher replace that. this way i can get it to be per session and only activate when triggered by a command. the commands are also all dumb text processing on transcriptions, but to take this to the next level with a proper API exposed to assistants, we need to get #65 in.
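a rough sketch of what "pluggable" could look like (names here are made up, not what's in the PR): a launcher interface with a subprocess implementation now, so a substrate-backed launcher can slot in later.

// Hypothetical launcher abstraction for bridge. SubprocessLauncher is a
// stand-in until a substrate-backed launcher replaces it.
package stage

import (
	"context"
	"os/exec"
)

// Launcher starts a chromestage instance and reports where its chromedp
// debug endpoint can be reached.
type Launcher interface {
	Launch(ctx context.Context) (debugURL string, err error)
}

type SubprocessLauncher struct {
	Command string // e.g. a local chromestage binary or a docker invocation
	Args    []string
}

func (l *SubprocessLauncher) Launch(ctx context.Context) (string, error) {
	cmd := exec.CommandContext(ctx, l.Command, l.Args...)
	if err := cmd.Start(); err != nil {
		return "", err
	}
	return "ws://127.0.0.1:9222", nil // placeholder endpoint
}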

there are also a few quirks where it doesn't always load pages as i mentioned in chat. these are mostly all docker or container configuration things.

in the meantime, it could be worth making sure we can stream webrtc to appletv/chromecast and making sure we have the setup workflow for those working.

ajbouh commented 4 months ago

Now that we have a version of chromestage that can run within substrate, is viewable in browser, and has a working chromedp interface, we need to make some decisions about how bridge can productively interact with things via chromestage.

There are a few naive options:

  1. a custom HTTP-driven API that "happens to" update what's shown in chromestage (so bridge issues HTTP requests to the same HTTP service that chromestage is visiting)
  2. bridge has hardcoded DOM-based operations it will do on specific pages (e.g. it knows the node ID of the textarea to "type" in, the ID of the button to "click", etc.)

Neither of these is great, for the same underlying reason: they tightly couple the behavior of the service to bridge's internal logic.

Instead, there may be a useful tradeoff somewhere between the two.

window.substrate

We can introduce a top-level global inside every page called: window.substrate. This global will be set before page load is complete. Bridge can interact with this top-level global to learn about the page and take page-specific commands.

Here's what I'm thinking for the "interface" of this top-level global:

interface SubstrateCommandField {
  description: string
  type: 'string' | 'number' // for now we only accept strings or numbers
}

interface SubstrateCommand {
  description: string
  parameters: Record<string, SubstrateCommandField>
  returns: Record<string, SubstrateCommandField>
}

interface SubstrateGlobal {
  // wrap everything in a top-level r0 object, so we can experiment with multiple interfaces at once without breaking everything.
  r0: {
    command: {
      // This object is intentionally JSON-able.
      index: Record<string, SubstrateCommand>
      run: (commandName: string, parameters: Record<string, string | number>) => Record<string, string | number> // takes and returns actual field values, not field descriptors
    }
  }
}
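On the bridge side, the JSON-able part of this could be mirrored with Go structs along these lines (a sketch; none of the naming is settled):

// Hypothetical Go-side mirror of the JSON-able parts of window.substrate.r0,
// as bridge would see them after evaluating the command index via chromedp.
package stage

type SubstrateCommandField struct {
	Description string `json:"description"`
	Type        string `json:"type"` // "string" | "number" for now
}

type SubstrateCommand struct {
	Description string                           `json:"description"`
	Parameters  map[string]SubstrateCommandField `json:"parameters"`
	Returns     map[string]SubstrateCommandField `json:"returns"`
}

// CommandIndex is the shape of window.substrate.r0.command.index.
type CommandIndex map[string]SubstrateCommand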

To use this, bridge could interact with any application within substrate or on the greater web by:

  1. Use chromedp to navigate to URL
  2. Use chromedp to wait for load to complete
  3. Use chromedp to evaluate window.substrate.r0.command.index.
  4. To the discovered commands, add some basic defaults. These might include: go to, reload, refresh, go back, scroll down, click, type.
  5. On each new text event from the user, use the current set of commands to prompt a function-calling model. If a function is selected by the model, run it.
  6. To run a function defined by the page, we can use window.substrate.r0.command.run(...)
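A minimal chromedp sketch of steps 1-3 and 6, assuming the Go types sketched above and a hypothetical append_note command (step 5, prompting the model, is elided):

// Sketch only: discover the page's commands and run one over chromedp.
package stage

import (
	"context"

	"github.com/chromedp/chromedp"
)

func discoverAndRun(ctx context.Context, url string) error {
	var index CommandIndex    // CommandIndex as sketched earlier
	var result map[string]any // return values from the page's command

	return chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"), // stand-in for "wait for load to complete"
		chromedp.Evaluate(`window.substrate.r0.command.index`, &index),
		// ...prompt a function-calling model with index plus the default commands...
		chromedp.Evaluate(`window.substrate.r0.command.run("append_note", {text: "hello"})`, &result),
	)
}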

With the approach outlined above we can quickly write interfaces that are usable by people and by agents. We can keep loose coupling between the commands we "export" and how the DOM itself is structured.

This approach has some limitations that are fine for now. It does not handle situations where commands might change, as would be common in a SPA. It also does not support commands whose return values might be a "stream" of events, like we might have when "collaborating" with an agent while jointly writing a bit of text.

These sorts of event-driven interactions might be achievable by using console.log along with something like: https://pkg.go.dev/github.com/chromedp/chromedp#ListenTarget, but they are out of scope for now.
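Just to illustrate the shape of that (not something to build now), ListenTarget can surface console.log calls made by the page:

// Sketch: receive page-emitted events via console.log and chromedp.ListenTarget.
package stage

import (
	"context"

	"github.com/chromedp/cdproto/runtime"
	"github.com/chromedp/chromedp"
)

func listenForPageEvents(ctx context.Context) {
	chromedp.ListenTarget(ctx, func(ev interface{}) {
		if e, ok := ev.(*runtime.EventConsoleAPICalled); ok {
			for _, arg := range e.Args {
				_ = arg.Value // JSON-encoded console.log argument; dispatch on it here
			}
		}
	})
}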

ajbouh commented 4 months ago

@progrium ^

mgood commented 4 months ago

  2. bridge has hardcoded DOM-based operations it will do on specific pages (e.g. it knows the node ID of the textarea to "type" in, the ID of the button to "click", etc.)

A similar option that would be less coupled to bridge is if we can use some of the ARIA accessibility roles to annotate forms and buttons on the page instead of looking for special IDs. Bridge being able to issue voice commands seems related to what screen readers do for voice input, so we may be able to share some of those conventions. It doesn't have to be comprehensive, but maybe we can support a small specific subset: https://www.w3.org/WAI/ARIA/apg/patterns/landmarks/examples/form.html

Having the voice commands visible as a form on the page would give the reader some visual documentation of the structure of the inputs, as well as enabling them to input directly via the keyboard when that's more convenient. This may not fully replace the need for the window.substrate option to directly call JS, but could be a useful way to present commands that can be called via voice while still working as a regular form in any browser.

Understanding the ARIA stuff could also lead to some interesting future possibilities in improving access to other sites for users that have limitations using a keyboard & mouse.

ajbouh commented 4 months ago

Yes, ARIA could be good, though it would likely require some level of multi-step planning to really be used properly.

Today I think most models that actually work are essentially limited to "single step" execution of plans.

That said if we can make our stuff ARIA-compatible that would be good in general and if it means more productive constraints to embrace, all the better.

mgood commented 4 months ago

I've looked at #65 a bit, but is there anything else describing the broader scope of what this should be capable of handling? Maybe I'm reading too much into it, but it seems like I could use this to write an arbitrary SPA and make it "bridge compatible" by exposing metadata about various voice commands.

Part of what I'm trying to understand is where the responsibility lies for handling responses. The interface above includes a schema for the return, though is it up to bridge to decide how to handle the return structure? Or is the SPA responsible for implementing run such that it performs the necessary updates?

E.g. if I built a calendar app and wanted a voice command to schedule a meeting, it seems like it would be up to the app to handle the response and update the UI rather than bridge being responsible for displaying the response.

At first I was thinking about it from the point of view of mapping between the voice commands and something like a form with a button and text inputs. Though as I was writing this up, maybe it's more equivalent to the VSCode "Command Palette" style interface?

Maybe if the command palette seems like an appropriate HTML analogue we could prototype this in a way that the app can provide both interfaces from the same list of commands.

ajbouh commented 4 months ago

Yes, there are strong analogs to a command palette.

However, I haven't seen any open source command palettes that are both good and take parameters.

I have always sorta assumed that someday we'd just make a thing that looks like a legit terminal.

The substrate /ui service does use ninjakeys, though it's not currently being properly populated. You can click the top right command icon to see it pop up. It will be empty though.

Many of the use cases we have on the near-term roadmap either are parameterized (like SetText) or return something to be used by other things (like GetText).

It's very possible that we'll expect bridge to be able to do special things with single field return values. For example we might say: "reformat this text".

The other thing that we care about is less the concept of "voice commands" and more "natural-language-driven commands". Once we get things going we might expect a particular assistant to be in a loop interacting with a page in real-time.

For example we might tell bridge to "take notes on the meeting and write them here." Or to "make note of questions people ask and whether or not they are answered."

Other more advanced possibilities might include a codegen model writing a tool for another assistant to make use of.

There are a lot of possibilities for how this might evolve but we won't really know what they look like until we start playing with it.

progrium commented 4 months ago

@ajbouh is there a reason the arguments and return value are modeled as a key-value mapping as opposed to something more aligned with the intrinsic structure of functions?

ajbouh commented 4 months ago

Yes, a few main reasons:

  • Function calling datasets usually model function parameters as key/value pairs.

  • We need a ddl of some kind to describe the types and purpose of function parameters and return values. Rather than inventing something with a lot of surface area that is very specific, it seemed simpler to describe them with the same approach

  • Tools like CUE provide much richer tools for unifying objects than they provide for lists

  • Many of my previous experiments with browser DOM FFIs have a lot of similar functions that differ only by "a field on a return value". For example, consider a function that is about finding and extracting text from a DOM node. We might want the innerText or the innerHTML. While the current spec does not indicate that you must request specific return fields, a future extension that does so would be straightforward.

progrium commented 4 months ago

Thank you that's helpful. Let me share some thoughts as to why I'd change it.

  • Function calling datasets usually model function parameters as key/value pairs.

This is a big mistake on their part for assuming kwargs as a universal. There is a reason most IDL/schemas modeling functions have some kind of ordering information. BUT it's ok that these datasets don't have it as long as we have it somewhere, either in the schema we expose or through reflection (which is possible but not ideal with JS). I just prefer to model what are effectively function signatures where possible with that information.

  • We need a ddl of some kind to describe the types and purpose of function parameters and return values. Rather than inventing something with a lot of surface area that is very specific, it seemed simpler to describe them with the same approach

IMO this isn't a DDL, it's an IDL, and if you're modeling functions it's a Good Idea to have ordering info in arguments.

  • Tools like CUE provide much richer tools for unifying objects than they provide for lists

You can still have an intermediate representation that is an object and work with systems without kwargs as long as something describes the order.

  • Many of my previous experiments with browser DOM FFIs have a lot of similar functions that differ only by "a field on a return value". For example, consider a function that is about finding and extracting text from a DOM node. We might want the innerText or the innerHTML. While the current spec does not indicate that you must request specific return fields, a future extension that does so would be straightforward.

I would hate for this to become GraphQL.

Maybe there is no specific reason yet to treat commands as functions other than making things simpler (any function can become a command), but in my experience it's a good idea. Anyway, I wanted to understand why this wasn't obvious or if there were other good reasons for avoiding it, but it sounds like there shouldn't be a problem with making some minor changes to this API to align it with a more "universal" concept of command/function.

ajbouh commented 4 months ago

I agree about the IDL vs DDL bit.

In my mind these aren't functions, they are messages sent between processes implemented in separate codebases. And IMHO, thinking in terms of object in, object out makes more sense for a free-standing message to send between systems. This is partly because the thing that defines the command and the thing that discovers/uses the command can only really afford fairly loose coupling.

Ideally we won't have cascading refactorings when we decide to tweak the arguments these commands accept. By ignoring ordering we can avoid reverberations from adding arguments, reordering them, or making them optional.
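As a tiny (hypothetical) illustration: if a command later grows an optional field, callers that address parameters by name are unaffected.

// Hypothetical: a page command gains an optional "heading" parameter.
// Because parameters are addressed by name, existing callers keep working.
package stage

var oldCall = map[string]any{"text": "hello"}                     // written before "heading" existed
var newCall = map[string]any{"heading": "Notes", "text": "hello"} // takes advantage of the new field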

All that said...

If there's something that feels a bit better or easier to implement, that's fine ... whatever we do is going to need iteration and will probably be wrong in some unanticipated way.

Broadly speaking I think it's an open question how we should be treating these boundaries between systems. I think we'll learn tons more as we wire things up and try to evolve everything.