Writing Assistance APIs

domenic commented 3 months ago

Introduction

Browsers and operating systems are increasingly expected to gain access to a language model. (Example, example, example.) Web applications can benefit from using language models for a variety of use cases.

We're proposing a group of APIs that use language models to give web developers high-level assistance with writing. Specifically:

The summarizer API produces summaries of input text;
The writer API writes new material, given a writing task prompt;
The rewriter API transforms and rephrases input text in the requested ways.

Because these APIs share underlying infrastructure and API shape, and have many cross-cutting concerns, we include them all in one explainer, to avoid repeating ourselves across three repositories. However, they are separate API proposals, and can be evaluated independently.

Read the complete Explainer.

Feedback

I welcome feedback in this thread, but encourage you to file bugs against the Explainer.

nileshtrivedi commented 3 months ago

The APIs as proposed seem too specific, too early. These will end up thwarting experiments rather than encouraging them. This should be a JS library rather than an API standard at this time.

ai is not just language models. It's possible that the built-in AI model is multi-modal and supports image/audio/video processing along with text and therefore, would need different semantics and capabilities for these use cases.
window.ai might be just a proxy/wrapper for remote, perhaps even paid, models (See https://webllm.mlc.ai/ and https://openrouter.ai/ ). These models will have all kinds of capabilities and the API should allow leveraging those.
The summarizer, writer and rewriter are high-level use-cases which are being defined too early. Developers still haven't started using the Prompt API. IMHO, it seems desirable that developers become familiar with prompt engineering of local AI models. Instead, the proposed APIs will hard-code one particular set of prompts into the platform.

In summary, this proposal is trying to standardize new technical capabilities in a new ecosystem which is fast evolving and likely to drastically change. While the Prompt API is reasonably necessary, high-level use cases should not be standardized so soon.

petamoriken commented 3 months ago

Can we change to navigator.ai? https://github.com/WICG/proposals/issues/158#issuecomment-2291111059

slightlyoff commented 3 months ago

It's also somewhat surprising that these API sketches don't have permissions associated with them. Many implementations will likely plumb AI work down to a (shared) system model for resource reasons; why are we proposing to expose potentially shared state to sites without user consent?

domenic commented 3 months ago

Can we change to navigator.ai? #158 (comment)

No, as this is not part of the navigator or operating system environment. https://github.com/w3ctag/design-principles/issues/448#issuecomment-2219295220

Regarding naming in general, please see https://github.com/explainers-by-googlers/writing-assistance-apis/blob/main/README.md#alternative-api-spellings

It's also somewhat surprising that these API sketches don't have permissions associated with them. Many implementations will likely plumb AI work down to a (shared) system model for resource reasons; why are we proposing to expose potentially shared state to sites without user consent?

The shared state aspect is discussed in https://github.com/explainers-by-googlers/writing-assistance-apis/blob/main/README.md#the-capabilities-apis , including the possibility of user prompts and permissions.

josephrocca commented 3 months ago

Context: I am in the target audience for a feature like this - I run multiple web applications that use cloud services for these sorts of tasks. I am very grateful and excited that the Chrome team is trying to push the envelope here.

I have a few thoughts:

1. Well-designed high-level APIs could be useful (but it may be too early):

High-level tasks like summarization are non-trivial to implement "manually" in a real world setting if you want them to be robust to document length, and be performant (in terms of accuracy, but also speed - things like designing for prefix cache hits in your prompts can speed up inference by 3x or more in long-context tasks). In this respect, a high-level API may be very useful for developers, but I'm concerned that it's too early to form good abstraction to solve this pain point.

To illustrate the potential value of a high-level API, via the difficulty of writing a robust summarization API (i.e. when you're not just building a "hello world" demo):

You start with something like await session.prompt("Summarize this text: "+text), but eventually realize that the model isn't smart enough to summarize large amounts of text in one go, so you chunk the document and summarize a few paragraphs at a time.
Then you realize that the model should ideally be able to see previous summaries in the chain so that it has context on who "Jane" and "Bob" are in the current chunk.
So you now show the previous two summaries and the result is better.
But you realize it's still not sufficient - the model needs to see more of the previous text. Okay, so now we really need to roll up our sleeves - we need a kind of 'hierarchical' summarization, so the model sees all previous text, but at a "resolution"/"compression" level that allows it to fit within the model's context limit, and also not 'overwhelm' the model.
That works now, but now it's really slow, because a single "step" of your summary can now involve several steps - a 'cascade' of summarizations up the hierarchy, before the actual summary of this next chunk can begin.
You realize that you can get a 3x speedup by leaning on "automatic prefix caching" - i.e. if the prefix of the prompt doesn't change, then the LLM inference framework doesn't have to process that prefix before starting to generate text. So you basically want to hold off on the hierarchical summarization for several steps (resulting in a 'build up' of a few 'uncompressed' chunks/paragraphs at the end), and then do several cascades at once.
That works, but your document is growing incrementally over time (e.g. conference call, chat thread, etc.) and when you finally trigger those "several cascades at once" it causes a "stall" for 30+ seconds in your ~live/progressive summarization, and the summarization is needed to support other features, so it makes those other features unusable/inaccurate while the summaries are outdated due to this "several cascades at once" thing.
So you realize that you should be doing the cascades in the background, and then only adding them to your summarization prompt when you're ready to "ruin"/"miss" your prefix cache (which you have to do at some point - you just don't want to do it on every single chunk you process).
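As a rough illustration of the first few steps above (chunking, then showing the model the previous two summaries), here is a minimal sketch. It assumes only that `session` is an object with an async `prompt(text)` method, as in the one-liner in this thread. The helper names, the prompt wording, and the fake session are hypothetical and stand in for a real model. It does not attempt the hierarchical or prefix-cache-aware steps.

```javascript
// Sketch only: `session` is any object with an async prompt(text) method.

// Group paragraphs into chunks of `size` paragraphs each.
function splitIntoChunks(paragraphs, size) {
  const chunks = [];
  for (let i = 0; i < paragraphs.length; i += size) {
    chunks.push(paragraphs.slice(i, i + size).join("\n\n"));
  }
  return chunks;
}

// Build the prompt for one chunk, listing recent summaries before the chunk.
function buildChunkPrompt(previousSummaries, chunk) {
  const context = previousSummaries.length
    ? `Summaries of earlier text:\n${previousSummaries.join("\n")}\n\n`
    : "";
  return `${context}Summarize this text:\n${chunk}`;
}

// Summarize chunk by chunk, feeding the last `contextSize` summaries back in
// so the model keeps some context on who was mentioned earlier.
async function summarizeInChunks(
  session,
  paragraphs,
  { chunkSize = 3, contextSize = 2 } = {}
) {
  const summaries = [];
  for (const chunk of splitIntoChunks(paragraphs, chunkSize)) {
    const previous = summaries.slice(-contextSize);
    summaries.push(await session.prompt(buildChunkPrompt(previous, chunk)));
  }
  return summaries;
}

// Hypothetical fake session for experimentation: it records every prompt it
// receives and returns a numbered placeholder instead of calling a model.
function makeFakeSession() {
  const prompts = [];
  return {
    prompts,
    async prompt(text) {
      prompts.push(text);
      return `summary #${prompts.length}`;
    },
  };
}
```

Swapping the fake session for a real one should only require passing a different object, which also makes it easy to inspect exactly which prompts each step would send.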

I could go on; I have scars, but you get the idea. It's not insurmountably difficult, but there is a lot of complexity that you have to tackle in real world tasks. If this proposal is not interested in "robustly" solving real-world summarization, then it is not worth having a high-level API for this, since it's very easy to use the Prompt API to do this - it's literally a one-liner:

await session.prompt(`Summarize this text into a short paragraph (maximum 5 sentences) using a formal tone: ${text}`)

And if you need more consistent results (i.e. independent of model used), then you can be more explicit with the prompt/instruction.

In case it's helpful to see the thickness of the tail of the distribution of real-world summarization tasks here: Another use case that immediately comes to mind is one where I need to specify a prefix for the summary to ensure the LLM "approaches" the summary in a very particular way. In another case I need a summary in a bullet point format, where each bullet point only represents things that are "permanent" (e.g. birthplace of a person), and not things that could change (e.g. what they're wearing right now). In another one I require the LLM to enclose names of people in square brackets, and never use pronouns to reference people.
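To make those per-application requirements concrete, here is a small sketch of a prompt builder that takes the rules as data. The function name, option names, and prompt wording are all hypothetical. Asking for a summary prefix in the instructions is only an approximation, since actually prefilling the model's response would need support from the underlying API.

```javascript
// Sketch only: assembles a summarization prompt from application-specific rules.
function buildSummaryPrompt(text, { rules = [], summaryPrefix = "" } = {}) {
  const lines = ["Summarize the text between the --- markers."];
  if (rules.length > 0) {
    lines.push("Follow these rules:");
    rules.forEach((rule, i) => lines.push(`${i + 1}. ${rule}`));
  }
  lines.push("---", text, "---");
  if (summaryPrefix) {
    // Approximation: requests the prefix rather than prefilling the response.
    lines.push(`Begin the summary with exactly: ${summaryPrefix}`);
  }
  return lines.join("\n");
}

// Example: the "square brackets, no pronouns" requirement from above.
const examplePrompt = buildSummaryPrompt("Jane met Bob at the station.", {
  rules: [
    "Enclose names of people in square brackets.",
    "Never use pronouns to refer to people.",
  ],
});
```

The resulting string would then be passed to something like `session.prompt(examplePrompt)`.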

In my experience, almost every project/application other than very simple "hello world" types (which again are just one-liners with the Prompt API) has its own unique requirements like the examples above. I'd be much more comfortable with user-land experimentation before identifying pain points and solidifying high-level APIs to solve them.

2. Safety/ethics aspects should be completely controllable by the developer:

This isn't specific to these writing APIs - i.e. it applies to the Prompt API too. If Google's LLM safety policies are etched into my application (for Chrome browser users), and Apple's policies for Apple users, and so on - then this will likely end up making the API too painful to use. It should be possible to completely disable any kind of automated flagging/filtering.

There is DX pain here that needs to be experienced to be believed. If I ask the model to summarize some text that is calling for the immediate and brutal genocide of an ethnic group, then it should summarize the text - nothing more, nothing less. It's basically impossible to build a whole subset of applications that would be extremely valuable for society (in my case, a component of a semi-automated moderation system for online chat) thanks to an almost-impressive lack of imagination of some LLM service providers.

It's the developer that publishes the web applications who should be burdened with questions of safety - of their for loops, and of their window.ai calls. Please encode something akin to "developers can control safety filters" within the proposal, even if you're only able to use "should"-level language.

3. Benefits list items are all about on-device inference, but then README mentions cloud APIs several times?

The "Why built-in?" section lists benefits like: "Local processing of sensitive data [...] no server round-trip involved [...] Offline usage [...] save the user's bandwidth [...]". All the key benefits are based on the offline aspect - which makes sense, since developers like myself can and do already use cloud services.

But then the README goes on to mention:

(Shared goal:) Allow a variety of implementation strategies, including on-device or cloud-based models, while keeping these details abstracted from developers.
(Uncertain goal:) Allow web developers to know, or control, whether language model interactions are done on-device or using cloud services.
(Privacy:) perhaps we should make it easier for web developers to know whether a cloud-based model is in use, or which one.

What is the benefit of cloud-based models, and how would they be funded? E.g. is this like Chrome's Web Speech API, which runs on Google's own servers for free? Or would the browser user need to sign up to a provider and give their API key? The latter seems impractical.

If the "cloud fallback" runs for free on Google's own servers (for Chrome users), then that is concerning to me. The Web Speech API took the same approach, and it's unusably bad on Firefox because they don't have the Google-level resources to provide free compute to billions of people. So the Web Speech API is (or was, last I checked) basically unusable due to how bad Firefox's on-device voices are. This would be a much more severe issue with LLMs due to how compute-intensive they can be. If this API becomes important, then browser companies like Firefox would be in a rough spot, and new browser projects would be dead on arrival unless they were backed by a very large company.

domenic commented 3 months ago

Thanks all for the valuable feedback!

One misconception I'm seeing here repeatedly is people evaluating these APIs as additions to the prompt API. It'd be best for people to evaluate whether they're useful in a world where there is no prompt API. After all, the prompt API is not yet proposed to WICG or elsewhere---for good reason, as reaching interoperability on it could be quite challenging. We tried to make the relationship clear:

Even more so than many other behind-a-flag APIs, the prompt API is an experiment, designed to help us understand web developers' use cases to inform a roadmap of purpose-built APIs.

The writing assistance APIs are precisely the "purpose-built APIs" alluded to.

Of course, as @josephrocca mentions, using the behind-the-flag prompt API to help understand the requirements for robust writing assistance APIs is very valuable, and we would definitely use the feedback gathered from such experiences to make the writing assistance APIs robust.

What is the benefit of cloud-based models, and how would they be funded?

The intent here is the opposite of what you seem to be imputing. It's actually to be more inclusive, by not requiring browsers to ship their own language model. This allows more browsers to participate in implementing these APIs, not fewer.

josephrocca commented 3 months ago

by not requiring browsers to ship their own language model. This allows more browsers to participate in implementing these APIs

That makes sense, but I'm talking about a different concern. Rather than "(in)ability to ship a feature" I'm talking about "(in)ability to provide free cloud compute to make a potentially-very-important feature competitive with very-well-funded browser companies".

Google/Microsoft/etc. may not currently *intend* to use their ability to provide billions of free cloud compute hours as a competitive advantage (e.g. for low-end phones, say), but if it ends up being a very important feature, then I can see possible futures that aren't great for browser competition. The Web Speech API is a case where Google and other very-well-funded browsers were able to get ahead of competition (like Firefox) by providing free cloud compute to their users. As much as I love Chrome (it's all I use), and the Chrome team's work pushing the web forward (I've yet to forgive Firefox over File System Access API), I can see that this sort of thing is probably not healthy for the web in the long term. Whether or not this will be a *significant* concern is up for people to debate, but it's definitely a concern.

It'd be best for people to evaluate whether they're useful in a world where there is no prompt API. [...] as reaching interoperability on it could be quite challenging

Ah, I see. It's honestly hard for me (and perhaps others too given that, as you said, it is a common misconception) to imagine having only high-level "task" APIs in this area, given that a whole LLM is being downloaded to the device, and perhaps given that it seems difficult to cover much of the use-case distribution of this (very general-purpose) technology with only task-specific APIs.

That said, since the writer API "writes new material, given a writing task prompt", I think it may actually be lower-level than the Prompt API depending on some specifics (which I admit I should probably better-acquaint myself with before commenting here).

I currently exclusively use "instruct format" (akin to single-turn chat) rather than "chat format" in development, since the former is a lower-level format within which you can easily implement chat-style experiences, but with more control (e.g. the common chat formats aren't really designed for multiple non-user "characters" in the same thread). The instruct prompt is simply:

Here's a chat log:
---
[chat logs formatted however you like]
---
Please write the next message.

or something roughly to that effect (often with added role/character definitions, and sometimes extra context like when using retrieval-augmented generation techniques).
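For anyone who wants to try this, a minimal sketch of building that instruct-style prompt from a multi-character chat log might look like the following. The function name, the `speaker`/`text` message shape, and the `roleNotes` option are hypothetical. Only the surrounding template text comes from the format shown above.

```javascript
// Sketch only: turns a list of messages into a single instruct-format prompt.
function buildInstructPrompt(messages, { roleNotes = "" } = {}) {
  // One line per message; any number of non-user "characters" is fine.
  const log = messages
    .map(({ speaker, text }) => `${speaker}: ${text}`)
    .join("\n");
  const parts = [];
  if (roleNotes) {
    // Optional role/character definitions go before the log.
    parts.push(roleNotes, "");
  }
  parts.push(
    "Here's a chat log:",
    "---",
    log,
    "---",
    "Please write the next message."
  );
  return parts.join("\n");
}

// Example with two non-user characters in the same thread.
const instructPrompt = buildInstructPrompt([
  { speaker: "Narrator", text: "The shop door creaks open." },
  { speaker: "Shopkeeper", text: "Welcome! Looking for anything special?" },
]);
```

Because the whole conversation is rebuilt into one prompt on each turn, the log can be formatted however the application likes.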

So, assuming "The writer API writes new material, given a writing task prompt" can be taken fairly literally, I think you can basically replace all mentions of "Prompt API" with "Writer API" in my earlier comment, and I'd probably advocate for releasing only the Writer API for now, and holding off on task-specific, higher-level APIs like summarization and rewriting until/if pain-points are found.

tomgould commented 2 months ago

Great work from what's seen so far, looking forward to integrating these into some of my projects +1;

domenic commented 2 months ago

Hey WICG chairs! There's been some spirited discussion here, with both interest and disinterest from many quarters. I think that shows us that it's worth incubating these ideas in a forum with good IPR and contribution policies like WICG. What do you say?

chrisn commented 2 months ago

While this is more an implementation than an API issue, to be consistent with the principles of an ethical web let's make sure that the models behind these APIs are open data, and that the training data has been ethically sourced - by which I mean with explicit consent for this purpose.

josephrocca commented 2 months ago

let's make sure that the models behind these APIs are open data, and that the training data has been ethically sourced - by which I mean with explicit consent for this purpose.

I think it should be made more explicit that this is a call for debate on fair/ethical use of publicly available data for training predictive models.

I say this because the comment appears to make assertions which would be more appropriately posed as opinions or questions, given that there are no overwhelming arguments or consensus in either direction.

I'm not sure that this is the place for such a debate, but I agree that these are important questions.

(For the record, open data/weights may be feasible, but explicit opt-in for all training data would likely make this API impossible - unless synthetic data from upstream models [with less restrictive data policies] were used.)

domenic commented 2 months ago

Thanks @cwilso for helping us complete the move to WICG! I look forward to more discussions with you all over there. https://github.com/WICG/writing-assistance-apis
