Closed: domenic closed this 2 months ago
The APIs as proposed seem too specific, too early. They will end up thwarting experiments rather than encouraging them. This should be a JS library rather than an API standard at this time.
- `ai` is not just language models. It's possible that the built-in AI model is multi-modal and supports image/audio/video processing along with text, and would therefore need different semantics and capabilities for those use cases.
- `window.ai` might be just a proxy/wrapper for remote, perhaps even paid, models (see https://webllm.mlc.ai/ and https://openrouter.ai/ ). These models will have all kinds of capabilities, and the API should allow leveraging them.
- `summarizer`, `writer` and `rewriter` are high-level use cases that are being defined too early. Developers still haven't started using the Prompt API. IMHO, it seems desirable that developers first become familiar with prompt engineering of local AI models. Instead, the proposed APIs will hard-code one particular set of prompts into the platform.

In summary, this proposal is trying to standardize new technical capabilities in a new ecosystem which is fast evolving and likely to change drastically. While the Prompt API is reasonably necessary, high-level use cases should not be standardized so soon.
Can we change to `navigator.ai`?
https://github.com/WICG/proposals/issues/158#issuecomment-2291111059
It's also somewhat surprising that these API sketches don't have permissions associated with them. Many implementations will likely plumb AI work down to a (shared) system model for resource reasons; why are we proposing to expose potentially shared state to sites without user consent?
> Can we change to `navigator.ai`? #158 (comment)
No, as this is not part of the navigator or operating system environment. https://github.com/w3ctag/design-principles/issues/448#issuecomment-2219295220
Regarding naming in general, please see https://github.com/explainers-by-googlers/writing-assistance-apis/blob/main/README.md#alternative-api-spellings
> It's also somewhat surprising that these API sketches don't have permissions associated with them. Many implementations will likely plumb AI work down to a (shared) system model for resource reasons; why are we proposing to expose potentially shared state to sites without user consent?
The shared state aspect is discussed in https://github.com/explainers-by-googlers/writing-assistance-apis/blob/main/README.md#the-capabilities-apis , including the possibility of user prompts and permissions.
Context: I am in the target audience for a feature like this - I run multiple web applications that use cloud services for these sorts of tasks. I am very grateful and excited that the Chrome team is trying to push the envelope here.
I have a few thoughts:
High-level tasks like summarization are non-trivial to implement "manually" in a real-world setting if you want them to be robust to document length and performant (in terms of accuracy, but also speed - things like designing for prefix cache hits in your prompts can speed up inference by 3x or more in long-context tasks). In this respect, a high-level API may be very useful for developers, but I'm concerned that it's too early to form a good abstraction to solve this pain point.
To illustrate the potential value of a high-level API, via the difficulty of writing a robust summarization API (i.e. when you're not just building a "hello world" demo):
You start with `await session.prompt("Summarize this text: " + text)`, but eventually realize that the model isn't smart enough to summarize large amounts of text in one go, so you chunk the document and summarize a few paragraphs at a time.

I could go on; I have scars, but you get the idea. It's not insurmountably difficult, but there is a lot of complexity that you have to tackle in real-world tasks. If this proposal is not interested in "robustly" solving real-world summarization, then it is not worth having a high-level API for this, since it's very easy to use the Prompt API to do it; it's literally a one-liner:
```js
await session.prompt(`Summarize this text into a short paragraph (maximum 5 sentences) using a formal tone: ${text}`)
```
And if you need more consistent results (i.e. independent of model used), then you can be more explicit with the prompt/instruction.
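To make the chunking point above concrete, here is a rough sketch of the kind of recursive map-reduce summarization loop being described. `session.prompt()` is the hypothetical low-level call from the examples above; the chunk size, prompt wording, and function names are placeholders, not a proposed design:

```javascript
// Pack paragraphs into chunks of roughly `maxChars` characters. A real
// implementation would also split oversized single paragraphs.
function chunkText(text, maxChars = 4000) {
  const chunks = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    if (current && current.length + para.length > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + "\n\n" + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Naive hierarchical summarizer: summarize each chunk, then recursively
// summarize the concatenated partial summaries until one chunk remains.
async function summarize(session, text) {
  const chunks = chunkText(text);
  if (chunks.length === 1) {
    return session.prompt(`Summarize this text into a short paragraph: ${text}`);
  }
  const partials = await Promise.all(
    chunks.map((c) => session.prompt(`Summarize this text into a short paragraph: ${c}`))
  );
  return summarize(session, partials.join("\n\n"));
}
```

Even this sketch ignores the hard parts mentioned above (prefix-cache-friendly prompt layout, style constraints, overlap between chunks), which is exactly the complexity a robust high-level API would need to own.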
In case it's helpful to see the thickness of the tail of the distribution of real-world summarization tasks here: another use case that immediately comes to mind is one where I need to specify a prefix for the summary to ensure the LLM "approaches" the summary in a very particular way. In another case I need a summary in a bullet-point format, where each bullet point only represents things that are "permanent" (e.g. the birthplace of a person), and not things that could change (e.g. what they're wearing right now). In another one I require the LLM to enclose names of people in square brackets, and never use pronouns to reference people.
In my experience, almost every project/application other than very simple "hello world" types (which again are just one-liners with the Prompt API) has its own unique requirements like the examples above. I'd be much more comfortable with user-land experimentation before identifying pain points and solidifying high-level APIs to solve them.
This isn't specific to these writing APIs - i.e. it applies to the Prompt API too. If Google's LLM safety policies are etched into my application (for Chrome browser users), and Apple's policies for Apple users, and so on - then this will likely end up making the API too painful to use. It should be possible to completely disable any kind of automated flagging/filtering.
There is DX pain here that needs to be experienced to be believed. If I ask the model to summarize some text that is calling for the immediate and brutal genocide of an ethnic group, then it should summarize the text - nothing more, nothing less. It's basically impossible to build a whole subset of applications that would be extremely valuable for society (in my case, a component of a semi-automated moderation system for online chat) thanks to an almost-impressive lack of imagination of some LLM service providers.
It's the developer that publishes the web application who should be burdened with questions of safety: of their `for` loops, and of their `window.ai` calls. Please encode something akin to "developers can control safety filters" within the proposal, even if you're only able to use "should"-level language.
The "Why built-in?" section lists benefits like: "Local processing of sensitive data [...] no server round-trip involved [...] Offline usage [...] save the user's bandwidth [...]". All the key benefits are based on the offline aspect - which makes sense, since developers like myself can and do already use cloud services.
But then the README goes on to mention:
What is the benefit of cloud-based models, and how would they be funded? E.g. is this like Chrome's Web Speech API, which runs on Google's own servers for free? Or would the browser user need to sign up to a provider and give their API key? The latter seems impractical.
If the "cloud fallback" runs for free on Google's own servers (for Chrome users), then that is concerning to me. The Web Speech API took the same approach, and it's unusably bad on Firefox because they don't have the Google-level resources to provide free compute to billions of people. So the Web Speech API is (or was, last I checked) basically unusable due to how bad Firefox's on-device voices are. This would be a much more severe issue with LLMs due to how compute-intensive they can be. If this API becomes important, then browser companies like Firefox would be in a rough spot, and new browser projects would be dead on arrival unless they were backed by a very large company.
Thanks all for the valuable feedback!
One misconception I'm seeing here repeatedly is people evaluating these APIs as additions to the prompt API. It'd be best for people to evaluate whether they're useful in a world where there is no prompt API. After all, the prompt API is not yet proposed to WICG or elsewhere, for good reason: reaching interoperability on it could be quite challenging. We tried to make the relationship clear:
> Even more so than many other behind-a-flag APIs, the prompt API is an experiment, designed to help us understand web developers' use cases to inform a roadmap of purpose-built APIs.
The writing assistance APIs are precisely the "purpose-built APIs" alluded to.
Of course, as @josephrocca mentions, using the behind-the-flag prompt API to help understand the requirements for robust writing assistance APIs is very valuable, and we would definitely use the feedback gathered from such experiences to make the writing assistance APIs robust.
> What is the benefit of cloud-based models, and how would they be funded?
The intent here is the opposite of what you seem to be imputing. It's actually to be more inclusive, by not requiring browsers to ship their own language model. This allows more browsers to participate in implementing these APIs, not fewer.
> by not requiring browsers to ship their own language model. This allows more browsers to participate in implementing these APIs
That makes sense, but I'm talking about a different concern. Rather than "(in)ability to ship a feature" I'm talking about "(in)ability to provide free cloud compute to make a potentially-very-important feature competitive with very-well-funded browser companies".
> It'd be best for people to evaluate whether they're useful in a world where there is no prompt API. [...] as reaching interoperability on it could be quite challenging
Ah, I see. It's honestly hard for me (and perhaps others too given that, as you said, it is a common misconception) to imagine having only high-level "task" APIs in this area, given that a whole LLM is being downloaded to the device, and perhaps given that it seems difficult to cover much of the use-case distribution of this (very general-purpose) technology with only task-specific APIs.
That said, since the writer API "writes new material, given a writing task prompt", I think it may actually be lower-level than the Prompt API depending on some specifics (which I admit I should probably better-acquaint myself with before commenting here).
I currently exclusively use "instruct format" (akin to single-turn chat) rather than "chat format" in development, since the former is a lower-level format within which you can easily implement chat-style experiences, but with more control (e.g. the common chat formats aren't really designed for multiple non-user "characters" in the same thread). The instruct prompt is simply:
```
Here's a chat log:
---
[chat logs formatted however you like]
---
Please write the next message.
```
or something roughly to that effect (often with added role/character definitions, and sometimes extra context like when using retrieval-augmented generation techniques).
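As an illustration only, the instruct-style prompt above could be assembled from structured chat messages with a small helper; the `{name, text}` message shape and the function name are made up here, not part of any API:

```javascript
// Builds the single-turn "instruct" prompt sketched above from a list of
// chat messages. Keeping the format in userland like this is what gives
// the extra control the comment describes (e.g. multiple non-user
// "characters" in one thread).
function buildInstructPrompt(messages) {
  const log = messages.map((m) => `${m.name}: ${m.text}`).join("\n");
  return [
    "Here's a chat log:",
    "---",
    log,
    "---",
    "Please write the next message.",
  ].join("\n");
}
```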
So, assuming "The writer API writes new material, given a writing task prompt" can be taken fairly literally, I think you can basically replace all mentions of "Prompt API" with "Writer API" in my earlier comment, and I'd probably advocate for releasing only the Writer API for now, and holding off on task-specific, higher-level APIs like summarization and rewriting until/if pain points are found.
Great work from what I've seen so far; looking forward to integrating these into some of my projects. +1
Hey WICG chairs! There's been some spirited discussion here, with both interest and disinterest from many quarters. I think that shows us that it's worth incubating these ideas in a forum with good IPR and contribution policies like WICG. What do you say?
While this is more an implementation than an API issue, to be consistent with the principles of an ethical web let's make sure that the models behind these APIs are open data, and that the training data has been ethically sourced - by which I mean with explicit consent for this purpose.
> let's make sure that the models behind these APIs are open data, and that the training data has been ethically sourced - by which I mean with explicit consent for this purpose.
I think it should be made more explicit that this is a call for debate on fair/ethical use of publicly available data for training predictive models.
I say this because the comment appears to make assertions which would be more appropriately posed as opinions or questions, given that there are no overwhelming arguments or consensus in either direction.
I'm not sure that this is the place for such a debate, but I agree that these are important questions.
(For the record, open data/weights may be feasible, but explicit opt-in for all training data would likely make this API impossible - unless synthetic data from upstream models [with less restrictive data policies] were used.)
Thanks @cwilso for helping us complete the move to WICG! I look forward to more discussions with you all over there. https://github.com/WICG/writing-assistance-apis
Introduction
Browsers and operating systems are increasingly expected to gain access to a language model. (Example, example, example.) Web applications can benefit from using language models for a variety of use cases.
We're proposing a group of APIs that use language models to give web developers high-level assistance with writing. Specifically:
Because these APIs share underlying infrastructure and API shape, and have many cross-cutting concerns, we include them all in one explainer, to avoid repeating ourselves across three repositories. However, they are separate API proposals, and can be evaluated independently.
Read the complete Explainer.
Feedback
I welcome feedback in this thread, but encourage you to file bugs against the Explainer.