RFC: AI Core Gem - Githubissues

Summary:

With the recent raise of generative AI models such as GPT 4, tools are emerging to plug them into existing workflows and to create new workflows that they enable. This proposal brings forth the new AI Core Gem, which is meant to help O3DE developers to utilize modern AI in games and simulations.

What is the relevance of this feature?

Given the new possibilities coming from recent advances in AI, game and simulation developers are looking to apply these new capabilities in their creations. Many steps that these developers need to take are common regardless of type of application, for example:

Connecting with AI services that are remote or local, 3rd-party or self hosted.
Communicating with multi-model, multi-modal (text, image, video, audio, ..) AI services.
User experience and interface for prompting, handling errors, iterating on tasks, and basic human-AI collaboration workflow.
Global configuration for URI, usage limits, default models, etc.
Simple visualization of AI outputs and logs.
An extendable Remote Procedure Call (RPC)-like API for AI to call, enabling whitelist approach to AI command of O3DE.
Default support for a selected 1-2 best (by some metric) open source models.
Modular design with good developer interfaces, allowing to easily add new modalities and features.
Support/abstraction for awareness of used tokens, pricing and quota (to be implemented in vendor Gems).

The AI Core Gem is quite different from Machine Learning Gem in that it focuses on Generative AI instead on multi-layer perceptrons. Considering dynamic nomenclature in the space and versatility of the Gem though, there are considerations against including "Generative" in the name.

The AI Core Gem is meant for O3DE Gems developers, and it is expected to be a dependency of future gems such as AI characters, assistants, and scene generators. Unlike these future gems, The AI Core Gem value does not strictly depend on current capabilities of AI models, as it is meant as a tool to explore their limits and is meant to be built in a flexible way, benefiting from improvements in these capabilities over years.

In the long run, from the perspective of game development, this feature can help to build smart characters that interact uniquely with the player, and assist in writing dialogue as well as creating 3D world.

From the perspective of simulation, it can help to create robots, humans (in roles such as pedestrians, warehouse workers) and animals that behave in certain ways with less scripting, build smartly randomized simulation scenes and assist users in running validation scenarios as well as summarizing their results.

Feature design description:

Connectivity and communication with Generative AI services

Generative AIs can be used through 3rd-party hosted services, such as Amazon Bedrock or Open AI's GPT platform. These typically involve a pricing model per token, depending on model type and modality. There are proprietary models and open-source models. It is also possible to host models locally, including through tools such as Ollama or vLLM.

AI services increasingly offer additional modalities (such as image prompting) as well as complex services, such as Assistants.

Since the pace of development and emerging of new APIs is rapid, it is important for the AI Core Gem to be flexible and extendable in its implementation of connectivity and communication. As such, the approach is to be:

Abstract common APIs such as model selection, text prompting and response streaming, so that these abstractions can be easily adapted into vendor-dependent APIs. Provide default (and extendable) set of RPC-like O3DE APIs for AI to interact with.
Include implementation for one selected open source model only (for example, Mixtral).
Leave vendor-specific APIs to other dependent Gems. For example, developers might build AWS AI Gem or OpenAI Gem.

The communication layer will be abstracted, allowing to support local-GPU run models in the future, as well as streaming connections such as websockets. With the first release, the feature set will rely on HttpRequestor gem for communication.

Global settings

While it makes sense to picture a use-case where more than one vendor's AI is used within one project, as they can easily have different strengths, as a first step it is good to start simple and have one global setting for the AI features, much like Physics Settings.

These global settings will include URI and other connectivity settings such as for authorization, usage limits, default models for each modality, and user preference settings for things like visualization.

The first version will only include URI and default model selection.

Global settings will be accessible through Editor menu and through registry key settings.

AI calling O3DE interfaces

Core value of this gem is to allow AI to perform work through O3DE interfaces. Examples include:

Placing and moving objects in the scene based on existing prefabs and assets, to rearrange the scene, e.g. create variants of it.
Creating new levels.
Creating new entities with components.
Interacting with Asset Processor to add generated 3D objects.
Adding a script.
Calling EBus Editor APIs of components, for example to configure them.
Calling EBus APIs of many components in game mode, for example to cause directed movement, pursue goals etc.
Requesting specific data (see #o3de-sharing-data-with-ai).

These APIs will be exposed through a kind of reflection mechanism, but full documentation needs to be also shared, either as URIs or initial prompt.

O3DE sharing data with AI

To interact with O3DE in an informed way, AI will require inputs such as:

Generic text prompts, such as definitions of tasks, context, etc. Such prompts will be first ran through a prompt engineering interface (as a bus call), which will by default leave input unmodified (no-op), but could be replaced by plugins to any of the prompt engineering tools.
API (function to call, its signature) including full documentation.
User's current viewport as image, or streamed. Other forms of tracking user's activity in O3DE might also be helpful. This is not planned for the first release though.
List of available assets, prefabs.
Various files required to accomplish tasks.
For characters, their goals, behavior specification (through text) as well as specialized directives such as Robot Constitution.
In runtime, pose and sensor data for AI-controllable characters (for future gems).
In runtime, contacts, collisions and other dynamic scene information for the purpose of understanding what is happening in the virtual world.

Runtime interaction with characters and environment might be specific to application, for example simulations are likely to expose ROS interfaces. For the first implementation, sharing list of assets, generic text prompts and callable methods is enough.

Future extensions

The gem can be extended with a voice interface, allowing to prompt and give tasks to AI by simply speaking. Note that text-to-speech is a part of some vendor APIs already, so that AI can speak back.

RFCs for new AI feature gems will follow when this gem is implemented.

Technical design description:

Challenges of AI RPC interface

To rely O3DE APIs to AI service, we need to supply information of functions' signatures, document their purpose, semantics, parameters and return values. The immediate issue is that the documentation is typically provided as code comments and itself unavailable at runtime. Such documentation is also not provided through the current behavior reflection system.

Possible solutions include:

Relaying on published documentation such as O3DE Gem API Documentation and other pages (including intranet).
Providing a processed output of Doxygen generator ran on the code base.
Supplying header files with interfaces.
Extending behavior reflection system in O3DE to work well with this use case.
Providing custom documentation when defining APIs in initial prompt or attached file.

There are various drawbacks to each of these solutions, such as ensuring good workflow when custom gems are involved, including proprietary ones, exposing code base headers to 3rd party (potentially licensing issues), blast radius of changes in O3DE, avoiding noise for AI such as APIs which are not accessible, not whitelisted or irrelevant, the issue of essentially copying information with custom documentation approach. There is also an issue of being able to register interfaces and assign callbacks dynamically. Another consideration is that AI might benefit from a custom, iterative approach to return values and how much feedback it needs to function at optimal performance, which can differ significantly from how current APIs are constructed.

One considered approach is to expand on behavior reflection system in O3DE, either through providing a way to generate AI-suitable reflection, at least for selected category, or by creating another layer in the reflection system.

The other approach considered approach is a custom API registration system which supplies function name and signature, its documentation and callback.

Community comments on RPC design are especially welcome.

AI -> O3DE interface

The API registration mechanism needs to be a part of AI Core Gem developer's interface, so that custom gems and their components can register new ways of interaction.

The AI Core Gem RPC System Component will rely static RPC (constructed through Reflection mechanism) to the AI service, and allow for dynamic attachment of callbacks to existing registry entries (otherwise callbacks are considered empty, which should cause warnings). It will also allow to dynamically add API entries (including the callback). The method of relying such API description to the AI service may be implementation-dependent, by default using text prompts, but some implementations might instead produce a file with static API description and upload it to AI Assistant-like service.

The AI service will be instructed (by internally-captured, configurable prompts) to call APIs within a specified text block, making parsing of its response straightforward. Most likely, JSON format will be used to structure the API calls in text, which is a common approach in different contexts, see libjson-rpc-cpp library as an example.

O3DE -> AI interface

Providing data to AI service will be implementation dependent. AI Core Gem will include text prompting, but might include image prompting once available in popular open source models. Other modalities might be included if standardized and implemented by most of vendors. Until then, modalities other than text will be left to vendor-specific gems.

In the context of robotics, modalities other than text will be especially important, for example images from robot camera sensors, or audio commands from its human co-workers.

What are the advantages of the feature?

Once this gem is released, O3DE developers will be empowered to develop AI-based features based on this gem. This will bring users looking to explore and develop AI applications for academic or industrial use-cases to O3DE.

What are the disadvantages of the feature?

Given that the AI space is extremely dynamic, this gem needs to be supported and updated continuously. It needs to stay relevant as the space expands and AI-empowered tools become commonplace.

There is also a considerable effort to decide which interfaces to expose and understand what is possible with the technology.

Are there any alternatives to this feature?

The main alternative is treat AI as set of external tools and to focus on developing rich API for O3DE to interact with these, as opposed to integrated approach that this proposal describes.

While integrated approach involves writing some extra wrapper code and developing O3DE-side UI/UX, the advantages lie in tailored approach to collaborative content creation and ability to work better with Editor workflows. These are main reason for preference for the integrated approach.

Another alternative is not to have the AI Core Gem, but instead one gem per vendor, including open source. However, this has disadvantages of repeating the common part and it doesn't help to have the same UX for AI users in O3DE, where a common use-case will be to compare performance of AI from several vendors.

How will users learn this feature?

The Gem will be a part of canonical set, documented and cross-referenced in O3DE documentation. Publicity for the Gem is also planned, and showcase demo will be released in 2024. The gem will likely be presented alongside other AI gem(s), as it focused on core functionalities rather than user-facing features.

Are there any open questions?

How to best serve the documentation of API to avoid repetitions (with Doxygen comments) as well as respect versioning and closed source Gems?
Which use cases are possible with current performance of generative AI?
How will developers and users of dependent AI gems be able to determine whether performance is good enough for their application?
How to best protect against incurring of unreasonable costs possible due to mistakes made by developer in automated jobs that use of AI (such as infinite loops)? Should this be left to 3rd-party service account configuration? Token limits per automated job?
What is the best security model to apply?
How to best handle long wait times for some of AI work / response?
How to best handle workflow with multiple AI agents?
What is the best way to relay documentation of APIs to AI which is supposed to call them?
Should we use existing Behavior Reflection mechanism, which will require changes, or a more tailored approach?

RobotecAI / o3de-genai-gems

RFC: AI Core Gem #1