
Yet Another Intelligent Assistant (YAIA)

A multimodal chat interface with access to many tools.

Description

YAIA is a sophisticated multimodal chat interface powered by advanced AI models and equipped with a variety of tools, ranging from web search and Wikipedia lookups to Python scripting, image generation, and long-form writing. The full set of capabilities is described under Key Features and Tools below.

Architecture

These are the main components:

  • A Gradio-based web interface, served locally
  • Amazon Bedrock models for text, images, and embeddings
  • An AWS Lambda function that provides a secure sandbox for running Python code
  • OpenSearch, hosting the text and multimodal semantic indexes

Examples

Here are examples of how to use various tools:

  1. Web Search: "Search the web for recent advancements in quantum computing."

  2. Wikipedia: "Find Wikipedia articles about the history of artificial intelligence."

  3. Python Scripting: "Create a Python script to generate a bar chart of global CO2 emissions by country."

  4. Sketchbook: "Start a new sketchbook and write an introduction about how to compute Pi with numerical methods."

  5. Image Generation: "Generate an image of a futuristic city with flying cars and tall skyscrapers."

  6. Image Search: "Search the image catalog for pictures of endangered species."

  7. arXiv Integration: "Search for recent research papers on deep learning in natural language processing."

  8. Conversation Generation: "Create a conversation between three experts discussing how to set up multimodal RAG."

  9. File Management: "Save a summary of our discussion about climate change to a file named 'climate_change_summary.txt'."

  10. Personal Improvement: "Here's a suggestion to improve answers: search for official sources."

  11. Checklist: "Start a new checklist to follow a list of tasks one by one."

Key Features and Tools

  1. Web Interaction:

    • DuckDuckGo Text Search: Performs web searches
    • DuckDuckGo News Search: Searches for recent news articles
    • DuckDuckGo Maps Search: Searches for locations and businesses
    • DuckDuckGo Images Search: Searches for publicly available images
    • Web Browser: Browses websites and retrieves their content
  2. Wikipedia Tools:

    • Wikipedia Search: Finds relevant Wikipedia pages
    • Wikipedia Geodata Search: Locates Wikipedia articles by geographic location
    • Wikipedia Page Retriever: Fetches full Wikipedia page content
  3. Python Scripting:

    • Runs Python scripts for computations, testing, and output generation, including text and images
    • Additional Python modules can be made available to the interpreter
    • Python code runs in a secure environment provided by AWS Lambda (see the sketch after this list)
  4. Content Management:

    • Personal Archive: Stores and retrieves text, Markdown, or HTML content, using a semantic database
    • Sketchbook: Manages a multi-page sketchbook for writing and reviewing long-form content. Supports multiple output formats:
      • Markdown (.md): For easy reading and editing
      • Word Document (.docx): For document editing
  5. Image Handling:

    • Image Generation: Creates images based on text prompts
    • Image Catalog Search: Searches images by description
    • Image Similarity Search: Finds similar images based on a reference image
    • Random Images: Retrieves random images from the catalog
    • Get Image by ID: Retrieves a specific image from the catalog using its ID
    • Image Catalog Count: Returns the total number of images in the catalog
    • Download Image: Adds images from URLs to the catalog
  6. arXiv Integration:

    • Search and download arXiv papers
    • Store paper content in the archive for easy retrieval
  7. Conversation Generation:

    • Transform content into a conversation between two to four people
    • Generate audio files for the conversation using text-to-speech
  8. File Management:

    • Save File: Allows saving text content to a file with a specified name in the output directory
  9. Personal Improvement:

    • Track suggestions and mistakes for future enhancements
  10. Checklist:

    • Manage task lists with the ability to add items, mark them as completed, and review progress
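
As a concrete illustration of the Python scripting tool above, here is a minimal sketch of how a script could be sent to the Lambda-based sandbox with boto3. The function name and the payload shape ("code" in, "output" out) are assumptions for illustration, not the exact contract used by YAIA:

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    def run_in_sandbox(code: str) -> str:
        """Send a Python script to the sandbox function and return its output."""
        response = lambda_client.invoke(
            FunctionName="yaia-code-interpreter",  # hypothetical function name
            Payload=json.dumps({"code": code}),    # assumed payload shape
        )
        result = json.loads(response["Payload"].read())
        return result.get("output", "")

    print(run_in_sandbox("print(sum(range(10)))"))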

For a comprehensive list of available tools and their usage, refer to ./Config/tools.json.

Requirements

  1. A container tool: Docker or Finch (to install Finch, follow the instructions in its documentation)
  2. Python 3.12 or newer
  3. AWS account with appropriate permissions to access Amazon Bedrock, AWS Lambda, and Amazon ECR

Installation

  1. Clone the repository:

    git clone https://github.com/danilop/multimodal-chat
    cd multimodal-chat
  2. Create and activate a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:

    pip install -r requirements.txt
  4. Set up the AWS Lambda function for code execution:

    cd LambdaFunction
    ./deploy_lambda_function.sh
    cd ..
  5. To use Selenium for web browsing, install ChromeDriver. Using Homebrew:

    brew install --cask chromedriver
  6. To output audio, install ffmpeg. Using Homebrew:

    brew install ffmpeg
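
Before moving on, it can save time to confirm that your AWS credentials resolve and that Amazon Bedrock is reachable. This optional sketch assumes default credentials and region are already configured:

    import boto3

    # Confirm credentials resolve to an account
    print(boto3.client("sts").get_caller_identity()["Account"])

    # Confirm Amazon Bedrock is reachable in the configured region
    models = boto3.client("bedrock").list_foundation_models()["modelSummaries"]
    print(f"{len(models)} foundation models visible")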

Setting up OpenSearch

You can either use a local OpenSearch instance or connect to a remote server. For local setup:

  1. Navigate to the OpenSearch directory:

    cd OpenSearch/
  2. Set the admin password (first-time setup). This step creates the .env and opensearch_env.sh files:

    ./set_password.sh
  3. Start OpenSearch locally (it needs access to the .env file):

    ./opensearch_start.sh
  4. Ensure OpenSearch (2 nodes + dashboard) starts correctly by checking the output

  5. To update OpenSearch, download the new container images using this script:

    ./opensearch_update.sh

For remote server setup, update the client creation code in the main script.
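
As a rough sketch of what that client creation might look like with the opensearch-py library (the endpoint and credentials below are placeholders, not values used by the project):

    from opensearchpy import OpenSearch

    client = OpenSearch(
        hosts=[{"host": "my-opensearch.example.com", "port": 9200}],
        http_auth=("admin", "<admin-password>"),
        use_ssl=True,
        verify_certs=True,
    )
    print(client.info())  # returns cluster metadata if the connection works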

To change the password, delete the containers using Finch or Docker and then set a new password.

Usage

Default models for text, images, and embeddings are set in the Config/config.ini file. Models are specified using Amazon Bedrock model IDs or cross-region inference profile IDs. You need permissions and access to these models, as described in the Amazon Bedrock documentation under "Access foundation models".
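
To verify model access independently of the application, here is a hedged sketch that reads a model ID from the configuration file and calls it through the Bedrock Converse API. The section and key names, and the fallback model ID, are illustrative assumptions rather than the exact layout of config.ini:

    import configparser
    import boto3

    # Section/key names are illustrative; check Config/config.ini for the real ones.
    config = configparser.ConfigParser()
    config.read("Config/config.ini")
    model_id = config.get("DEFAULT", "text_model",
                          fallback="anthropic.claude-3-5-sonnet-20240620-v1:0")

    bedrock_runtime = boto3.client("bedrock-runtime")
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": "Hello!"}]}],
    )
    print(response["output"]["message"]["content"][0]["text"])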

This section assumes OpenSearch is running locally in another terminal window as described before.

  1. Load the OpenSearch admin password into the environment:

    source OpenSearch/opensearch_env.sh
  2. Run the application:

    python multimodal_chat.py
  3. To reset the text and multimodal indexes (note: this doesn't delete images in ./Images/):

    python multimodal_chat.py --reset-index
  4. Open a web browser and navigate to http://127.0.0.1:7860/ to start chatting.

Demo videos

Here are a few examples of what you can do with this application.

Browse the internet and use the semantic archive

Video: Multimodal Chat Demo 1 – Browse the internet and use the semantic archive

Import and search images

Video: Multimodal Chat Demo 2 – Import and search images

Generate and search images

Video: Multimodal Chat Demo 3 – Generate and search images

Python code interpreter

Video: Multimodal Chat Demo 4 – Python code interpreter

Writing on a "sketchbook"

Video: Multimodal Chat Demo 5 – Writing on a "sketchbook"

Sketchbook with a Python code review

Video: Multimodal Chat Demo 6 – Sketchbook with a Python code review

Troubleshooting

Contributing

Contributions to YAIA are welcome! Please refer to the contributing guidelines for more information on how to submit pull requests, report issues, or request features.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Usage Tips

For more detailed information on specific components or advanced usage, please refer to the inline documentation in the source code.