Clean-UI for Multi-Modal Vision Models

This project offers a user-friendly interface for working with the Llama-3.2-11B-Vision and Molmo-7B-D models.

Both the Llama-3.2-11B-Vision-bnb-4bit and Molmo-7B-D-bnb-4bit models need 12GB of VRAM to run.

The model is selected via the command line.

Installation

To set up and run this project on your local machine, follow the steps below:

Clone the repository to a convenient location on your computer:

git clone <repository-url>
cd <repository-directory>

Inside the cloned repository, create a virtual environment using the following command:

python -m venv venv-ui

Activate the virtual environment. On Windows:

.\venv-ui\Scripts\activate

On Linux or macOS:

source venv-ui/bin/activate
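
Before installing dependencies, it can help to confirm that the interpreter on your PATH really belongs to the virtual environment. The following is a minimal sketch using only the standard library; the function name in_virtualenv is made up for illustration and is not part of this project.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If this reports False after activation, the shell is most likely still picking up the system-wide Python.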

After activating the virtual environment, install the necessary dependencies from requirements.txt:

pip install -r requirements.txt

Install Torch and TorchVision using separate commands:

pip install torch==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121

and

pip install torchvision==0.19.1+cu121 --index-url https://download.pytorch.org/whl/cu121
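
Once both packages are installed, a quick sanity check can confirm that the CUDA builds are in place and show how much VRAM the first GPU reports. This is a rough sketch, not part of the project: the helper names gib and describe_install are hypothetical, and it only looks at GPU index 0. Drivers often report slightly less memory than a card's nominal size, so treat the number as a guide against the 12GB requirement rather than a strict pass/fail.

```python
def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def describe_install():
    """Return a one-line summary of the PyTorch install and the first GPU."""
    try:
        import torch
        import torchvision
    except ImportError as exc:
        # Either package is missing from the active environment.
        return f"Import failed: {exc}"

    summary = f"torch {torch.__version__}, torchvision {torchvision.__version__}"
    if not torch.cuda.is_available():
        return summary + ", no CUDA device visible"

    # Query the first GPU only (index 0 is an assumption of this sketch).
    props = torch.cuda.get_device_properties(0)
    return f"{summary}, {props.name} with {gib(props.total_memory):.1f} GiB"


if __name__ == "__main__":
    print(describe_install())
```

With the commands above, the reported versions should read 2.4.1+cu121 and 0.19.1+cu121.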

To start the UI, you can either:

Features

Upload an image and enter a prompt to generate an image description.
Adjust parameters such as temperature, top-k, and top-p for more control over the generated text.
View prompt-response interactions in the chatbot history.
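
To illustrate what the temperature, top-k, and top-p parameters control, here is a generic, dependency-free sketch of how these settings typically shape the choice of the next token. It is not this project's code, which presumably hands the values to the model's own generation routine; the function name sample_token and the toy logits are made up for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one index from `logits` using temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature rescales the logits: values below 1 sharpen the
    # distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    toy_logits = [1.0, 5.0, 2.0, 0.5]  # made-up scores for four tokens
    print(sample_token(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperature and smaller top-k or top-p values make the output more predictable; higher values allow more varied descriptions.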

This project is licensed under the MIT License. See the LICENSE file for more details.