IMAGE-interrogator

Want to improve your SOTA image captioning experience for Stable Diffusion? The IMAGE Interrogator is here to make this job easier!

About

The IMAGE Interrogator is a variant of the original CLIP Interrogator tool that keeps all of the original features and adds other large models, such as LLaVa and CogVLM, for SOTA image captioning. You can then fine-tune models on your own datasets with the resulting captions, or use the resulting prompts with text-to-image models like Stable Diffusion on DreamStudio to create cool art!

⏰ Update

Installation

Use Python version 3.10.* and make sure the Python virtual environment module is installed. Linux users will need to install Tkinter with the following command:

sudo apt-get install python3-tk

Then run the following commands in the terminal:

git clone https://github.com/DEVAIEXP/image-interrogator.git
cd image-interrogator
(for linux  ) source install_linux.sh
(for windows) install_windows.bat

Additional parameters for the installation scripts:

Assuming T for True and F for False:

Running with customization

The start.sh and start.bat scripts launch the image-interrogator.py script via Python. The Python script accepts some parameters:

Note: The Linux script is configured to run on WSL 2. If you are running on a native Linux installation, you will need to adjust the LD_LIBRARY_PATH variable in that file so that it points to the path of your CUDA Toolkit libraries (for example, /usr/local/cuda/lib64).

IMAGE Interrogator supports 4-bit and 8-bit quantization for low memory usage (for CogVLM and CogAgent, only 4-bit quantization is enabled). Precision parameters such as FP16 and BF16 have also been added to the interface. On systems with low VRAM, you can try 4-bit quantization or check 'Optimize settings for low VRAM' in the Load options in the interface. This will reduce the amount of VRAM needed (at the cost of some speed and quality).
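For reference, 4-bit loading of a model typically looks like the sketch below. This is a minimal, hypothetical example using the Hugging Face transformers and bitsandbytes libraries; the model id and settings are illustrative and not necessarily the exact configuration this tool uses.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical example: load a model with 4-bit weights and BF16 compute to cut VRAM usage.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "your/model-checkpoint",                # illustrative placeholder model id
    quantization_config=quant_config,
    device_map="auto",
)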

Interface parameters

Prompt tab

The Generate options let you enable or disable extra outputs that can be generated during or after prompt generation.

Caption tab

In this tab you can choose your preferred model from the list for generating the caption. For some models, such as LLaVa, additional parameters are available, such as temperature and top p. LLaVa, CogAgent, CogVLM and Kosmos-2 allow you to use question prompts for generation. A question prompt is suggested automatically when the model is selected, but you can change the prompt text to ask whatever you want about the image. For more information about how to write prompts for these models, see the official page of the chosen model.
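As a rough illustration of what the temperature and top p parameters control during sampling, here is a small, self-contained sketch (hypothetical code, not taken from any of these models):

import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9):
    # Hypothetical sketch of temperature scaling plus nucleus (top-p) sampling.
    scaled = (logits - logits.max()) / temperature   # lower temperature -> sharper distribution
    probs = np.exp(scaled)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest prefix covering top_p probability mass
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return np.random.choice(keep, p=kept_probs)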

Features tab

This tab is used to select the OpenCLIP pretrained CLIP model. Only one feature mode can be selected at a time.
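In OpenCLIP, a pretrained CLIP model is identified by an architecture name plus a pretraining tag. As a hypothetical reference (using the open_clip package; the checkpoint shown is only an example), the available combinations can be listed like this:

import open_clip

# List the available (architecture, pretraining tag) combinations.
for arch, tag in open_clip.list_pretrained()[:5]:
    print(arch, tag)

# Hypothetical example: instantiate one of them.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")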

Action buttons

Analyze tab

For the given image and model, it returns a list of terms in each feature category together with their scores.
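Conceptually, such scores come from comparing the CLIP embedding of the image with the text embeddings of candidate terms. Below is a minimal, hypothetical sketch using open_clip; the candidate terms, image path and checkpoint are illustrative, not the tool's exact implementation:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

candidates = ["oil painting", "photograph", "3d render"]    # illustrative feature terms
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # illustrative image path
text = tokenizer(candidates)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(0)  # cosine similarity per term

for term, score in zip(candidates, scores.tolist()):
    print(f"{term}: {score:.3f}")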

Others

Do not update the LLaVa, Gradio or PIL dependencies yourself, or this tool will not work correctly. When these dependencies need to be updated, we will update them in this repository. Whenever there is a new update to this repository, you will need to delete the 'repositories' directory and run the installation script again.

License

This project is released under the MIT license.

Acknowledgement

This project is based on CLIP Interrogator, and some code is borrowed from LLaVa. Thanks for their awesome work.

Contact

If you have any questions, please contact: contact@devaiexp.com