goblin776655 opened this issue 1 year ago
Great idea!!!
Use one of these extensions. If you need to unload the previous model, you can use supermerger's unload model button.
- https://github.com/p1atdev/stable-diffusion-webui-blip2-captioner
- https://github.com/Tps-F/sd-webui-blip2
I was working on this, but I am unable to understand some of the code structure. I will figure it out and try to add this feature, but if possible, could you give me a brief overview of the overall structure?
@ArjunDevSingla I hope this helps 👇
Here's a brief summary of @p1atdev stable-diffusion-webui-blip2-captioner/blip2.py:
This Python code defines a `BLIP2` class that is used to generate captions for images with a pre-trained model. The code uses the PyTorch library and relies on a separate module called `lavis.models`.

- Imports `torch`, `typing`, `PIL.Image`, and `lavis.models`.
- Defines a `BLIP2` class with an `__init__` method that takes a `model_type` argument:
  a. Determines the device (GPU or CPU) for running the model based on the availability of CUDA.
  b. Loads the pre-trained model and preprocessors using the `load_model_and_preprocess` function from `lavis.models`.
- Defines a `generate_caption` method on the `BLIP2` class with several parameters, including the input image and options for controlling the caption generation process:
  a. Preprocesses the input image using the visual preprocessor and moves it to the appropriate device (GPU or CPU).
  b. Generates captions using the pre-trained model and the given parameters for beam search, nucleus sampling, maximum and minimum caption length, and repetition penalty.
  c. Returns the generated captions.
- Defines an `unload` method on the `BLIP2` class that frees up memory by deleting the model and preprocessors and clearing the GPU cache.

The code provides an interface for loading a pre-trained model, generating captions for images, and then unloading the model to free up resources.
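To make the structure concrete, here is a minimal, hypothetical sketch of what such a wrapper class could look like. The names, default values, and `lavis` call signatures below are assumptions based on the summary, not the extension's exact code (imports are deferred into the methods purely so the sketch can be read and defined without `lavis`/`torch` installed):

```python
from typing import List


class BLIP2:
    def __init__(self, model_type: str):
        # Deferred imports: this is a sketch, so the class can be defined
        # without lavis/torch present. The real code imports at module level.
        import torch
        from lavis.models import load_model_and_preprocess

        # a. Pick the GPU if CUDA is available, otherwise fall back to CPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # b. Load the pre-trained model and its image preprocessors.
        self.model, self.vis_processors, _ = load_model_and_preprocess(
            name="blip2_opt", model_type=model_type, is_eval=True, device=self.device
        )

    def generate_caption(
        self,
        image,                              # a PIL.Image
        use_nucleus_sampling: bool = False,
        num_beams: int = 3,
        max_length: int = 30,
        min_length: int = 10,
        repetition_penalty: float = 1.0,
    ) -> List[str]:
        # a. Preprocess the image and move the tensor to the model's device.
        image_tensor = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
        # b./c. Generate and return captions with the given decoding parameters.
        return self.model.generate(
            {"image": image_tensor},
            use_nucleus_sampling=use_nucleus_sampling,
            num_beams=num_beams,
            max_length=max_length,
            min_length=min_length,
            repetition_penalty=repetition_penalty,
        )

    def unload(self):
        import gc
        import torch

        # Drop references and clear the CUDA cache to free memory.
        del self.model
        del self.vis_processors
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```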
Here is a brief summary of @p1atdev stable-diffusion-webui-blip2-captioner/scripts/main.py:

This script is a Python program for generating captions for images using `BLIP2`. It provides both single-image captioning and batch-image captioning functionalities, and uses the Gradio library to create a user interface for easy interaction.

- Imports `os`, `pathlib`, `torch`, `gradio`, and `PIL`.
- Sets `ImageFile.LOAD_TRUNCATED_IMAGES` to `True` to allow loading of truncated images.
- Imports `script_callbacks` from the `modules` package.
- Imports the `BLIP2` class from the `blip2` module.
- Creates an empty dictionary `captioners` to store loaded models.
- Defines a list `model_list` containing the names of available models ("coco" and "pretrain").
- Defines a list `sampling_methods` containing the names of available sampling methods ("Nucleus" and "Top-K").
- Defines a function `model_check` that checks whether a model is already loaded, and loads it if it is not in the `captioners` dictionary.
- Defines a function `unload_models` that unloads all the models in the `captioners` dictionary and clears the GPU cache.
- Defines a function `generate_caption` that takes an image and various caption-generation parameters and returns a generated caption for the image.
- Defines a function `generate_caption_for_single_image` that takes an image and caption-generation parameters and returns a caption for the image.
- Defines a function `create_caption_file` that takes a caption and an output file path and writes the caption to a file at the specified path.
- Defines a function `batch_captioning` that takes input and output directories, a caption file extension, and caption-generation parameters, generates captions for all the images in the input directory, and saves them to the output directory.
- Defines a function `on_ui_tabs` that creates the Gradio user interface with two tabs: "Single" for single-image captioning and "Batch" for batch-image captioning. The interface includes various input elements, such as image upload, text boxes, dropdowns, sliders, and buttons.
- Registers the `on_ui_tabs` function with the `script_callbacks` module using its `on_ui_tabs` method.
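The model-cache and caption-file parts of that flow can be sketched in a few lines. This is a hypothetical, stdlib-only illustration of the pattern the summary describes: the `loader` argument stands in for the real `BLIP2(...)` constructor, and the real `unload_models` would also call each model's `unload` and clear the GPU cache:

```python
from pathlib import Path

captioners: dict = {}  # model name -> loaded model, mirroring the script's cache


def model_check(name: str, loader=lambda name: object()):
    # Load the model only if it is not already in the cache,
    # so repeated captioning calls reuse the same instance.
    if name not in captioners:
        captioners[name] = loader(name)
    return captioners[name]


def unload_models():
    # Release every cached model. The real script would also call each
    # model's unload() and torch.cuda.empty_cache() here.
    captioners.clear()


def create_caption_file(caption: str, output_path: str):
    # Write the caption to a text file next to the image,
    # creating parent directories if needed.
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(caption, encoding="utf-8")
```

Caching by name is what lets the "coco" and "pretrain" variants coexist without reloading on every request, at the cost of VRAM until `unload_models` is called.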
Is there an existing issue for this?
What would your feature do?
It will use BLIP2 models to generate text descriptions of images.
Proposed workflow
Additional information
No response