JoyCaption is an open, free, and uncensored captioning Visual Language Model (VLM).
Try the Demo on HuggingFace | Download the Current Model on Hugging Face | Latest Release Post
JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.
Key Features:
Automated descriptive captions enable the training and finetuning of diffusion models on a wider range of images, since trainers are no longer required to either find images with already-associated text or write the descriptions themselves. They also improve the quality of generations produced by Text-to-Image models trained on them (ref: DALL-E 3 paper). But to date, the community has been stuck with ChatGPT, which is expensive and heavily censored, or with alternative models like CogVLM, which are weaker than ChatGPT and have abysmal performance outside of the SFW domain.
I'm building JoyCaption to help fill this gap by performing near or on par with GPT-4o in captioning images, while being free, unrestricted, and open.
To see JoyCaption in action, check out the demo on HuggingFace Spaces.
To use JoyCaption locally, you can download the model from Hugging Face and integrate it into your existing workflows.
NOTE: This example is a bit verbose because the current release of JoyCaption does not have a transformers Processor configured yet, so the preprocessing has to be done manually. Sorry!
```
import torch
import torch.amp
import torchvision.transforms.functional as TVF
from PIL import Image
from transformers import AutoTokenizer, LlavaForConditionalGeneration

IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "fancyfeast/llama-joycaption-alpha-two-hf-llava"

# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model into the first GPU
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load and preprocess image
    # Normally you would use the Processor here, but the image module's processor
    # has some buggy behavior and a simple resize in Pillow yields higher quality results
    image = Image.open(IMAGE_PATH)

    if image.size != (384, 384):
        image = image.resize((384, 384), Image.LANCZOS)

    image = image.convert("RGB")
    pixel_values = TVF.pil_to_tensor(image)

    # Normalize the image
    pixel_values = pixel_values / 255.0
    pixel_values = TVF.normalize(pixel_values, [0.5], [0.5])
    pixel_values = pixel_values.to(torch.bfloat16).unsqueeze(0)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    convo_string = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

    # Tokenize the conversation
    convo_tokens = tokenizer.encode(convo_string, add_special_tokens=False, truncation=False)

    # Repeat the image tokens
    input_tokens = []
    for token in convo_tokens:
        if token == llava_model.config.image_token_index:
            input_tokens.extend([llava_model.config.image_token_index] * llava_model.config.image_seq_length)
        else:
            input_tokens.append(token)

    input_ids = torch.tensor(input_tokens, dtype=torch.long).unsqueeze(0)
    attention_mask = torch.ones_like(input_ids)

    # Generate the caption
    generate_ids = llava_model.generate(
        input_ids=input_ids.to('cuda'),
        pixel_values=pixel_values.to('cuda'),
        attention_mask=attention_mask.to('cuda'),
        max_new_tokens=300,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[input_ids.shape[1]:]

    # Decode the caption
    caption = tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
```
JoyCaption Alpha Two offers multiple modes of caption generation to suit different needs. Descriptive Caption prompting is the most useful, with the other modes being experimental. The HuggingFace demo has a nice interface for selecting the output mode and extra options, and it outputs the prompt it used. Otherwise, here are all the prompts that JoyCaption Alpha Two understands:
Descriptive Caption: Writes descriptive captions for the image, either in a formal or informal tone.
Training Prompt: Writes more like the average Stable Diffusion prompt, with a mixture of natural language and booru-like tags, mimicking what users might prompt SD with to get the image.
MidJourney: Similar to Training Prompt mode but more like MidJourney prompts.
Booru Tag List: Writes a list of Booru-style tags for the image.
Booru-Like Tag List: Similar to Booru Tag List mode, but will write outside the strict list of tags that boorus use.
Art Critic Analysis: Writes an analysis of the image like an art critic.
Product Listing: Writes a product listing-style caption for the image.
Social Media Post: Writes a caption for the image suitable for a social media post.
Extra instructions can also be appended to the prompt to guide the caption generation; the HuggingFace demo lists the available Extra Options and shows the exact prompt text they add.
WARNING: Alpha Two was heavily trained on the above Prompts and Extra Options. It is not a general instruction follower. Feel free to experiment outside of these prompts, but don't expect great results (yet).
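Since Alpha Two is steered entirely by these prompt strings, switching modes is just a matter of swapping the prompt you send. As a convenience, the generation steps from the example above can be wrapped into a small helper. This is only a minimal sketch: it assumes the tokenizer and llava_model objects (and imports) from that example are already loaded, and the helper name caption_image and its defaults are illustrative choices, not part of JoyCaption itself. Copy the exact prompt wording for each mode from the HuggingFace demo, which prints the prompt it used.

```
# Minimal sketch, assuming the imports and the loaded `tokenizer` and
# `llava_model` from the example above. The function name and defaults are
# illustrative, not part of JoyCaption itself.
@torch.no_grad()
def caption_image(image_path: str, prompt: str, max_new_tokens: int = 300) -> str:
    # Same Pillow-based preprocessing as in the example above
    image = Image.open(image_path)
    if image.size != (384, 384):
        image = image.resize((384, 384), Image.LANCZOS)
    image = image.convert("RGB")
    pixel_values = TVF.pil_to_tensor(image) / 255.0
    pixel_values = TVF.normalize(pixel_values, [0.5], [0.5]).to(torch.bfloat16).unsqueeze(0)

    # Build and tokenize the conversation, expanding the image placeholder token
    convo = [
        {"role": "system", "content": "You are a helpful image captioner."},
        {"role": "user", "content": prompt},
    ]
    convo_string = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
    convo_tokens = tokenizer.encode(convo_string, add_special_tokens=False, truncation=False)
    input_tokens = []
    for token in convo_tokens:
        if token == llava_model.config.image_token_index:
            input_tokens.extend([llava_model.config.image_token_index] * llava_model.config.image_seq_length)
        else:
            input_tokens.append(token)
    input_ids = torch.tensor(input_tokens, dtype=torch.long).unsqueeze(0)
    attention_mask = torch.ones_like(input_ids)

    # Generate and decode, trimming off the prompt tokens
    generate_ids = llava_model.generate(
        input_ids=input_ids.to('cuda'),
        pixel_values=pixel_values.to('cuda'),
        attention_mask=attention_mask.to('cuda'),
        max_new_tokens=max_new_tokens,
        do_sample=True,
        use_cache=True,
    )[0]
    generate_ids = generate_ids[input_ids.shape[1]:]
    return tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()

# The formal Descriptive Caption prompt from the example above; take the other
# mode prompts verbatim from the demo rather than guessing their wording.
print(caption_image("image.jpg", "Write a long descriptive caption for this image in a formal tone."))
```

Because the model stays loaded, the helper can be called repeatedly with different mode prompts without reloading anything.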
JoyCaption is currently at Alpha Two. This means that it is still under development, and improvements are continuously being made based on feedback from users.
Please note that JoyCaption is not yet ready for production use. It's an experimental release, and you may encounter mistakes, especially with interactions between characters in an image, OCR, and confusing left/right when describing objects and actions in relation to people.
Feedback is always welcome and crucial to helping me improve JoyCaption for everyone to use! If you have suggestions for improvement, notice weaknesses, or want to contribute to the project, please reach out.