MNeMoNiCuZ / joy-caption-batch

A batch captioning tool for joy_caption
MIT License

Questions about token length and Chinese characters #12

Closed. LizzyEw closed this issue 6 days ago.

LizzyEw commented 1 week ago

Hi there! Thank you for your great work! When I was making captions, I found that MAX_NEW_TOKENS only truncates sentences; it does not make the model generate concise results. I also tried limiting the number of output words in VLM_PROMPT, but that didn't work either. My other question is about Chinese characters. I found that the Hugging Face Space can generate Chinese phrases when there are Chinese characters in the image, but this repo translates the Chinese into English and outputs everything in English. I would be grateful if you could reply.

MNeMoNiCuZ commented 1 week ago

Hi. Thanks for your comments.

MAX_NEW_TOKENS

This does appear to work well for me. As you say, it truncates the sentences rather than generating concise results, and I'm not sure how to achieve the latter.

When using max 300 tokens (actual token count: 147):

> This image is a digital graphic featuring a bold, black Chinese character against a stark white background. The character is rendered in a sans-serif font, with thick, blocky strokes that give it a modern and minimalist appearance. The character, which is the Chinese word for "China," is centered and occupies the entire width of the image. The background is plain white, devoid of any additional elements or textures, ensuring that the focus remains entirely on the Chinese character. The simplicity of the design and the use of contrasting colors create a clean and professional look. The overall style is modern and minimalist, typical of digital graphic design. The image does not include any additional objects, people, or scenery, keeping the focus solely on the Chinese character.

When using max 50 tokens (actual token count: 50):

> The image is a digital graphic featuring a large, bold, black Chinese character set against a plain white background. The character, rendered in a modern, sans-serif font, is composed of thick, straight lines that intersect and overlap, creating a sense of

When using max 10 tokens (actual token count: 10):

> This image is a digital graphic featuring a large,

When using max 5 tokens (actual token count: 5):

> The image is a digital

Remember, this is a maximum token count, and the results may not work well with overly strict limits. I think you will need to experiment with this value, as well as the Temperature value, to control the type of output you get. You may also want to integrate a repetition penalty to control it further; see the sketch below.
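For reference, here is roughly where those parameters plug into a transformers `generate()` call. This is a minimal sketch, not this repo's actual code: the function, its defaults, and the shape of the `inputs` dict are illustrative, but `max_new_tokens`, `temperature`, and `repetition_penalty` are the real generation kwargs.

```python
from transformers import PreTrainedModel, PreTrainedTokenizer

def generate_caption(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    inputs: dict,                     # tokenized prompt (+ image embeds) as the script prepares them
    max_new_tokens: int = 50,         # hard ceiling; output is simply cut off here
    temperature: float = 0.5,         # lower = more deterministic phrasing
    repetition_penalty: float = 1.2,  # >1.0 discourages repeating phrases
) -> str:
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
    )
    # Count only the newly generated tokens, like the "Actual token count" above.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    print(f"Actual token count: {len(new_tokens)}")
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Note that `max_new_tokens` can only truncate: none of these knobs make the model plan a shorter caption, they just cut it off or reshape its sampling.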


VLM_PROMPT

As mentioned in the code, changing the "VLM_PROMPT" has no effect. The model is not trained to follow prompts; it will only ever output a caption, as it was trained to do.

To prompt your VLM, my current recommendation is Moondream2 or Qwen2.
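For example, Moondream2 will actually follow an instruction. Here is a minimal sketch, assuming the `encode_image`/`answer_question` API from the vikhyatk/moondream2 model card (the file path and prompt are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
# This checkpoint ships its own inference code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")
enc_image = model.encode_image(image)

# Unlike joy_caption, this model responds to the prompt itself:
print(model.answer_question(enc_image, "Caption this image in 20 words or fewer.", tokenizer))
```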


Chinese Characters & Text

Regarding the Chinese characters, could you give me some materials to test with? I tried using this image: https://ocw.mit.edu/courses/21g-109-chinese-iii-streamlined-fall-2005/09c644641f11213634a5daa907d60502_21g-109f05.jpg

And I get reasonably similar results from the online implementation on Hugging Face as I do with this local version:

[screenshots comparing the Hugging Face and local outputs]

They are not exactly the same, of course, but it seems to be doing a similar job.

So I need to know exactly what you are doing and how it's not working. When testing with Qwen2 and asking it to caption any text, I also don't get anything; it only works a little with English text.

LizzyEw commented 1 week ago

Thanks for your detailed explanation!!! Now I understand how MAX_NEW_TOKENS works and why VLM_PROMPT has no effect. Thanks for the advice; I'll try Qwen2 later. As for the Chinese characters, I ran more tests in the Hugging Face Space. It turns out that Chinese characters appeared in only a few cases, and even then they were not quite correct, so it could just have been a random event. Thank you again for your detailed point-by-point reply, and thanks for your great work ❤️