[Feat]: Implement Florence-2 as image captioning model

Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.

GNU Affero General Public License v3.0

1.57k stars, 127 forks

Describe your use-case.

I recently started using Florence-2 (https://huggingface.co/microsoft/Florence-2-base) in ComfyUI to caption images and was blown away by the lightweight model's speed, accuracy, and versatility. It is so much better than BLIP, and there is also a DocVQA version (https://huggingface.co/HuggingFaceM4/Florence-2-DocVQA). I tested both the large and base sizes, in finetuned and non-finetuned versions, and found that for captioning images the smallest base model did not seem to perform worse than the larger or finetuned models, YMMV. DocVQA worked well enough in most cases, but it needed several passes to ask for all the details, because putting all the questions in one prompt did not work very well. It would still be nice to have.

What would you like to see as a solution?

Given its accuracy and performance, I would love to be able to use Florence-2 as an image captioning model in OneTrainer, with at least its presets "caption image", "detailed caption", and "more detailed caption" (as the ComfyUI node implementation at https://github.com/kijai/ComfyUI-Florence2 does).
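To make the three presets concrete, here is a minimal sketch of how they could map onto Florence-2 task prompts when the model is loaded through Hugging Face transformers. The names `CAPTION_PRESETS`, `task_prompt_for`, and `caption_image` are hypothetical and not part of OneTrainer or the ComfyUI node. The task tokens and the generate/post-process calls follow the usage shown on the Florence-2 model card, and the generation settings are only example values.

```python
# Hypothetical mapping from the preset names in this request to Florence-2 task prompts.
CAPTION_PRESETS = {
    "caption image": "<CAPTION>",
    "detailed caption": "<DETAILED_CAPTION>",
    "more detailed caption": "<MORE_DETAILED_CAPTION>",
}


def task_prompt_for(preset: str) -> str:
    """Return the Florence-2 task prompt for a preset name (case-insensitive)."""
    key = preset.strip().lower()
    if key not in CAPTION_PRESETS:
        raise ValueError(f"Unknown preset {preset!r}; expected one of {sorted(CAPTION_PRESETS)}")
    return CAPTION_PRESETS[key]


def caption_image(image_path: str, preset: str = "caption image",
                  model_id: str = "microsoft/Florence-2-base") -> str:
    """Caption one image. Loads the model on every call, which a real tool would avoid."""
    # Heavy third-party imports are kept local so the mapping above works without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    task = task_prompt_for(preset)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Florence-2 ships custom model code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=task, images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=256,  # example value, not a tuned setting
            num_beams=3,
            do_sample=False,
        )

    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    # For the caption tasks the parsed result is a dict keyed by the task prompt.
    return parsed[task].strip()
```

Calling `caption_image("photo.jpg", "more detailed caption")` would then return the caption as a plain string that could be written next to the image as a `.txt` file; the file name here is only an example.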

If feasible to implement, this could also be used to auto-create masks in the dataset tools' "mask-auto-creation" area. From my testing it seems to mask features you prompt for really well, and real fast too.
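As a rough illustration of the masking idea, the sketch below turns prompt-grounded bounding boxes into a binary mask. It assumes the boxes come from Florence-2's `<CAPTION_TO_PHRASE_GROUNDING>` task and that the post-processed result is a dict with `bboxes` (as `[x1, y1, x2, y2]` in pixels) and `labels`. The function names are hypothetical, and nothing here reflects how OneTrainer's mask tools are actually implemented.

```python
import math

# Assumed key of the post-processed Florence-2 grounding result.
GROUNDING_TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def boxes_to_mask(bboxes, width, height):
    """Rasterize [x1, y1, x2, y2] pixel boxes into a height x width mask of 0/255 values."""
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in bboxes:
        # Clamp each box to the image bounds before filling it.
        left = max(0, min(width, math.floor(x1)))
        right = max(0, min(width, math.ceil(x2)))
        top = max(0, min(height, math.floor(y1)))
        bottom = max(0, min(height, math.ceil(y2)))
        for y in range(top, bottom):
            row = mask[y]
            for x in range(left, right):
                row[x] = 255
    return mask


def mask_from_grounding(result, width, height, label=None):
    """Build a mask from a grounding result, optionally keeping only boxes with one label."""
    grounding = result[GROUNDING_TASK]
    boxes = [
        box
        for box, box_label in zip(grounding["bboxes"], grounding["labels"])
        if label is None or box_label == label
    ]
    return boxes_to_mask(boxes, width, height)
```

Box masks are coarse. Florence-2 also has a segmentation task that returns polygons, which would give tighter masks, but rasterizing polygons needs an imaging library, so it is left out of this sketch.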

Have you considered alternatives? List them here.

Sure, but this is IMHO the best locally run visual model right now.

As a user of OT (I do not use ComfyUI or Florence), I have a few questions about your request that I would like to clarify:

Use-Case Specifics: How exactly would integrating Florence-2 into OneTrainer improve your training workflow? OneTrainer is mainly for training, not tagging. What issues are you facing with Florence-2 in ComfyUI that you think OneTrainer could solve?
Alternatives Exploration: Have you tried other tools like DatasetHelpers and TagGUI? TagGUI already supports Florence-2. What are these tools missing that OneTrainer could provide?
Performance Metrics: Can you share any data or benchmarks comparing Florence-2's performance in captioning a diverse range of pictures (this includes NSFW) versus other captioning/tagging models? This would help us see if it's really worth it.
Purpose & Implementation Feasibility: OneTrainer's image and mask captioning features are pretty basic and not the main focus atm. Your request sounds a bit like implementing an alternative to GroundedDINO.

Additionally, to be even more specific: what UI changes do you think are needed to get Florence-2 working in OneTrainer the way you describe? The GUI is already in a very bad state and needs significant work (see the "sharp corners" issue).
