ControlNet opened this issue 2 years ago
We already trained NFNet, RegNet, and ConvNeXt on danbooru2020.
The results are quite poor, and a tagger alone is not that useful .... we don't really know what such a trained model can do.
You should check out this paper for what comes next: "Transfer Learning for Pose Estimation of Illustrated Characters", Chen & Zwicker 2021.
But the downstream-task experiments show the same thing: the results are poor, not even as good as models pretrained on common data ....
If you want image-only self-supervised training, that is also disappointing; see "Train vision models with vissl + illustrated images".
What we need is a VLM, text-image pretrained (CLIP-style) on an anime subset of LAION (with the huge amount of anime-unrelated data removed).
Preparing and training on that data, rather than training on danbooru20xx, would give us a proper pre-training dataset.
Then we can build an open-vocabulary detector and a good captioner, and get into the anime storytelling side of things.
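As a rough illustration of the filtering step described above, something like the sketch below could keep only anime-related rows from a LAION metadata shard; the parquet path, the TEXT column name, and the keyword list are assumptions, not an actual recipe from this thread.

```python
# Rough sketch: filter a LAION-style metadata shard down to anime-related
# rows by keyword matching on the caption. The file name and the "TEXT"
# column name are assumptions based on typical LAION metadata dumps.
import pandas as pd

ANIME_KEYWORDS = [
    "anime", "manga", "pixiv", "danbooru", "illustration",
    "fanart", "vtuber", "chibi",
]

def is_anime_caption(caption: str) -> bool:
    caption = caption.lower()
    return any(k in caption for k in ANIME_KEYWORDS)

df = pd.read_parquet("laion_metadata_shard_0000.parquet")  # hypothetical path
mask = df["TEXT"].fillna("").map(is_anime_caption)
anime_subset = df[mask]
print(f"kept {len(anime_subset)} / {len(df)} rows")
anime_subset.to_parquet("laion_anime_subset_0000.parquet")
```

In practice a keyword filter like this would only be a first pass; a trained anime-vs-photo classifier on the images themselves would be needed to remove the remaining unrelated data.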
I don't have any benchmark test/score of DeepDanbooru for the latest model.
@ControlNet Try this: https://github.com/lucidrains/x-clip You can use a pretrained text encoder and only train the image encoder. The only thing is that you will need to turn danbooru tags into sentences that can be interpreted by the pretrained text encoder first.
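As a rough sketch of the two ingredients in that suggestion (turning danbooru tags into a sentence, and freezing a pretrained text encoder so only the image encoder is trained), something like the following could work; it uses the Hugging Face CLIP text tower as a stand-in rather than x-clip's own API, and the caption template and model name are only assumptions.

```python
# Sketch: convert danbooru-style tags into a caption and freeze a pretrained
# text encoder so only the image encoder would receive gradients.
# Uses the Hugging Face CLIP text tower as a stand-in; the caption template
# ("an illustration of ...") is an assumption, not a tested prompt.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

def tags_to_sentence(tags: list[str]) -> str:
    # "1girl, blue_hair, genshin_impact" -> "an illustration of 1girl, blue hair, genshin impact"
    readable = [t.replace("_", " ") for t in tags]
    return "an illustration of " + ", ".join(readable)

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze the text tower: only the image encoder would be trained.
for p in text_encoder.parameters():
    p.requires_grad = False
text_encoder.eval()

caption = tags_to_sentence(["1girl", "blue_hair", "genshin_impact"])
inputs = tokenizer(caption, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    text_emb = text_encoder(**inputs).pooler_output  # (1, hidden_dim)
print(text_emb.shape)
```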
I have the same impression as @koke2c95 that tagging models do not work well on danbooru data, as the tags are too noisy and the model cannot leverage the relationships between the tags (concepts), which also adds to the noise.
And sorry, an issue on Multi-Modal-Comparators got mentioned here by accident; it seems the cross-reference can't be removed :(
@koke2c95 Thank you for your reply.
I have decided to do simple auto-tagging (multi-label classification), aiming for something more lightweight and more accurate (a rough sketch of the setup is below). So image generation or translation is not in my plan yet.
You said the labels are very low quality. I fully understand it, as these tags are community-driven, so the noise cannot be avoided. I'm wondering whether it's possible to employ some self-supervised and weakly supervised learning techniques to improve it.
Also, is there any tag-based anime image dataset with accurate labels?
BTW, since the height-width ratio of these images varies significantly, I doubt naive resizing can extract the features well. Using a sliding window might be a better choice.
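A minimal sketch of the multi-label setup mentioned above (one sigmoid per tag, binary cross-entropy over a fixed tag vocabulary); the backbone, tag count, and input size are placeholders, not this project's actual configuration.

```python
# Sketch of multi-label tagging: a lightweight backbone with a sigmoid head
# trained with binary cross-entropy over a fixed tag vocabulary.
# Backbone choice, tag count, and tag indices are placeholders.
import torch
import torch.nn as nn
from torchvision import models

NUM_TAGS = 5000  # size of the (cleaned) tag vocabulary, illustrative

backbone = models.efficientnet_b0(weights=None)
in_features = backbone.classifier[1].in_features
backbone.classifier = nn.Linear(in_features, NUM_TAGS)  # multi-label head

criterion = nn.BCEWithLogitsLoss()  # one independent sigmoid per tag

images = torch.randn(8, 3, 224, 224)   # batch of resized crops (dummy data)
targets = torch.zeros(8, NUM_TAGS)     # multi-hot tag vectors
targets[:, [3, 42, 1007]] = 1.0        # e.g. indices for 1girl, blue_hair, ...

logits = backbone(images)
loss = criterion(logits, targets)
loss.backward()
print(float(loss))
```

At inference time, logits from several aspect-ratio-preserving crops could be max-pooled across crops, which would approximate the sliding-window idea without retraining.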
I don't have any benchmark test/score of DeepDanbooru for the latest model.
Thank you for your reply.
@Daniel8811 Thank you for your suggestions. I know CLIP is amazing work, and it's robust to unseen data, but I'm not familiar with it. If a pretrained text encoder is used, I highly doubt the anime-style labels (yuri, genshin_impact, blue_hair) can be predicted well.
@ControlNet
If a pretrained text encoder is used, I highly doubt the anime-style labels (yuri, genshin_impact, blue_hair) can be predicted well.
That could be a problem. I agree that, at the time of writing, it's still unclear whether CLIP would really work better on danbooru data.
@koke2c95 So I guess you are doing text2image at the moment. Do you have a thread or write-up for your exploration?
I have decided to do simple auto-tagging (multi-label classification), aiming for something more lightweight and more accurate.
@ControlNet Maybe you could manually clean up the danbooru tags so that the vocabulary contains fewer abstract concepts (like yuri) and more tags that are obvious to everyone (like blue hair). This may significantly reduce the noise and therefore make the model more accurate.
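One way such a cleanup could look is sketched below; the CSV layout (name, category, post_count), the category numbering, and the blacklist contents are assumptions based on Danbooru's tag metadata, not a tested recipe.

```python
# Sketch of a tag-vocabulary cleanup: keep only frequent "general" tags and
# drop abstract concepts via a manual blacklist. The CSV columns and the
# blacklist are assumptions for illustration only.
import csv

ABSTRACT_TAGS = {"yuri", "yaoi", "parody", "crossover"}  # hand-curated, partial
MIN_POSTS = 1000          # drop rare tags
GENERAL_CATEGORY = 0      # assumed Danbooru convention: 0 = general tag

def load_clean_vocab(path: str) -> list[str]:
    keep = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["name"]
            if int(row["category"]) != GENERAL_CATEGORY:
                continue              # skip character/copyright/artist/meta tags
            if int(row["post_count"]) < MIN_POSTS:
                continue              # skip rare tags
            if name in ABSTRACT_TAGS:
                continue              # skip abstract concepts
            keep.append(name)
    return keep

vocab = load_clean_vocab("danbooru_tags.csv")  # hypothetical export
print(len(vocab), vocab[:10])
```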
@Daniel8811
the vocabulary contains fewer abstract concepts (like yuri) and more tags that are obvious to everyone (like blue hair)
Yes, it's possible, although manually finding these "obvious" tags among thousands of tags is tricky.
Hi,
I love your project. Could you please provide some benchmarks (accuracy, F1, etc.) for the latest pretrained model in the release?
That would be a great help, because I also want to train some models (EfficientNet, ViT, etc.) myself. By comparing the benchmark scores, I could understand the performance of the model better.
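For reference, a minimal sketch of how such scores could be computed for a multi-label tagger, assuming multi-hot ground-truth labels and thresholded sigmoid outputs; the arrays here are random placeholder data.

```python
# Sketch: compute multi-label F1 / precision / recall for a tagger, assuming
# multi-hot ground-truth labels and sigmoid scores thresholded at 0.5.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 50))   # 100 images, 50 tags (dummy data)
y_scores = rng.random(size=(100, 50))         # stand-in for sigmoid outputs
y_pred = (y_scores >= 0.5).astype(int)

print("micro F1 :", f1_score(y_true, y_pred, average="micro"))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall   :", recall_score(y_true, y_pred, average="micro"))
```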