KichangKim / DeepDanbooru

AI-based multi-label girl image classification system, implemented using TensorFlow.
MIT License

Any benchmark for the latest released model? #61

Open ControlNet opened 2 years ago

ControlNet commented 2 years ago

Hi,

I love your project. Could you please provide some benchmarks (accuracy, F1, etc.) for the latest pretrained model in the release?

That would be a great help, because I also want to train some models (EfficientNet, ViT, etc.) myself. By comparing benchmark scores, I could better understand the model's performance.
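For context, this is roughly how I would compute such multi-label benchmarks with scikit-learn; the 0.5 threshold and the tiny example arrays are placeholders, not real results:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# y_true: (n_images, n_tags) binary ground-truth matrix (illustrative values)
# y_prob: (n_images, n_tags) sigmoid outputs from the tagger
y_true = np.array([[1, 0, 1], [0, 1, 1]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
y_pred = (y_prob >= 0.5).astype(int)  # common fixed threshold; could be tuned per tag

print("micro-F1: ", f1_score(y_true, y_pred, average="micro"))
print("macro-F1: ", f1_score(y_true, y_pred, average="macro"))
print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall:   ", recall_score(y_true, y_pred, average="micro"))
```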

koke2c95 commented 2 years ago

We already trained NFNet, RegNet, and ConvNeXt on Danbooru2020.

The results were too bad, and the tagger is useless... we don't know what these trained models can actually do.

You should check out this paper for what comes next: "Transfer Learning for Pose Estimation of Illustrated Characters", Chen & Zwicker 2021.

But its downstream-task experiments show the same thing: the results are too bad, not even as good as commonly pretrained models.

If you want the self-supervised image route, that is also bad: Train vision models with vissl + illustrated images.

What we need is a VLM, text-image pretrained (like CLIP) on a LAION anime subset (with the tons of anime-unrelated data removed; a rough filtering sketch follows below).

Preparing and training on that data, rather than training on Danbooru20xx, gives a proper pretraining dataset.

Then we could build an open-vocabulary detector and a good captioner, and get into anime storytelling.
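As an illustration of the LAION filtering I mean, here is a minimal pandas sketch; the shard file name, the TEXT column, and the keyword list are all assumptions, and a real pipeline would score images with an anime-vs-photo classifier instead:

```python
import pandas as pd

# Hypothetical LAION metadata shard; real shards are parquet files with caption/URL columns.
df = pd.read_parquet("laion_shard_0000.parquet")

# Crude keyword filter to keep likely-anime captions and drop unrelated data.
keywords = ["anime", "manga", "illustration", "pixiv", "fanart"]
mask = df["TEXT"].str.lower().str.contains("|".join(keywords), na=False)

anime_subset = df[mask]
anime_subset.to_parquet("laion_anime_subset.parquet")
```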

KichangKim commented 2 years ago

I don't have any benchmark test/score of DeepDanbooru for the latest model.

ghost commented 2 years ago

@ControlNet Try this: https://github.com/lucidrains/x-clip You can use a pretrained text encoder and only train the image encoder. The only thing is that you will first need to turn Danbooru tags into sentences that the pretrained text encoder can interpret.
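For the tag-to-sentence step, a minimal sketch of one possible approach (the template and helper name are mine, not part of x-clip):

```python
def tags_to_caption(tags):
    """Turn a list of Danbooru tags into a rough natural-language caption
    that a pretrained text encoder can interpret. Illustrative only."""
    # Danbooru tags use underscores; pretrained text encoders expect plain words.
    words = [t.replace("_", " ") for t in tags]
    return "an anime illustration of " + ", ".join(words)

print(tags_to_caption(["1girl", "blue_hair", "genshin_impact"]))
# -> "an anime illustration of 1girl, blue hair, genshin impact"
```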

I have the same impression as @koke2c95 that tagging models do not work well on Danbooru data, as the tags are too noisy and the model cannot leverage the relationships between the tags (concepts), which adds to the noise.

koke2c95 commented 2 years ago

RM

And sorry, an issue got mentioned on Multi-Modal-Comparators by mistake; it seems it can't be removed :(

ControlNet commented 2 years ago

@koke2c95 Thank you for your reply.

I've decided to do simple auto-tagging (multi-label classification) with a more lightweight and more accurate model, so image generation or translation is not in my plan yet.

You said the labels are very low quality. I fully understand, as these tags are community-driven, so the noise cannot be avoided. I'm wondering whether it's possible to employ some self-supervised and weakly supervised learning techniques to improve it.

Also, is there any tag-based anime image dataset with accurate labels?

BTW, since the height-width ratio of these images varies very significantly, I suspect naive resizing may not extract the features well. Using a sliding window might be a better choice, as in the sketch below.
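A rough sketch of the sliding-window idea, assuming a Keras-style tagger with sigmoid outputs; `model`, `win`, and `stride` are placeholders:

```python
import tensorflow as tf

def sliding_window_tags(model, image, win=512, stride=256):
    """Max-pool per-tag probabilities over overlapping square crops.
    `model` is any Keras-style multi-label tagger with sigmoid outputs."""
    h, w = image.shape[0], image.shape[1]
    crops = []
    for top in range(0, max(h - win, 0) + 1, stride):
        for left in range(0, max(w - win, 0) + 1, stride):
            crop = image[top:top + win, left:left + win]
            # Edge crops can be smaller than the window; resize them up.
            crops.append(tf.image.resize(crop, (win, win)))
    batch = tf.stack(crops)               # (n_windows, win, win, 3)
    probs = model(batch, training=False)  # (n_windows, n_tags)
    return tf.reduce_max(probs, axis=0)   # a tag fires if any window sees it
```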

> I don't have any benchmark test/score of DeepDanbooru for the latest model.

Thank you for your reply.

@Daniel8811 Thank you for your suggestions. I know CLIP is amazing work and is robust to unseen data, but I'm not familiar with it. If a pretrained text encoder is used, I highly doubt the anime-style labels (yuri, genshin_impact, blue_hair) can be predicted well.

ghost commented 2 years ago

@ControlNet

> If a pretrained text encoder is used, I highly doubt the anime-style labels (yuri, genshin_impact, blue_hair) can be predicted well.

That could be a problem. I agree that it's still unclear whether CLIP would really work better on Danbooru data at this point.

@koke2c95 So I guess you are doing text2image at the moment. Do you have a thread or write-up for your exploration?

ghost commented 2 years ago

> I've decided to do simple auto-tagging (multi-label classification) with a more lightweight and more accurate model.

@ControlNet Maybe you could manually clean up the Danbooru tags so that they contain fewer abstract concepts (like yuri) and more tags that are obvious to everyone (like blue hair). This may significantly reduce the noise and therefore make the model more accurate. A rough sketch of what I mean follows.
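Something like this minimal sketch, assuming a tag-metadata CSV with name/category/post-count columns (the file name and thresholds are made up; category 0 is Danbooru's "general" tag group):

```python
import pandas as pd

# Hypothetical dump of Danbooru tag metadata: name, category, post_count.
tags = pd.read_csv("danbooru_tags.csv")

# Keep frequent "general" tags and drop abstract concepts via a manual blocklist.
blocklist = {"yuri", "yaoi", "crossover", "parody"}
keep = tags[
    (tags["category"] == 0)
    & (tags["post_count"] >= 1000)
    & (~tags["name"].isin(blocklist))
]
keep["name"].to_csv("clean_tag_whitelist.txt", index=False, header=False)
```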

ControlNet commented 2 years ago

@Daniel8811

> so that they contain fewer abstract concepts (like yuri) and more tags that are obvious to everyone (like blue hair)

Yes, it's possible, although manually finding these "obvious" tags among thousands of tags is tricky.