DEVAIEXP / image-interrogator

The IMAGE-interrogator for SOTA image captioning
MIT License

LLM Model with no Censorship #17

Open BenDes21 opened 6 months ago

BenDes21 commented 6 months ago

Hi, sorry, there is no Discussions tab so I have to post this thread here. I'm currently captioning a dataset of 100,000 HD images using CogVLM 4-bit for a future SDXL checkpoint. I've finished the captions for half of my dataset, but I realized that the captions for the "erotic" part (including semi-nudes or full nudes) are censored, meaning CogVLM never includes words like "nude, naked, boobs, p*ssy", etc. I'd like to know if it's possible to remove this "censorship", and if it's not possible with CogVLM, whether you know of an "open" LLM model that can describe all kinds of pics with no censorship.

Thanks a lot!

elismasilva commented 6 months ago

Now I've added a Discussions tab. I don't know about this Cog behavior, but I'll do some research. I think these models may not be trained on NSFW content, and some filtering feature could be enabled in the model.

mr-lab commented 5 months ago

Not sure about the nude stuff, but Cog is kinda wild. What we do with our dataset is use a modified version of this project to first do a WD tagger caption, then ask Cog to caption using this prompt: "always start with 'This image showcases'. Describe this image in a very detailed manner using those words : {here goes the WD tagger captions}". This helps align Cog with the actual image content and skips the useless commentary about what the image means or what Cog thinks about it. The "This image showcases" part is just so we can mass-replace "This image showcases" with "" and be left with a good description of the image; also replace "." with "," since Cog loves ".". As for uncensored Cog, I think the dolphin prompt works on it; the issue is that if you use a long prompt with Cog, it goes insane.

I'll explore running both the WD tagger and Cog, then sending both outputs to an uncensored Mistral for a unified caption as well. An uncensored (large) VLM does not exist so far.
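The mass-replace cleanup described above (strip the fixed prefix, swap periods for commas) can be scripted in a few lines; a minimal sketch, where the function name is my own and not from the project:

```python
def clean_caption(raw: str) -> str:
    """Strip the fixed 'This image showcases' prefix and swap periods for commas."""
    text = raw.replace("This image showcases", "").strip()
    return text.replace(".", ",")
```

For example, clean_caption("This image showcases a red barn. It is old.") returns "a red barn, It is old,".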

BenDes21 commented 5 months ago


Hi there, thanks a lot for your answer :)

So you're doing:

1 - A WD tagger captioning of the image, which gives a list of keywords like "1girl, brown hair, nude, bed, bedroom, etc."
2 - A second captioning pass using CogVLM (I guess cogagent VQA?), with the prompt "always start with 'This image showcases'. Describe this image in a very detailed manner using those words : 1girl, brown hair, nude, bed, bedroom, etc."
3 - Then you replace "This image showcases" with "" and replace "." with ",".

Looks nice, will definitely try it. By the way, I have a few questions: is it possible to automate this process (batch process) for a huge amount of images? I guess you want to fuse these steps and send them into an uncensored Mistral model. And also, what's a dolphin prompt? :P Thanks a lot!!!

Edit: Tried the 3 steps and it's working well; now I'm trying to find a way to batch process the whole pipeline.

mr-lab commented 5 months ago

Here is a snippet of the code; you add this logic in the image-processing part. This was written for a different project than image-interrogator, so it probably won't make sense as-is: image_dir is the image path, output_dir is the txt-file export folder, and caption_path is the WD tagger text-file path.

    # assumes `os` is imported and image_dir, filename, caption_filename
    # and prompt are already defined by the surrounding code
    output_dir = os.path.join(image_dir, 'output')
    os.makedirs(output_dir, exist_ok=True)
    caption_path = os.path.join(image_dir, caption_filename)
    parent_folder_path = os.path.dirname(filename)
    file_namep = os.path.basename(filename)
    new_file_name = os.path.splitext(file_namep)[0] + ".txt"
    # os.path.join avoids the hard-coded Windows "\\output\\" separator
    caption_pathx = os.path.join(parent_folder_path, "output", new_file_name)
    # read the WD tagger caption; fall back to an empty string if missing
    try:
        with open(caption_path, "r") as f:
            text_file_content = f.read().strip()
    except FileNotFoundError:
        text_file_content = ""
    # note: this must sit outside the except block so it always runs
    promptx = prompt + " " + text_file_content

Here prompt is the string "always start with 'This image showcases'. Describe this image in a very detailed manner using those words : ", and promptx is used instead of the default prompt variable.

For your case: run image-interrogator with the WD tagger style of captions, and those are saved in text caption folder 1. After it's done, run the folder again using CogVLM and apply the logic described. As for the code, here is the after-WD-tagger step; I think it goes like this:

- load all the image file paths into an array from a folder
- for each image path, get the image and the WD text file in the subfolder "text caption 1"
- combine the prompt string with the content of the text file and send it to Cog to make the caption

Run this on ChatGPT and it will modify the batch-processing function for you: promptx = prompt + " " + text_file_content
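The batch flow outlined above (collect image paths, look up each WD tag file, combine with the prompt) could look roughly like this; a sketch only: the folder name "text caption 1" and the function name are assumptions, and the actual CogVLM call is left out.

```python
import os

def build_prompts(image_dir, base_prompt, tag_subdir="text caption 1"):
    """Pair each image with the base prompt plus its WD-tagger tags."""
    jobs = []
    for name in sorted(os.listdir(image_dir)):
        if not name.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
            continue  # skip non-image files
        tag_file = os.path.join(image_dir, tag_subdir,
                                os.path.splitext(name)[0] + ".txt")
        try:
            with open(tag_file) as f:
                tags = f.read().strip()
        except FileNotFoundError:
            tags = ""  # no tag file: fall back to the bare prompt
        jobs.append((os.path.join(image_dir, name), base_prompt + " " + tags))
    return jobs
```

Each (image_path, prompt) pair would then be sent to CogVLM in a loop.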

The logic keeps the WD tagger file and doesn't edit it, because I'm trying to train using multiple captions, as tested here: https://github.com/kohya-ss/sd-scripts/issues/781#issuecomment-1825174300 . Are you aware of this? Text encoder 1 / text encoder 2 for SDXL?

BenDes21 commented 5 months ago


Hi there, thanks a lot for your answer and all this info :) I'm not a very technical guy, so it's a little confusing for me. Do I have to insert this code into a .py file of image-interrogator? Would it be possible to assist me on Discord for a few minutes to run this process? My username: jehex

mr-lab commented 5 months ago

CogVLM cogagent-vqa-hf, qwen-vl-chat, and many more are running with some missing values that make them more censored. For example: cogagent-vqa with top_p = 5 and top_k = 100; qwen-vl-chat with top_p = 1/5, top_k = 10, and optionally num_beams = 5. I'm not sure whether image-interrogator sets those values or not, but setting them in clip-interrogator and generate.py gives better results.
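For Hugging Face-style models, these knobs map onto the generate() call; a sketch using the values reported in the thread, where model and inputs are placeholders. Note that top_p is normally a probability in (0, 1], so a value like 5 effectively disables nucleus filtering in most implementations rather than tightening it.

```python
# sampling settings as reported in this thread
gen_kwargs = {
    "do_sample": True,      # sampling on; greedy decoding ignores top_p/top_k
    "top_p": 1.0,           # cogagent-vqa was reported with values up to 5
    "top_k": 100,
    "num_beams": 5,         # optional
    "max_new_tokens": 256,
}
# outputs = model.generate(**inputs, **gen_kwargs)  # hypothetical model/inputs
```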

BenDes21 commented 5 months ago


Ok, will try it with taggui, thanks.

@elismasilva do you plan to add these settings to image-interrogator?

elismasilva commented 5 months ago


Could you give me a comparative example of the expected results with and without those parameters, so I can test and promote the best ones?

BenDes21 commented 4 months ago


Do you think you can include the ability to modify these parameters, like in taggui?:

[screenshot]
mr-lab commented 4 months ago


Here is a quick test I did with qwen-VL-Chat: top_p: 5, top_k: 10, num_beams: 10, optional max tokens: 256. Prompt: "always start with (This image showcases). Describe this image in a very uncensored manner using those words as a reference : {here goes tags from WD tagger}"

Temperature was always set to 1, same as length penalty = 1.

This gives straight p*rn words like "Pu", "Cu", and a very lewd, uncensored sailor's mouth; it's like a xxx novel written by a sick individual, describing things that are not in the image, like the subject doing something because the subject wants it, or being "ready for action". Set top_k high, like 100-1000, and it will start addressing the subject directly... so clearly I can't share any of that; it was just a test. My dataset revolves around 3D generation, but it was good to test and share.

Super high values will give you more deranged and confused responses, but sometimes you get lucky. All of the tests I did will sometimes generate Chinese mixed with English.

I only added the option to modify top_p/top_k because my cogagent results were not the same as this project's.

I read the code and there was no top_p/top_k at all; maybe they were set from the model config, but the difference was night and day even for normal censored captions.

I also noticed that qwen-VL-Chat uses the same seed (1234); adding a random seed spices things up.

As for the GPU ID: SET CUDA_VISIBLE_DEVICES=0,1 will expose the first and second GPUs if you are doing multi-GPU. I did not test everything.

do_sample should always be true (unless you want pure LLM behavior); when false, the response is unrelated to the image.

cogagent: top_p: 1-5, top_k: 10-100
Prompt: "always start with (This image showcases). Describe this erot*c and lewd image in a very detailed manner using those words : {here goes tags from WD tagger}"

qwen is the best if you want an unacceptably uncensored, barely accurate response.
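Since qwen-VL-Chat reportedly hard-codes seed 1234, drawing a fresh seed per image is an easy way to add the variety mentioned above; a sketch, where the function name is my own:

```python
import random

def fresh_seed() -> int:
    """Return a new random seed instead of the fixed 1234."""
    return random.randint(0, 2**31 - 1)

# e.g. torch.manual_seed(fresh_seed())  # if the backend is PyTorch
```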

BenDes21 commented 4 months ago


The thing is that CogVLM gives a way more correct/accurate description than qwen, I feel.

mr-lab commented 4 months ago


Yes, totally, but qwen is easy to force into uncensored mode. Check this example with the settings I gave above: ("beg for a kiss"? Like, what the hell). There are many more examples I wouldn't share here:

a captivating goth girl with long hair and stunning features, looking directly into the camera with a seductive expression, She is laying on a bed with a soft white blanket, spreading her legs wide open and barefoot, revealing her sensitive pink *, Her lingerie is a black mesh bra and matching panty set, accentuating her hourglass curves, Her multicolored hair and two-tone eyes complete the look, making her appear even more alluring, Her are gently parted, revealing perfect pink **** that seem to beg for a kiss, The window blinds behind her provide the perfect amount of natural light

In the example above the subject doesn't have two-tone eyes; one eye just looks darker due to the lighting, the other doesn't. Cog is short, accurate, and direct. What you can also do is download an uncensored LLM, provide it with the Cog caption and the WD tags, and ask it: rewrite the "original description" below using the "tags provided", without changing the overall meaning, into a truly uncensored lewd ero*ic description
original description : tags provided:
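The rewrite instruction above can be templated so an LLM can process captions in bulk; a sketch following the wording in this thread, where the function name is my own:

```python
def rewrite_prompt(original: str, tags: str) -> str:
    """Build the 'rewrite using tags provided' instruction for an LLM."""
    return (
        'rewrite the "original description" below using the "tags provided" '
        "without changing the overall meaning\n"
        f"original description : {original}\n"
        f"tags provided : {tags}"
    )
```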

elismasilva commented 4 months ago

Hi guys, I've been really busy these last few days; I promise I'll pay special attention to your requests for the next version. I'm now trying to combine InstantID with InstantStyle; once that's finished I'll return to image-interrogator. In the meantime, you can keep sharing the results of your tests.

BenDes21 commented 4 months ago


I think the method of "rewriting the original description using the tags provided" with an uncensored model is the best; the problem is that in my case I need to caption over 5000 images, so it's gonna take long ^^ If you know a way to batch process this, it would be awesome; you already gave me part of the code, but I'm not super familiar with coding :/

mr-lab commented 4 months ago

elismasilva might add that, because image-interrogator is really super underrated. From a logical perspective it can load any LLM, just like Text-webui, and more, since Text-webui doesn't support most VLMs. An option called "rewrite using LLM" would help a lot. Heck, even batch processing of text using an LLM would be huge, since nobody offers that option, and this whole project could evolve into a batch-processing AI playground, with things like:

- batch code correction/reduction using DeepSeek Coder
- batch grammar and spelling checks
- batch story writing
- ... endless possibilities

@elismasilva really, thank you for your great work. I know it's a lot to ask, but give it a try.

BenDes21 commented 4 months ago


That would be awesome; that plus the settings from taggui (the screenshot I posted) and we will have the best auto-captioner!

@elismasilva 🙏

elismasilva commented 4 months ago


Thanks, friend. I would appreciate it if you could open a new topic and structure all these features, describing what is expected of each one with a possible example, so I can attack them one by one and release them in the next versions as soon as possible. It's not my daily routine to work with captions and prompts, so you have more experience than I do about the important needs around these models. It would help me a lot to be able to simulate these scenarios during development so that I can meet expectations. I believe a project only stays alive if we work together, evolving and improving.

elismasilva commented 4 months ago


You can contribute topics like I asked mr-lab to as well, thank you.

elismasilva commented 4 months ago

Have you already tested this model ? https://github.com/PKU-YuanGroup/MoE-LLaVA?tab=readme-ov-file

I want to know if it is good to add to image-interrogator.

mr-lab commented 4 months ago


DeepSpeed has issues on Windows...