DeSinc / SallyBot

AI Chatbot coded in Discord.net C#
MIT License
299 stars, 51 forks

Making Sally able to read images #21

Closed: vilewired closed this issue 1 year ago

vilewired commented 1 year ago

You might have heard of something called "OCR". OCR (optical character recognition) is basically a fancy term for text recognition. I would implement it myself but I have no experience in C#. It should be pretty straightforward to do though. I would recommend Tesseract because it is fast, it has high accuracy, and it requires barely any resources to run. If you want to implement it in some way, then you should probably just download the binary from here and either put it in the same folder as the bot or add it to PATH. There is also a tutorial here on how to run another binary (.exe) from C# code. If you make your bot download the image, you can just run something like tesseract image.png out (with image.png being the downloaded file) and a file called out.txt will be created containing the image text. If there is no text in the image, out.txt will be left empty, so you can check for that as well. After that you could feed that text in as a prompt for Sally to react to. If you want to implement my idea and you have any questions about Tesseract, adding things to PATH, or anything else, I would love to help you out in some way.
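The steps above (run the Tesseract binary from C#, read out.txt, treat empty output as "no text") could be sketched roughly like this. This is only a minimal sketch, not SallyBot's actual code: the names OcrSketch, ReadImageText and NormalizeOcrText and the sally_ocr_ temp-file prefix are made up for illustration, and it assumes the tesseract binary is on PATH and a .NET version that has ProcessStartInfo.ArgumentList (.NET Core 2.1 or later).

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class OcrSketch
{
    // Hypothetical helper: trims the raw OCR output so that a file containing
    // only whitespace (newlines, form feeds) counts as "no text found".
    public static string NormalizeOcrText(string raw)
    {
        return (raw ?? "").Trim();
    }

    // Runs `tesseract <imagePath> <outputBase>` and returns the recognized text,
    // or an empty string when nothing was recognized or the run failed.
    // Process.Start throws if the tesseract binary cannot be found at all.
    public static string ReadImageText(string imagePath, string tesseractExe = "tesseract")
    {
        // Tesseract appends ".txt" to the output base name by itself.
        string outputBase = Path.Combine(
            Path.GetTempPath(), "sally_ocr_" + Guid.NewGuid().ToString("N"));
        string outputFile = outputBase + ".txt";

        var startInfo = new ProcessStartInfo
        {
            FileName = tesseractExe,
            UseShellExecute = false,
            CreateNoWindow = true,
        };
        // ArgumentList handles quoting of paths that contain spaces.
        startInfo.ArgumentList.Add(imagePath);
        startInfo.ArgumentList.Add(outputBase);

        try
        {
            using var process = Process.Start(startInfo);
            if (process == null) return "";
            process.WaitForExit();

            if (process.ExitCode != 0 || !File.Exists(outputFile)) return "";
            return NormalizeOcrText(File.ReadAllText(outputFile));
        }
        finally
        {
            // Clean up the temporary output file; no error if it was never created.
            File.Delete(outputFile);
        }
    }
}
```

The bot would call OcrSketch.ReadImageText(pathToDownloadedImage) after saving the attachment and only add the result to the prompt when it is non-empty.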

DeSinc commented 1 year ago

OCR would be a bit of work on my end to figure out how to use, which implementation to go for, etc. I actually had a separate idea: just use the existing Stable Diffusion CLIP interrogator to describe in words what is visually in the image so the LLM can "see" it. That might be more valuable and fun to use overall than an OCR system that, knowing OCR software, will probably be subpar, let's be honest.

But actually, having read past the first sentence: yeah, I'll check out Tesseract.

vilewired commented 1 year ago

Maybe make a mix of both? If there is text in an image, just read that to the LLM and let it respond to it, so that it kind of "reads" the image text. And if there is no text in the image (for example, a picture of a cat), just pass the image to the interrogator and let the LLM respond to that description? I think it's a nice idea / concept that the bot can see and understand images. Makes it even more scary. If you need help with anything, I'd love to help you figure things out with OCR and the interrogator, although I am not that experienced with C# :>
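The "mix of both" idea could look something like the sketch below: try OCR first, and only fall back to the interrogator when no text comes back. Everything here is an assumption for illustration. ImagePromptSketch and DescribeImage are hypothetical names, the bracketed prompt wording is invented, and the OCR and interrogator steps are passed in as plain delegates so the sketch does not guess at any real API.

```csharp
using System;

public static class ImagePromptSketch
{
    // readText:    hypothetical OCR step, returns the text found in the image (or empty).
    // interrogate: hypothetical captioning step, returns a description of the image (or empty).
    public static string DescribeImage(
        string imagePath,
        Func<string, string> readText,
        Func<string, string> interrogate)
    {
        // Prefer OCR: if the image contains text, let the LLM "read" it.
        string text = (readText(imagePath) ?? "").Trim();
        if (text.Length > 0)
            return $"[The image contains this text: {text}]";

        // No text found (e.g. a picture of a cat): fall back to a caption.
        string caption = (interrogate(imagePath) ?? "").Trim();
        if (caption.Length > 0)
            return $"[The image shows: {caption}]";

        // Neither step produced anything usable.
        return "[The image could not be read or described]";
    }
}
```

For example, DescribeImage("cat.png", _ => "", _ => "a cat on a sofa") takes the fallback path and returns the caption line, which can then be prepended to the message that Sally responds to.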