feature request : periodic check

thiswillbeyourgithub commented 3 years ago

Hi,

I am an absolute fan of your addon and it became absolutely indispensable, thank you very much!

I regularly search my db for "<img -title" to find new cards that haven't been OCR'd yet, and was thinking that it would be useful to have the addon check for that periodically.

Say, before synchronisation, a hook could be used to see if there are any cards that haven't been OCR'd. Just saying :)

Thanks again!

galantra commented 3 years ago

Thanks for the tip with the search!

thiswillbeyourgithub commented 3 years ago

@galantra I've been giving some thought to this search string. It's actually not that good, because it doesn't work if you have the word "title" anywhere in your card, or if you OCR'd only one of the images of your card in the past and then added new images. Those won't be found.

Using regex would be the best way I guess but lookahead and such are apparently not supported so far.

Realizing this makes me think this is actually a good reason for this addon to have a function that looks through the cards and finds images that have not been OCR'd.

edit : actually there could be a way but I can't make it work so far. Right now I'm trying things like <img -'re:<img.*?title=.*?>'

cfculhane commented 3 years ago

So I am checking for OCR'd text (this allows the removal of OCR text after it's been created), and I do have a config flag to skip already OCR'd cards. By default it's set to always re-run the OCR, as I didn't want the average user to leave it on and risk not getting an OCR result for a modified card. What I could do is add a menu option using these internal functions to show 1) all cards that have OCR data and 2) all cards that do NOT have OCR data :) I'll have a think, and if it's easy to do I'll add it to the next version.

thiswillbeyourgithub commented 3 years ago

But how would you handle cards that contain several pictures, one of which has been OCR'd and one that has not? I frequently encounter this situation, as I often add additional sources to my cards over time.

galantra commented 3 years ago

@galantra I've been giving some thought to this search string. It's actually not that good, because it doesn't work if you have the word "title" anywhere in your card, or if you OCR'd only one of the images of your card in the past and then added new images. Those won't be found.

Right, these are considerable limitations.

thiswillbeyourgithub commented 3 years ago

You will be interested in reading this then https://github.com/cfculhane/AnkiOCR/issues/17

It solves the "can contain 'title' elsewhere" issue, which is somewhat better.

thiswillbeyourgithub commented 3 years ago

Sorry for the double notification, but I finally get what you meant @cfculhane. Indeed, I think that allowing the flag to be set to "ignore already OCR'ed" images would solve the issue, especially if you find a way to process more than 1000 cards in a row. I'll add a message to that issue btw.

cfculhane commented 3 years ago

I think I've come up with a viable solution - I'll take any images embedded in the note, hash the concatenation of their file names, and store this value. Then, by rehashing at a later date, I will easily be able to work out whether the card has had images added to it or not, and re-run the OCR. I could hash the image data itself, but Anki already assigns unique file names to its images in media.collection, so I think this approach is sufficient and likely to be faster.
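
For readers following along, the file-name hashing idea described above could be sketched roughly as follows. This is only an illustration and not the addon's actual code: the helper names (`image_names`, `images_fingerprint`, `needs_ocr`) are hypothetical, the regex is a simplification of real HTML parsing, and sorting the names before hashing is an added assumption so that merely reordering images does not count as a change.

```python
import hashlib
import re
from typing import List, Optional

# Simplified pattern for the src attribute of <img> tags (an assumption;
# real-world HTML may need a proper parser).
IMG_SRC_RE = re.compile(
    r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)


def image_names(field_html: str) -> List[str]:
    """Return the file names referenced by <img> tags in one field."""
    return IMG_SRC_RE.findall(field_html)


def images_fingerprint(fields: List[str]) -> str:
    """Hash the joined image file names found across all fields of a note."""
    names = sorted(name for field in fields for name in image_names(field))
    # Join with a newline so that ["ab", "c"] and ["a", "bc"] hash differently.
    joined = "\n".join(names)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def needs_ocr(fields: List[str], stored_fingerprint: Optional[str]) -> bool:
    """True if the note's image set differs from the one seen at the last OCR run."""
    return images_fingerprint(fields) != stored_fingerprint
```

With this sketch, a note whose fingerprint was stored after an OCR run would be flagged again as soon as an image is added or removed, while edits to the surrounding text would leave the fingerprint unchanged.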

thiswillbeyourgithub commented 3 years ago

That's one way to do it. But wouldn't it be simpler to just use a regexp to check whether there is an image that doesn't contain 'title="OCR:XXXXXXXXXXX"', with X being the previously OCR'd text?

This seems more robust to me.
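
A rough sketch of that per-image check, done in Python rather than in Anki's search box (where lookahead is reportedly unavailable). It assumes the OCR text is stored as `title="OCR:..."` on the `<img>` tag, as in the comment above; the function name `images_without_ocr` is hypothetical, and the tag pattern is a simplification that would break if the OCR text itself contained an unescaped `>`.

```python
import re
from typing import List

# Matches a whole <img ...> tag (simplified: stops at the first '>').
IMG_TAG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Matches a title attribute whose value starts with "OCR:" (assumed format).
OCR_TITLE_RE = re.compile(r'\btitle\s*=\s*"OCR:', re.IGNORECASE)


def images_without_ocr(field_html: str) -> List[str]:
    """Return every <img> tag in the field that lacks an OCR title."""
    return [
        tag
        for tag in IMG_TAG_RE.findall(field_html)
        if not OCR_TITLE_RE.search(tag)
    ]
```

Because each tag is inspected on its own, the word "title" elsewhere in the card does not matter, and a card with one OCR'd image and one newer image is still reported.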

cfculhane commented 3 years ago

I could do that, but some users are inserting the OCR data into an OCR field on the card, which makes it a bit trickier.

thiswillbeyourgithub commented 3 years ago

Fairly good point then.

cfculhane / AnkiOCR

feature request : periodic check #9