tesseract OCR only takes pdf files as input

patdunlavey commented 2 years ago

In OcrPostProcessor, where it builds the command to run tesseract, the command always emerges in the form:

{{ ghostscript command that takes the file and tries to generate a png from it }} && {{ tesseract command that uses the png as input }}

I.e. it only works with pdf files as input! Any raster file (tiff, jpeg, etc) results in no OCR being generated.

It should first check the input file to see if tesseract can run on it directly, and if not, then test if ghostscript can convert it into a file that tesseract can run on.
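The check proposed above could look roughly like the following. This is a minimal sketch, not the code in OcrPostProcessor: the function name `build_ocr_command`, the extension lists, and the choice to branch on file extension rather than MIME type are all assumptions made for illustration. The `gs` and `tesseract` flags are standard ones for those tools.

```python
import shlex
from pathlib import Path

# Assumption: raster formats tesseract can usually read directly.
TESSERACT_READABLE = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
# Assumption: formats that need a ghostscript rasterization pass first.
GHOSTSCRIPT_READABLE = {".pdf", ".ps", ".eps"}


def build_ocr_command(source: str, page: int = 1, workdir: str = "/tmp") -> str:
    """Return a shell command string that writes hOCR for one page of `source`."""
    ext = Path(source).suffix.lower()
    out_base = str(Path(workdir) / f"{Path(source).stem}_{page}")

    if ext in TESSERACT_READABLE:
        # Raster input: no ghostscript step, tesseract reads the file as-is.
        return f"tesseract {shlex.quote(source)} {shlex.quote(out_base)} hocr"

    if ext in GHOSTSCRIPT_READABLE:
        # PDF-like input: render the requested page to PNG, then OCR the PNG.
        png = out_base + ".png"
        gs = (
            "gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 "
            f"-dFirstPage={page} -dLastPage={page} "
            f"-sOutputFile={shlex.quote(png)} {shlex.quote(source)}"
        )
        return f"{gs} && tesseract {shlex.quote(png)} {shlex.quote(out_base)} hocr"

    raise ValueError(f"Neither tesseract nor ghostscript can handle {ext!r}")
```

With this shape, a TIFF yields a bare `tesseract` command, while a PDF keeps the existing `gs ... && tesseract ...` form.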

@DiegoPino I'll take this on, it seems pretty easy -- that is unless I have misdiagnosed this problem!

DiegoPino commented 2 years ago

@patdunlavey yes, that particular change is very easy. But there is another blocker: defining the sequence ID, and what that means in the context of a dynamic IIIF manifest...

Let me explain (means I need your ideas/help)

With PDF, the original use case we built this for, the sequence order of pages is not ambiguous: we know exactly which page number each one is. But with images, knowing which TIFF is page 1 requires extra options in the processor. It depends on how many images there are inside a single ADO, and on whether the object that holds them is part of a top-level one (multiple pages bound to a book, or a creative work series).

What Options?

Allow setting a JSON key as the source for the sequence. We use sequence_id as the default (an implicit default driven by Solr and the Views), so we need to document that if you change it, you also need to change the Views that drive sequence order.
Allow setting the internal (per-image) sequence order as well, for multi-image objects.
Allow an option (maybe not even desired) to use an IIIF manifest (exposed endpoint) as the source for the sequence, in case your logic is unusual, uses a TOC, etc. This might not be needed IF we let the actual search/interaction logic of the viewers deal with that when hitting the search endpoint (does that make sense?).
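The first two options could be sketched as below. This is only an illustration under assumptions: the function name `resolve_sequence`, the per-file key `sequence`, and the shape of the metadata dictionaries are hypothetical. Only `sequence_id` as the default object-level key comes from the discussion above.

```python
from typing import Any, Mapping, Tuple

DEFAULT_SEQUENCE_KEY = "sequence_id"  # the implicit default named in this thread


def resolve_sequence(
    ado: Mapping[str, Any],
    file_entry: Mapping[str, Any],
    object_key: str = DEFAULT_SEQUENCE_KEY,
    file_key: str = "sequence",  # hypothetical per-image key
) -> Tuple[int, int]:
    """Return (object-level sequence, per-image sequence), each defaulting to 1."""

    def as_int(value: Any, fallback: int = 1) -> int:
        # Missing or non-numeric values fall back instead of raising.
        try:
            return int(value)
        except (TypeError, ValueError):
            return fallback

    return as_int(ado.get(object_key)), as_int(file_entry.get(file_key))
```

Sorting pages by the returned tuple would order them first by the object's place in its parent and then by the image's place inside the object, which is the ambiguity described above.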

Errors/bad design/improvements

Right now the ocr tag used in the Strawberry Flavor Solr doc's unique ID (remember that mix of Node UUID, File UUID, parent, plugin ID, etc.) is based on the unique plugin ID given. By default the current PDF one is ocr. If we create a new plugin instance to deal with TIFFs etc., we want to allow another option to set that key, so that both the TIFFs and the PDF pages share the ocr key. OR we make it fixed in this type of processor (this might remove flexibility in the future, but would also give us "immediate" usability).
We need to check HOW IABookreader is interacting with Archipelago, i.e. how it passes the right endpoint query via JS (we built this so we can), so that multi-TIFF objects OR compound/creative work series are searchable. This is very dependent on the originating IIIF manifest!

DiegoPino commented 2 years ago

@patdunlavey this here: https://github.com/esmero/strawberry_runners/blob/dbbdcb6a7cca4e73b1c66706567c61be2965b01d/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L457 needs to come from a setting in case we are not dealing with a PDF (e.g. the value of the sequence_id JSON key). It should be exposed in the config form, maybe even exposed ONLY if the source is as:image.

patdunlavey commented 2 years ago

@DiegoPino I tried to spitball some code in this PR.

patdunlavey commented 2 years ago

Update:

I made a couple of further corrections in the PR. In my testing, it appears to successfully generate and index OCR for single and multiple image file objects. Not everything is perfect, though. Some quick testing turned up the following:

Some of my test images are failing to index. Sample error (dumped to the console when I use drush queue:run): "msg":"Exception writing document id gg2me1-default_solr_index-strawberryfield_flavor_datasource/33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single to the index; possible analysis error.".
Global search only seems to find content from the first file.
Not seeing search in the Book Reader for OCR'd content.

I checked that the queue entries include a sequence value that corresponds to the sequence number in the metadata, so that part seems to be working.
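To compare the sequence value in the queue with what ends up in Solr, it can help to split the document ID from the error message into its parts. The sketch below assumes the ID is laid out as site hash, index ID and datasource before the slash, then node ID, sequence, language code, file UUID and plugin ID separated by colons. That layout is inferred from the sample ID above, not confirmed by the module, and the names `FlavorDocId` and `parse_flavor_doc_id` are hypothetical.

```python
from typing import NamedTuple


class FlavorDocId(NamedTuple):
    site_hash: str
    index_id: str
    datasource: str
    node_id: int
    sequence: int
    langcode: str
    file_uuid: str
    plugin_id: str


def parse_flavor_doc_id(doc_id: str) -> FlavorDocId:
    """Split a Solr document ID of the (assumed) Strawberry Flavor form."""
    prefix, _, raw = doc_id.partition("/")
    # Assumed prefix layout: <site hash>-<index id>-<datasource id>
    site_hash, index_id, datasource = prefix.split("-", 2)
    # Assumed item layout: <node id>:<sequence>:<langcode>:<file uuid>:<plugin id>
    node_id, sequence, langcode, file_uuid, plugin_id = raw.split(":")
    return FlavorDocId(
        site_hash, index_id, datasource,
        int(node_id), int(sequence), langcode, file_uuid, plugin_id,
    )


sample = (
    "gg2me1-default_solr_index-strawberryfield_flavor_datasource/"
    "33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single"
)
parsed = parse_flavor_doc_id(sample)
print(parsed.sequence, parsed.plugin_id)
```

If the assumed layout holds, the second colon-separated field is the sequence as indexed, which can be checked against the sequence number in the object's metadata.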

DiegoPino commented 2 years ago

Hey! This is wonderful! 🥰 Will do a thorough review (a caring one) first thing in the morning. Thx so much!!!

-- Diego Pino Navarro Digital Repositories Developer Metropolitan New York Library Council (METRO)

patdunlavey commented 2 years ago

I figured out that the reason some documents were refusing to index was because, as a result of having branched from main rather than 0.3.0, I did not have a fix for this. I merged 0.3.0 into my issue branch, and changed the pull request to target 0.3.0. So that resolved that issue.

Two other issues remain: the fact that global search only discovers content in one of the OCR'd and indexed files on a multi-file object; and the bookreader does not show multi-file books as searchable.

I suspect that the first problem (that text in only one of the OCR'd files is found in global search) is related to something that I see in the Solr indexed data. All of the strawberryfield_flavor_datasource records are showing "1" as the sequence_id. This is despite the fact that I'm pretty sure the files are getting proper sequence numbers going into tesseract. If you have any ideas about this, let me know @DiegoPino.

The second problem - that the bookreader doesn't show the multi-file book content as searchable - seems like it could be related to the first. Another possibility is that I am using a separate strawberry runner for non-paged files and maybe that's confusing things.

patdunlavey commented 2 years ago

@DiegoPino I pushed up some more work on this that I think gets the sequence ID working pretty well. The biggest part I'm not sure about is if I may be screwing up other processors that do not use sequence id as their input_argument.

I was wrong in my complaint that global search doesn't find content in some OCR'd files. In fact global search doesn't find any OCR content! I now understand that and why it is so (it seems like maybe adding a relation in the view to strawberry flavor datasources could let us have a search that finds nodes whose associated strawberryflavor datasource entities contain the search string?).

The absence of search within the bookreader remains a problem, but I'm thinking that may be a separate issue? Do you have any enlightenment to provide on that @DiegoPino? Might it depend on the second item listed on this issue, "Add for each Page (no collapsed data) an extra location of HOCR URL"?

DiegoPino commented 2 years ago

Hi Pat,

I need to check that logic (how the ID is passed around). It is probably the only thing in your pull request that breaks the idea that a processor should be self-sufficient (the deal), and it might break other processors. I will check everything once you tell me you are done (I was about to today, but then saw some code coming from you).

RC3 has a standard view for that: https://studio.archipelago.nyc/search_pages. Does that one not work? I mean, you could also display everything in a single view, but that would also break a "deal" (relevance of content search vs. metadata search).

Give me a little while, I'm a bit stumped with other code, but I will give you a few solutions. The bookreader problem is really not big. It's mostly a naming convention for each page, so it should be as easy as documenting/adapting the twig templates for IIIF so the search knows where and how to find the values. It might require some JS to be more robust (just maybe).

More tomorrow, thanks

patdunlavey commented 2 years ago

Hi Diego,

I am done for now, pending your feedback. I won't be able to do much today (my son is home from his day program), but will try to address anything you put back to me as quickly as I can. If you have thoughts for how to isolate the necessary code changes to the OCR processor, I'm definitely all ears, but since the sequence ID is provided to the OCR processor, and utilized outside of it, I don't see how that's possible.

Derek pointed out that search_pages view to me, which I had forgotten about. But I'm not sure what you mean about a "deal" that implies that global fulltext search is only interested in object metadata. As a user, when I search for a word, I think my expectation is that I'm doing a content and metadata search. But that's not the subject of this issue, so let's not discuss it further here. I'm sorry I brought it into this discussion.

I'm glad to hear that bookreader will not be hard to solve. Will you provide specific direction for that?

Thanks!!

DiegoPino commented 2 years ago

No worries at all about response speed. It will take me a lot of testing/debugging and code comparison to do a proper review. I will of course help/code to make the bookreader work.

re: search. You can (in your own institution) mix and match results in a single View. The issue I see is that by default (and we can maybe work on that?) Strawberry Runners have NO View modes. They do not even exist. So you have to depend on fields to display, which means your global search View would need to be tuned. That is all. What a user expects or does not expect is very domain driven, and tbh most users will expect what they are used to, which does not always mean you cannot provide a different alternative/perspective. Not a criticism, just a small statement about expectations.

hugs and good luck today

DiegoPino commented 2 months ago

Resolved

esmero / strawberry_runners

tesseract OCR only takes pdf files as input #46
