Closed patdunlavey closed 2 months ago
@patdunlavey yes, that particular change is very easy. But there is another blocker: defining the sequence ID, and what that means in the context of a Dynamic IIIF Manifest.
Let me explain (meaning I need your ideas/help).
The `ocr` tag used in the Strawberry Flavor Solr docs' unique ID (remember, that mix of Node UUID, File UUID, parent, plugin ID, etc.) is based on the unique plugin ID given. By default the current PDF one is `ocr`. If we create a new Plugin instance to deal with TIFFs, etc., we want to either allow another option to set that KEY, so that both the TIFFs and the PDF pages share the `ocr` key, OR make it fixed in this type of processor (might remove flexibility in the future but would also give us "immediate" usability).
@patdunlavey this here: https://github.com/esmero/strawberry_runners/blob/dbbdcb6a7cca4e73b1c66706567c61be2965b01d/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L457 needs to come from a setting in case we are not dealing with a PDF (e.g. a `sequence_id` JSON key (the value)) and should be exposed in the config form, maybe even exposed ONLY if the source is `as:image`.
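The two options for the key can be pictured with a minimal Python sketch. It assumes the ID layout suggested by the sample document ID quoted later in this thread (`33:1:en:<file UUID>:ocr_single`), read here as node ID, sequence, language, file UUID, plugin key. The function and parameter names are hypothetical and are not taken from strawberry_runners.

```python
def flavor_doc_id(node_id, sequence, langcode, file_uuid, plugin_id, key_override=None):
    """Build an ID shaped like the sample in this thread (layout is an assumption)."""
    # Option 1: a per-plugin setting overrides the key, so a TIFF processor
    # can share the "ocr" key with the PDF processor.
    # Option 2 (a fixed key) would ignore plugin_id and always use "ocr".
    key = key_override or plugin_id
    return f"{node_id}:{sequence}:{langcode}:{file_uuid}:{key}"


FILE_UUID = "10123392-bafa-45aa-bd50-f9d9636ef6ed"  # UUID from the sample error message

# Without an override, a separate image processor produces its own key.
print(flavor_doc_id(33, 1, "en", FILE_UUID, "ocr_single"))
# With the override, image pages land under the same "ocr" key as PDF pages.
print(flavor_doc_id(33, 1, "en", FILE_UUID, "ocr_single", key_override="ocr"))
```

With a configurable key, anything that looks up flavors by `ocr` would see both PDF and image pages; the fixed-key variant gives the same result with no setting to manage, at the cost of flexibility.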
@DiegoPino I tried to spitball some code in this PR.
Update:
I made a couple further corrections in the PR. In my testing, it appears to successfully generate and index OCR for single and multiple image file objects. Not all perfect in some quick testing:
- Some of my test images are failing to index. Sample error (dumped to the console when I use drush queue:run): "msg":"Exception writing document id gg2me1-default_solr_index-strawberryfield_flavor_datasource/33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single to the index; possible analysis error.".
- Global search only seems to find content from the first file.
I checked that the queue entries include a sequence value that corresponds to the sequence number in the metadata, so that part seems to be working.
Hey! This is wonderful! 🥰 Will do a thorough review (a caring one) first hour in the morning. Thx so much!!!
On Tue, Mar 8, 2022 at 7:31 PM Pat Dunlavey wrote:
Update:
I made a couple further corrections in the PR. In my testing, it appears to successfully generate and index OCR for single and multiple image file objects. Not all perfect in some quick testing:
- Some of my test images are failing to index. Sample error (dumped to the console when I use drush queue:run): "msg":"Exception writing document id gg2me1-default_solr_index-strawberryfield_flavor_datasource/33:1:en:10123392-bafa-45aa-bd50-f9d9636ef6ed:ocr_single to the index; possible analysis error.".
- Global search only seems to find content from the first file.
I checked that the queue entries include a sequence value that corresponds to the sequence number in the metadata, so that part seems to be working.
-- Diego Pino Navarro Digital Repositories Developer Metropolitan New York Library Council (METRO)
I figured out why some documents were refusing to index: because I had branched from main rather than 0.3.0, I did not have a fix for this. I merged 0.3.0 into my issue branch and changed the pull request to target 0.3.0, which resolved that issue.
Two other issues remain: the fact that global search only discovers content in one of the OCR'd and indexed files on a multi-file object; and the bookreader does not show multi-file books as searchable.
I suspect that the first problem (text in only one of the OCR'd files is found in global search) is related to something that I see in the Solr indexed data. All of the strawberryfield_flavor_datasource records are showing "1" as the sequence_id. This is despite the fact that I'm pretty sure the files are getting proper sequence numbers going into tesseract. If you have any ideas about this, let me know @DiegoPino.
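A quick way to confirm that symptom on exported index records is sketched below in Python. It assumes the records are available as a list of dictionaries with a `sequence_id` field; the real field name in the index may differ, and both helper names are hypothetical.

```python
from collections import Counter


def sequence_histogram(solr_docs, field="sequence_id"):
    """Count how often each sequence value occurs in a list of Solr records."""
    return Counter(str(doc.get(field)) for doc in solr_docs)


def looks_collapsed(solr_docs, field="sequence_id"):
    """True when several records exist but they all share one sequence value."""
    counts = sequence_histogram(solr_docs, field)
    return len(solr_docs) > 1 and len(counts) == 1


# Three records for a multi-file object that all report sequence "1".
docs = [{"sequence_id": "1"}, {"sequence_id": "1"}, {"sequence_id": "1"}]
print(sequence_histogram(docs), looks_collapsed(docs))
```

If the histogram shows a single value for an object with several files, the sequence is being lost somewhere between the queue entry and the indexed record, not in the OCR step itself.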
The second problem - that the bookreader doesn't show the multi-file book content as searchable - seems like it could be related to the first. Another possibility is that I am using a separate strawberry runner for non-paged files and maybe that's confusing things.
@DiegoPino I pushed up some more work on this that I think gets the sequence ID working pretty well. The biggest part I'm not sure about is if I may be screwing up other processors that do not use sequence id as their input_argument.
I was wrong in my complaint that global search doesn't find content in some OCR'd files. In fact global search doesn't find any OCR content! I now understand that and why it is so (it seems like maybe adding a relation in the view to strawberry flavor datasources could let us have a search that finds nodes whose associated strawberryflavor datasource entities contain the search string?).
The absence of search within the bookreader remains a problem, but I'm thinking that may be a separate issue? Do you have any enlightenment to provide on that @DiegoPino ? Might it depend on the second item listed on this issue "Add for each Page (no collapsed data) an extra location of HOCR URL"
Hi Pat,
I need to check that logic (how the ID is passed around). It is probably the only thing in your pull that breaks the idea that a processor should be self-sufficient (the deal), and it might break other processors. Will check everything once you tell me you are done (I was about to today, but then saw some more code coming from you).
RC3 has a standard view for that (https://studio.archipelago.nyc/search_pages). Does that one not work? I mean, you could also display everything in a single view, but that would also break a “deal” (relevance of content search vs. metadata search).
Give me a little while; I'm a bit stumped with other code, but will give you a few solutions. The book reader problem is really not big. It's mostly a naming convention for each page (so it should be as easy as documenting/adapting the Twig templates for IIIF) so the Search knows where/how to find the values, but it might require some JS to be more robust (just maybe).
More tomorrow, thanks
Hi Diego,
I am done for now, pending your feedback. I won't be able to do much today (my son is home from his day program), but will try to address anything you put back to me as quickly as I can. If you have thoughts for how to isolate the necessary code changes to the OCR processor, I'm definitely all ears, but since the sequence ID is provided to the OCR processor, and utilized outside of it, I don't see how that's possible.
Derek pointed out that search_pages view to me, which I had forgotten about. But I'm not sure what you mean about a "deal" that implies that global fulltext search is only interested in object metadata. As a user, when I search for a word, I think my expectation is that I'm doing a content and metadata search. But that's not the subject of this issue, so let's not discuss it further here. I'm sorry I brought it into this discussion.
I'm glad to hear that bookreader will not be hard to solve. Will you provide specific direction for that?
Thanks!!
No worries at all about response speed. It will take me a lot of testing/debugging and code comparison to have a proper review. Will of course help/code to make the bookreader work
re: search. You can (in your own institution) mix and match results in a single View. The issue I see is that by default (and we can maybe work on that?) Strawberry Runners have NO View modes. They do not even exist. So you have to depend on fields to display, which means your global search View would need to be tuned. That is all. What a user expects or does not expect is very domain driven, and tbh most users will expect what they are used to, which does not always mean you cannot provide a different alternative/perspective. Not a criticism, just a small statement about expectations.
hugs and good luck today
Resolved
In OcrPostProcessor, where it builds the command to run tesseract, the command always emerges in the form:
I.e. it only works with pdf files as input! Any raster file (tiff, jpeg, etc) results in no OCR being generated.
It should first check the input file to see if tesseract can run on it directly, and if not, then test if ghostscript can convert it into a file that tesseract can run on.
@DiegoPino I'll take this on, it seems pretty easy -- that is unless I have misdiagnosed this problem!
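The check proposed above (run tesseract directly when it can read the file, otherwise convert with ghostscript first) might look roughly like the following Python sketch. This is not the OcrPostProcessor code: the function name, the MIME list, the 300 dpi resolution, and the temporary file naming are all assumptions for illustration. Only the general shape of the `tesseract` and `gs` command lines is standard.

```python
# MIME types assumed (not checked against the module) to be readable by tesseract directly.
DIRECT_MIMES = {"image/tiff", "image/jpeg", "image/png"}


def build_ocr_commands(path, mime_type, page=1, workdir="/tmp"):
    """Return the argv lists to run, in order, to get hOCR for one file or page."""
    if mime_type in DIRECT_MIMES:
        # Raster input: hand the file straight to tesseract, hOCR to stdout.
        return [["tesseract", path, "stdout", "hocr"]]
    if mime_type == "application/pdf":
        # PDF input: rasterize the requested page with ghostscript first.
        png = f"{workdir}/page_{page}.png"  # hypothetical temp file name
        gs = [
            "gs", "-dBATCH", "-dNOPAUSE", "-r300", "-sDEVICE=pnggray",
            f"-dFirstPage={page}", f"-dLastPage={page}",
            f"-sOutputFile={png}", path,
        ]
        return [gs, ["tesseract", png, "stdout", "hocr"]]
    raise ValueError(f"No OCR route for MIME type {mime_type!r}")


print(build_ocr_commands("scan.tif", "image/tiff"))
print(build_ocr_commands("book.pdf", "application/pdf", page=3))
```

Returning argv lists rather than one shell string keeps file names with spaces safe and makes the branch taken for each input easy to test without running either tool.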