esmero / strawberryfield

A Field of strawberries
GNU Lesser General Public License v3.0

ISSUE-234: Make Flavor search aware of CWS/Children based OCR #235

Closed DiegoPino closed 1 year ago

DiegoPino commented 1 year ago

Still WIP, don't even test yet.

DiegoPino commented 1 year ago

@patdunlavey this is the code that solves about 85% of the use cases. I will add a guide with screenshots tomorrow, but the code as it is should be ready. The other part goes into format_strawberryfield, since metadata displays are defined there and I cannot introduce a bidirectional dependency between modules.

DiegoPino commented 1 year ago

@patdunlavey, @alliomeria and @karomabiles (if you want to see this working), instructions for testing OCR for compounds:

And make sure each one also has "sequence_id" set to 1, 2 and 3 respectively. If your webforms don't have that element/key (we need it), please add it or edit the raw JSON. Save.
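If you go the raw JSON route, the edit amounts to adding one key per child. A minimal sketch in Python, assuming three child objects; the "label" key and the helper name `with_sequence_id` are hypothetical, and only "sequence_id" comes from the instructions above:

```python
import json


def with_sequence_id(raw_json: str, sequence_id: int) -> str:
    """Return the raw JSON with "sequence_id" set, leaving other keys untouched."""
    metadata = json.loads(raw_json)
    metadata["sequence_id"] = sequence_id
    return json.dumps(metadata, indent=2)


# Three hypothetical child objects; "label" is only a placeholder key.
children = ['{"label": "Page A"}', '{"label": "Page B"}', '{"label": "Page C"}']

# Number the children 1, 2 and 3 in the order they should appear.
updated = [with_sequence_id(raw, i) for i, raw in enumerate(children, start=1)]
sequence_ids = [json.loads(raw)["sequence_id"] for raw in updated]
```

This only illustrates the shape of the change; in practice you would make the same edit through the webform or the raw JSON editor.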

Make sure the Queue is processed (All Background ones that will generate OCR).

Mine is named: [screenshot]

and has these settings: [screenshot]

Basically, you want the IABookreader, but using the IIIF V3 CWS template as its source. Now apply that view mode to the top object by editing it and forcing that display mode. [screenshot]

You should not need to reindex at this stage (if you followed these steps for this demo object).

Search for "Queen", "Pumpkin" and "King". Each should be highlighted correctly on its own page. Now search for "OCR", which should match on multiple pages.

This covers the basic use case where all children have a sequence_id and all are shown in the Manifest. Still working on the complex (a setting) use case where the structure shown is different, maybe only odd pages, etc.

Please let me know if you have issues, questions or needs.

patdunlavey commented 1 year ago

@DiegoPino starting to look at this now (sorry for the delay!!!)

patdunlavey commented 1 year ago

@DiegoPino I was able to reproduce your steps, and your result! The only problem I noticed is that I don't get the pins in the result bar. I suspect that's due to me not being fully caught up to changes in the IIIF Presentation API 3 Creative Works Series Manifest.

I tested what happens when I add a second image file to one of the child objects. It seems to OCR correctly, but it is saved in the key_value table with a sequence number of "1", rather than the one found at "as:image".*.sequence. As a result, when I display the object in the book viewer, I get the additional page, but the highlighting is off. In this case, I added your sample image file to a page in this object, and although the search finds it (the word "queen" in this example), it highlights on the wrong page: [screenshot]

Not sure if this is a simple problem to solve (and whether it's in the 15% you referred to!).

patdunlavey commented 1 year ago

Looking here, it seems like the sequence number should be correct. Not sure why it isn't!

DiegoPino commented 1 year ago

@patdunlavey adding a new page and having key_value = 1 is OK. I wonder if you added the "sequence_id" JSON key to your new page/ADO?

DiegoPino commented 1 year ago

The actual page matching here depends on having a sequence_id at the child ADO level. Without it, the Manifest will show pages in arbitrary order, which won't match the order of the response (and the relative re-ordering of search results) that happens here now. The re-paging of the results happens here: https://github.com/esmero/strawberryfield/blob/3d022aeb07a85bd39c477790669ee1254f275fc2/src/Controller/StrawberryfieldFlavorDatasourceSearchController.php#L290 So if your ADO (the one that produced the HOCR) has no sequence_id, it will return 1 and thus offset everything. Your new page should now have sequence_id = 4 (in the JSON).
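To make the fallback concrete, here is a rough sketch in Python of the behavior described above. It is not the controller code; `page_for_hit` and the "label" key are hypothetical, and the only rule taken from this thread is that a child ADO without a sequence_id is treated as page 1:

```python
def page_for_hit(child_metadata: dict) -> int:
    """Page a search hit is reported on: the child's sequence_id, or 1 if absent."""
    return int(child_metadata.get("sequence_id", 1))


# Hypothetical children of the demo object; the fourth has no sequence_id yet.
children = [
    {"label": "queen", "sequence_id": 1},
    {"label": "pumpkin", "sequence_id": 2},
    {"label": "king", "sequence_id": 3},
    {"label": "new page"},
]

# Without a sequence_id the new page falls back to 1 and collides with "queen".
pages = {child["label"]: page_for_hit(child) for child in children}

# Adding sequence_id = 4 to the new child gives it its own page.
children[3]["sequence_id"] = 4
fixed = {child["label"]: page_for_hit(child) for child in children}
```

In the first mapping, two children claim page 1, which is why highlights land on the wrong page; in the second, every child has a distinct page.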

DiegoPino commented 1 year ago

Also, the lack of pins in the result bar is strange. Are you using this on top of custom code? E.g., have you started modifying any other part of Archipelago already? Weird, because on a fresh 1.0.0 I do see the pins... maybe we need to have a call!

DiegoPino commented 1 year ago

@patdunlavey I will merge, and we can open a new pull/issue for troubleshooting? There is more work to be done on SBFlavors for sure, and I can add any corrections to a new pull.

patdunlavey commented 1 year ago

Sorry, I meant to get the results of my investigation in earlier! I'll make a new ticket for the multi-file sequencing issue.