HOCR my old friend: enable full HOCR pipeline for IAbookreader

DiegoPino commented 3 years ago

See #11 related to this and all the work that Giancarlo has been doing in the last week

The plan

1. Open an issue for all this ✔️
2. Code an endpoint with configuration (because I need to know which search_api / Solr core I need to call)
3. Add a fake response to the endpoint for now
4. Add a config/override for the IAbookreader with the endpoint so it uses t
5. Make an override of the search callback and log it out. I may also want all the binary commands we need to call to go from 1) PDF to 2) multiple miniOCR
6. Make the sbr processor for this (strawberry_runners)
7. Deploy https://github.com/dbmdz/solr-ocrhighlighting with Giancarlo's schema from https://github.com/dbmdz/solr-ocrhighlighting/issues/49#issuecomment-729083760. I may want to ask what the best way is: get it from GitHub, and we may need documentation.

giancarlobi commented 3 years ago

About 4. I added some more options here https://github.com/esmero/format_strawberryfield/blob/ee9bdaea46f3c7074f1b82668b36b8e2737aaeae/js/iiif-iabookreader_strawberry.js#L21 to make IAB use my endpoint:

                            maxWidth: 800,
                            imagesBaseURL: 'https://cdn.jsdelivr.net/gh/internetarchive/bookreader@4.21.0/BookReader/images/',
+                            server: 'archipelago.byterfly.eu',
+                            bookId: 'TheBookID',
+                            searchInsideUrl: '/endpoint.php',

giancarlobi commented 3 years ago

About 7. I prefer the field type "text_ocr_stored". For plugin install:

1) download the latest jar from https://github.com/dbmdz/solr-ocrhighlighting/releases

2) copy it to /opt/solr/contrib/ocrsearch/lib/

3) add to solrconfig.xml:

<lib dir="${solr.install.dir:../../../..}/contrib/ocrsearch/lib" regex=".*\.jar" />

<searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent" name="ocrHighlight" />

4) edit solrconfig_extra.xml and set the right order of the highlighters in the /select handler:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="df">id</str>
    <str name="echoParams">explicit</str>
    <str name="omitHeader">true</str>
    <str name="timeAllowed">${solr.selectSearchHandler.timeAllowed:-1}</str>
    <str name="spellcheck">false</str>
  </lst>
  <arr name="last-components">
    <str>ocrHighlight</str>
    <str>highlight</str>
    <str>spellcheck</str>
    <str>elevator</str>
  </arr>
</requestHandler>

5) edit schema_extra_types.xml and add a new type (NB: this is for inline storage of hOCR/MiniOCR):

    <fieldtype name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
      <analyzer type="index">
        <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldtype>

6) edit schema_extra_fields.xml and add:

<field name="ocr_text_stored" type="text_ocr_stored" multiValued="false" indexed="true" stored="true" />

giancarlobi commented 3 years ago

About 7. To check that the Solr plugin works, you can update a Solr doc with the Solr post tool /opt/solr/bin/post and this JSON file:

{
    "id": "ocrdoc-1-stored",
    "ocr_text_stored": "<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id=\"0\" wh=\"1836 2596\">
<b>
<l><w x=\"385 631 566 666\">ISTITUTO<\/w> <w x=\"583 631 621 666\">DI<\/w> <w x=\"639 631 820 666\">RICERCA<\/w> <w x=\"837 631 972 666\">SULLA<\/w> <w x=\"989 631 1190 666\">CRESCITA<\/w> <w x=\"1205 631 1459 666\">ECONOMICA<\/w> <w x=\"1477 631 1738 666\">SOSTENIBILE<\/w> <\/l>
<l><w x=\"451 683 675 718\">RESEARCH<\/w> <w x=\"693 683 903 718\">INSTITUTE<\/w> <w x=\"922 683 980 718\">ON<\/w> <w x=\"999 683 1288 718\">SUSTAINABLE<\/w> <w x=\"1304 683 1528 718\">ECONOMIC<\/w> <w x=\"1546 683 1736 718\">GROWTH<\/w> <\/l>
<l><w x=\"633 1532 1000 1603\">Numero<\/w> <w x=\"1032 1531 1104 1618\">6,<\/w> <w x=\"1140 1528 1486 1622\">maggio<\/w> <w x=\"1515 1531 1740 1603\">2018<\/w> <\/l>
<l><w x=\"1371 1980 1482 2009\">Follow<\/w> <w x=\"1494 1980 1549 2009\">the<\/w> <w x=\"1565 1979 1697 2017\">Byterfly<\/w> <\/l>
<l><w x=\"1226 2041 1287 2070\">and<\/w> <w x=\"1302 2042 1396 2078\">enjoy<\/w> <w x=\"1408 2049 1493 2078\">open<\/w> <w x=\"1508 2041 1695 2078\">knowledge<\/w> <\/l>
<l><w x=\"1082 2155 1293 2183\">GIANCARLO<\/w> <w x=\"1304 2155 1457 2189\">BIRELLO,<\/w> <w x=\"1469 2156 1577 2183\">ANNA<\/w> <w x=\"1590 2156 1698 2183\">PERIN<\/w> <\/l>
<l><w x=\"1323 128 1402 156\">ISSN<\/w> <w x=\"1413 126 1536 164\">(print):<\/w> <w x=\"1546 128 1734 156\">2421-5783<\/w> <\/l>
<l><w x=\"1288 181 1368 209\">ISSN<\/w> <w x=\"1379 179 1434 216\">(on<\/w> <w x=\"1447 179 1536 216\">line):<\/w> <w x=\"1546 181 1734 209\">2421-5562<\/w> <\/l>
<l><w x=\"548 878 1734 1128\">Rapporto<\/w> <\/l>
<l><w x=\"805 1151 1734 1358\">Tecnico<\/w> <\/l>
<\/b>
<\/p>
<\/ocr>"
}
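
A minimal sketch of how a test document like the one above could be generated instead of written by hand. Strict JSON parsers reject raw line breaks inside a string, so letting a JSON library do the escaping may be safer. The function name `build_solr_doc` and the shortened one-word MiniOCR sample are assumptions for illustration only, not part of the original test.

```python
import json


def build_solr_doc(doc_id, miniocr_xml, field="ocr_text_stored"):
    """Serialize one Solr document with the MiniOCR stored inline in `field`."""
    # json.dumps escapes the quotes and line breaks inside the XML string.
    return json.dumps({"id": doc_id, field: miniocr_xml}, indent=4)


# Shortened MiniOCR sample (one word only), just to show the escaping.
miniocr = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<ocr>\n"
    '<p xml:id="0" wh="1836 2596">\n'
    "<b>\n"
    '<l><w x="1140 1528 1486 1622">maggio</w> </l>\n'
    "</b>\n"
    "</p>\n"
    "</ocr>"
)

if __name__ == "__main__":
    # Write this output to a file and send it with the post tool.
    print(build_solr_doc("ocrdoc-1-stored", miniocr))
```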

Then query with this select: ..../select?hl.ocr.fl=ocr_text_stored&hl=true&q=ocr_text_stored%3Amaggio&hl.ocr.absoluteHighlights=on

You should see something like this as the result:

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"ocrdoc-1-stored",
        "ocr_text_stored":"<?xml version='1.0' encoding='UTF-8'?>\n<ocr>\n<p xml:id=\"0\" wh=\"1836 2596\">\n<b>\n<l><w x=\"385 631 566 666\">ISTITUTO</w> <w x=\"583 631 621 666\">DI</w> <w x=\"639 631 820 666\">RICERCA</w> <w x=\"837 631 972 666\">SULLA</w> <w x=\"989 631 1190 666\">CRESCITA</w> <w x=\"1205 631 1459 666\">ECONOMICA</w> <w x=\"1477 631 1738 666\">SOSTENIBILE</w> </l>\n<l><w x=\"451 683 675 718\">RESEARCH</w> <w x=\"693 683 903 718\">INSTITUTE</w> <w x=\"922 683 980 718\">ON</w> <w x=\"999 683 1288 718\">SUSTAINABLE</w> <w x=\"1304 683 1528 718\">ECONOMIC</w> <w x=\"1546 683 1736 718\">GROWTH</w> </l>\n<l><w x=\"633 1532 1000 1603\">Numero</w> <w x=\"1032 1531 1104 1618\">6,</w> <w x=\"1140 1528 1486 1622\">maggio</w> <w x=\"1515 1531 1740 1603\">2018</w> </l>\n<l><w x=\"1371 1980 1482 2009\">Follow</w> <w x=\"1494 1980 1549 2009\">the</w> <w x=\"1565 1979 1697 2017\">Byterfly</w> </l>\n<l><w x=\"1226 2041 1287 2070\">and</w> <w x=\"1302 2042 1396 2078\">enjoy</w> <w x=\"1408 2049 1493 2078\">open</w> <w x=\"1508 2041 1695 2078\">knowledge</w> </l>\n<l><w x=\"1082 2155 1293 2183\">GIANCARLO</w> <w x=\"1304 2155 1457 2189\">BIRELLO,</w> <w x=\"1469 2156 1577 2183\">ANNA</w> <w x=\"1590 2156 1698 2183\">PERIN</w> </l>\n<l><w x=\"1323 128 1402 156\">ISSN</w> <w x=\"1413 126 1536 164\">(print):</w> <w x=\"1546 128 1734 156\">2421-5783</w> </l>\n<l><w x=\"1288 181 1368 209\">ISSN</w> <w x=\"1379 179 1434 216\">(on</w> <w x=\"1447 179 1536 216\">line):</w> <w x=\"1546 181 1734 209\">2421-5562</w> </l>\n<l><w x=\"548 878 1734 1128\">Rapporto</w> </l>\n<l><w x=\"805 1151 1734 1358\">Tecnico</w> </l>\n</b>\n</p>\n</ocr>",
        "timestamp":"2020-11-17T17:05:21.248Z",
        "_version_":1683627936321634304}]
  },
  "highlighting":{
    "ocrdoc-1-stored":{
      "id":["<em>ocrdoc-1-stored</em>"]}},
  "ocrHighlighting":{
    "ocrdoc-1-stored":{
      "ocr_text_stored":{
        "snippets":[{
            "text":"ISTITUTO DI RICERCA SULLA CRESCITA ECONOMICA SOSTENIBILE RESEARCH INSTITUTE ON SUSTAINABLE ECONOMIC GROWTH Numero 6, <em>maggio</em> 2018 Follow the Byterfly and enjoy open knowledge",
            "score":42.31104,
            "pages":[{
                "id":"0",
                "width":1836,
                "height":2596}],
            "regions":[{
                "ulx":385,
                "uly":631,
                "lrx":3282,
                "lry":4127,
                "text":"ISTITUTO DI RICERCA SULLA CRESCITA ECONOMICA SOSTENIBILE RESEARCH INSTITUTE ON SUSTAINABLE ECONOMIC GROWTH Numero 6, <em>maggio</em> 2018 Follow the Byterfly and enjoy open knowledge",
                "pageIdx":0}],
            "highlights":[[{
                  "ulx":1140,
                  "uly":1528,
                  "lrx":2626,
                  "lry":3150,
                  "text":"maggio",
                  "parentRegionIdx":0}]]}],
        "numTotal":1}}},
  "highlighting":{}}

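The query and response above could be handled from a script roughly as follows. This is only a sketch: the base URL passed to `build_ocr_query` is a placeholder (the real select URL is not given in this thread), and the helper names are made up for illustration. The parameter names and the `ocrHighlighting` structure are the ones shown above.

```python
from urllib.parse import urlencode


def build_ocr_query(select_url, term, field="ocr_text_stored"):
    """Build the select URL with the OCR highlighting parameters used above."""
    params = {
        "q": f"{field}:{term}",
        "hl": "true",
        "hl.ocr.fl": field,
        "hl.ocr.absoluteHighlights": "on",
    }
    return f"{select_url}?{urlencode(params)}"


def extract_highlights(response, field="ocr_text_stored"):
    """Collect (doc id, text, ulx, uly, lrx, lry) for every highlight box."""
    boxes = []
    for doc_id, fields in response.get("ocrHighlighting", {}).items():
        for snippet in fields.get(field, {}).get("snippets", []):
            # "highlights" is a list of groups, each group a list of boxes.
            for group in snippet.get("highlights", []):
                for box in group:
                    boxes.append(
                        (doc_id, box["text"], box["ulx"], box["uly"], box["lrx"], box["lry"])
                    )
    return boxes


if __name__ == "__main__":
    # Placeholder host and core name, not taken from the thread.
    print(build_ocr_query("http://localhost:8983/solr/mycore/select", "maggio"))
```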
giancarlobi commented 3 years ago

@DiegoPino some thoughts about PDF/IIIF/resolution. I start from the final note: I think that by using identify/pdfinfo for the image WxH we are losing the resolution of the original image. I checked that with 2 PDFs: one generated from DOCX (https://archipelago.byterfly.eu/node/29) and one generated by abbey from TIFF (300 dpi) (https://archipelago.byterfly.eu/do/750aeedb-9a86-4bdd-bf93-4a5377e149af). For pdf_docx identify reports 595x842 pts (793x1123 px) and for pdf_tiff 325x491 pts (439x651 px). If I query Cantaloupe for the first page at full resolution, /full/full/0/default.jpg?page=1, I get pdf_docx (1240x1753 px) and pdf_tiff (686x1016 px). Those are the same values I get from Cantaloupe's info.json. So my conclusion is that we have to use the info.json width and height (the first ones returned in the main array) and not the ones returned by identify. Just a note, to discuss what is better and simpler to manage.

giancarlobi commented 3 years ago

And a related note: do we really need to store WxH for each page in the SBF JSON? What about a PDF with 1000 pages? I think we can save that space.

DiegoPino commented 3 years ago

Hi, yes. We can discuss this, and I'm fine with asking Cantaloupe if that works for you. The Cantaloupe value is really based on the rasterizing resolution in the Cantaloupe properties file, so it is also variable. The best way may be to define a common resolution value and apply the same everywhere (so a setting), and we just multiply. Let's talk about this. It is not really different from exporting TIFFs manually, except that we need to be consistent everywhere here, and with TIFF we can make a mistake (wrong dpi) and have to live with the small TIFF forever.

Are there better id tools for pdf?

-- Diego Pino Navarro Digital Repositories Developer Metropolitan New York Library Council (METRO)

giancarlobi commented 3 years ago

If we use the Cantaloupe info.json response, that works for TIFF too and not only for PDF, and we don't lose resolution. If we define a default resolution we can lose the TIFF dpi, and we have to manage portrait/landscape page orientation too. I didn't find any better tool to get PDF dimensions; those are not the pixel dimensions of the included image and have nothing to do with the Cantaloupe response. Well, something to think about and discuss, friend. Have a nice day.

giancarlobi commented 3 years ago

In addition, as you already asserted, for a book made of TIFFs (e.g. https://archipelago.byterfly.eu/node/18) Cantaloupe's info.json returns 2481x3508 px, which are the original TIFF dimensions, so limiting to a fixed value (e.g. 1200 width) means losing resolution with respect to the stored TIFF.

giancarlobi commented 3 years ago

And also this: the image Cantaloupe rasterizes from a PDF depends on the configuration parameter processor.dpi = 150, as documented here https://cantaloupe-project.github.io/manual/4.1/processors.html#PdfBoxProcessor So, for PDF, the right WxH to use also depends on the Cantaloupe configuration.
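
The arithmetic behind these numbers can be sketched as below, assuming the usual definition of a PDF point as 1/72 inch. The helper name `points_to_pixels` is made up for illustration, and the exact rounding Cantaloupe or identify apply may differ by a pixel.

```python
def points_to_pixels(points, dpi):
    """Convert a length in PDF points (1/72 inch) to pixels at the given dpi."""
    return round(points * dpi / 72)


if __name__ == "__main__":
    # pdf_docx page size reported by identify: 595x842 pts
    for dpi in (96, 150):
        print(dpi, points_to_pixels(595, dpi), points_to_pixels(842, dpi))
```

With the 595x842 pts page, 150 dpi gives 1240x1754 px, in line with what Cantaloupe reports, while 96 dpi gives 793x1123 px, the pixel size identify reported for the same file.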

giancarlobi commented 3 years ago

@DiegoPino I was thinking more about how Archipelago has to manage paged ADOs. Considering how viewers (IAB first of all, but this also holds for Mirador) manage images, the high importance of IIIF and the manifest, the performance of Solr indexing/query, the availability of a (seemingly) very good Solr plugin for hOCR/MiniOCR, and some personal feelings, I came up with this (new) idea for managing paged ADOs:

The Solr doc has to store the ADO reference + page reference + width and height (as returned by Cantaloupe info.json) + MiniOCR
We don't have to store any, or almost any, of the above in the SBF-JSON (think of a book with really many pages)
The IIIF manifest has to be "hardcoded", that is, a service with very few settings: you pass the ADO ref to the service and it returns the manifest, querying Solr for the page WxH. It could be a publicly available IIIF manifest endpoint
The Solr doc update has to be managed by a dedicated (at the beginning, not customizable) service/flavour executed after ADO creation. This can support a yes/no option for the user, or be something the user decides to execute later or just after ADO ingest
Managing hOCR as a zip is a good choice, but since we store everything in the Solr docs, storing the zip may not really be needed; at least we can store a checksum in Solr to detect whether something changed
Regarding IAB, it uses the manifest WxH as default settings (manifest returned by the service based on the Solr query), so on search the IAB search endpoint has to A) query Solr for the searched term, filtering by ADO ref, and B) transform the coordinates by multiplying the relative values by the returned width (height). We can choose to store absolute coordinate values in Solr, which saves a calculation in the IAB search endpoint, but it is not clear to me whether this is also good for Mirador; to check
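
Step B of the last point could look roughly like this. It is a sketch only: it assumes relative boxes are fractions between 0 and 1 and reuses the ulx/uly/lrx/lry key names from the Solr highlighting response; the function name `to_absolute` is hypothetical.

```python
def to_absolute(box, page_width, page_height):
    """Scale a relative highlight box (fractions of the page) to pixel coordinates."""
    return {
        "ulx": round(box["ulx"] * page_width),
        "uly": round(box["uly"] * page_height),
        "lrx": round(box["lrx"] * page_width),
        "lry": round(box["lry"] * page_height),
    }


if __name__ == "__main__":
    # Page WxH as it would come from the manifest / Solr (example values).
    relative = {"ulx": 0.5, "uly": 0.5, "lrx": 0.75, "lry": 1.0}
    print(to_absolute(relative, 1240, 1754))
```

Keeping the boxes relative means the same Solr data can serve any viewer as long as the endpoint knows the WxH the viewer is using for that page.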

Well, more things to discuss ... have a nice Sunday, amigo

DiegoPino commented 3 years ago

Hi, I will read this in detail and reply to each point tomorrow (I need to test code to be sure what I say is correct). Even though I understand your use case, I feel that having anything hardcoded is totally not the Archipelago way. If hardcoded means you can make the exact manifest that works for your use case and set up all the rest based on that one (settings can even be saved automatically by parsing the manifest once during setup), then great, but not in code. If viewers are able to adapt to a manifest that is unknown and variable, why not us too? If we go that way we will totally deviate from what we are as a project just to serve a single need. I agree about the hOCR sbr: no settings for that, too much logic to make it configurable. As for what we store in the SBF, well, that is up to each institution; we can probably make the postprocessor more configurable. I have no personal issues yet with 1000+ pages, but we may have. We can also store only a main width/height and then only the pages that deviate from that. We may need to keep exploring what we need, and where in the workflow the lowest effort/complexity denominator is, until we find the solution. Let's have a call tomorrow or Tuesday and we will figure it out for sure.

Enjoy a peaceful sunday!
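
The idea above of storing one main width/height plus only the pages that deviate from it could be sketched like this. The structure (`default` / `exceptions`) and both function names are assumptions for illustration, not an agreed SBF-JSON layout.

```python
from collections import Counter


def compact_page_sizes(sizes):
    """Reduce {page: (width, height)} to the most common size plus exceptions."""
    if not sizes:
        raise ValueError("no pages given")
    # The most frequent (width, height) pair becomes the default.
    default, _count = Counter(sizes.values()).most_common(1)[0]
    exceptions = {page: size for page, size in sizes.items() if size != default}
    return {"default": default, "exceptions": exceptions}


def page_size(compact, page):
    """Look up the size of one page in the compact structure."""
    return compact["exceptions"].get(page, compact["default"])


if __name__ == "__main__":
    sizes = {1: (1240, 1754), 2: (1240, 1754), 3: (1754, 1240)}
    compact = compact_page_sizes(sizes)
    print(compact, page_size(compact, 3))
```

For a 1000-page book with uniform pages this stores a single pair instead of 1000.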

giancarlobi commented 3 years ago

@DiegoPino Great for a call tomorrow or Tuesday, I probably can explain better, i.e. I don't want something hardcoded and specific for a use case, instead I mean something working with any kind of viewer. Take care, amigo

DiegoPino commented 3 years ago

@giancarlobi tomorrow Tuesday, 9:AM EST, 3:00 PM Milan, does that work? Thx!

giancarlobi commented 3 years ago

@giancarlobi tomorrow Tuesday, 9:AM EST, 3:00 PM Milan, does that work? Thx!

Perfect, amigo!

giancarlobi commented 3 years ago

@DiegoPino I was thinking about IAB and the WxH it uses as reference, the same one we have to use to calculate highlighting boxes. You already updated the twig template for the manifest, which now returns the WxH included in the SBF-JSON flv:identify width and height. But the identify WxH doesn't correspond to the one returned by IIIF info.json, so why not store the info.json WxH in the SBF-JSON instead of the one returned by identify? I think all images (JPG, TIFF, PDF) are mainly managed by Cantaloupe, so that makes sense ... or not? In addition, we can also put that WxH (info.json) into the MiniOCR instead of the one returned by tesseract; we can do that because coordinate values are stored as relative values. This makes every WxH (info.json, SBF-JSON, MiniOCR, ...) consistent. Just one more idea, friend.

giancarlobi commented 3 years ago

@DiegoPino An addition here: I tried a useful tool, Apache PDFBox, the same one Cantaloupe uses to convert PDF to JPG. I was able to run it from the command line, and using the same dpi that Cantaloupe uses (see processor.dpi in the Cantaloupe configuration) I can retrieve with identify the same WxH that info.json returns, without querying Cantaloupe. To test:

download the package: wget https://downloads.apache.org/pdfbox/2.0.21/pdfbox-app-2.0.21.jar (you need openjdk-8-jdk)
convert a PDF page to JPG: java -jar pdfbox-app-2.0.21.jar PDFToImage -imageType jpg -page 1 -dpi 150 application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf
identify the image: identify application-test-139e32dc-4339-47db-ad95-a16112a7666d1.jpg
which reports: application-test-139e32dc-4339-47db-ad95-a16112a7666d1.jpg JPEG 1240x1753 1240x1753+0+0 ...

These match the dimensions returned by querying Cantaloupe for page 1 of the PDF (the height differs by one pixel, 1753 vs 1754):

https://archipelago.byterfly.eu/iiif-server/iiif/2/90a%2Fapplication-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf/info.json?page=1

@context | "http://iiif.io/api/image/2/context.json"
-- | --
@id | "https://archipelago.byte…db-ad95-a16112a7666d.pdf"
protocol | "http://iiif.io/api/image"
width | 1240
height | 1754

DiegoPino commented 3 years ago

@giancarlobi Thanks, give me a day or two to think about the consequences of this. There are a few use cases where this may not be true (e.g. a Cantaloupe where the max size is restricted, which one can do), but yes, in general this applies and you are right. I would prefer to keep the decimal notation in the OCR for now, until we get at least one solution working completely; then we can refine it, make it better and test with new code. I totally understand what you say and I agree. I just feel I'm too tired right now (really) to have a decent argument or apply changes, at least until I have the search endpoints working correctly. Hope that makes sense. I will follow up once I have more code to share, but I won't forget this, no worries.

giancarlobi commented 3 years ago

@DiegoPino No rush, I wrote here because it is the right place for my (nightly) thoughts. I don't want to change the MiniOCR notation from relative (decimal) to absolute ... in absolute, sorry if I explained it with the wrong words. And descansa amigo, por favor.

giancarlobi commented 3 years ago

@DiegoPino I discovered that we don't need Apache PDFBox. It is simpler, in line with the Archipelago philosophy: we only need to add a parameter to identify to get the same dimensions as Cantaloupe: -density NNNxNNN, where NNN is the value of processor.dpi in the Cantaloupe configuration. Obviously, this applies only to PDF files. E.g. identify -density 150x150 application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf

application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[0] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[1] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[2] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[3] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[4] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[5] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[6] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[7] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.000
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[8] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.000
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[9] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.000
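
Output like the above could be turned into per-page WxH with a small parser. This is a sketch that assumes the line layout shown here (name[page], format, WxH, ...); the regular expression and the function name `parse_identify` are illustrative, and other identify versions or formats may print differently.

```python
import re

# name[page] FORMAT WIDTHxHEIGHT ...
IDENTIFY_LINE = re.compile(
    r"^(?P<name>.+)\[(?P<page>\d+)\]\s+\S+\s+(?P<w>\d+)x(?P<h>\d+)\s"
)


def parse_identify(output):
    """Return {page_index: (width, height)} from identify's per-page lines."""
    pages = {}
    for line in output.splitlines():
        match = IDENTIFY_LINE.match(line.strip())
        if match:
            pages[int(match["page"])] = (int(match["w"]), int(match["h"]))
    return pages


if __name__ == "__main__":
    sample = (
        "application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[0] PDF "
        "1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009\n"
        "application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[1] PDF "
        "1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009\n"
    )
    print(parse_identify(sample))
```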

giancarlobi commented 3 years ago

Also, with a pipe to identify a single page: qpdf --empty --pages application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf 1 -- - | identify -density 150x150 -

giancarlobi commented 3 years ago

Also, since MiniOCR can optionally have a wh attribute with the {width} {height} values for the page, it would be useful to put there the same WxH as in the SBF-JSON identify (and the same as info.json), so the calculation of absolute bbox values becomes simpler in an IAB search result.

DiegoPino commented 3 years ago

@giancarlobi @pcambra this is almost done:

We still need 3 tasks to get the full pipeline

Modify https://github.com/esmero/archipelago-docker-images/blob/main/esmero-php-fpm/Dockerfile to have the missing tools @giancarlobi added for direct text extraction from PDFs instead of HOCRing them as images (default when those tools are not around). These tools are pdf2djvu, djvudump and djvu2hocr. Some of these are python tools and need to be compiled
Persist our temporary Key Values into a frictionless datapackage and attach it to the source NODE/ADO once all HOCR pages are processed. This may need to go into Strawberryfield as a generic Frictionless datapackage processor, with capabilities for adding/extracting files. That module already has the required dependencies to deal with https://github.com/frictionlessdata/datapackage-php. Why generic? Because a WACZ file is also a datapackage, and for preservation needs we will want to put data that is heavy to process and rarely accessed inside a single file.
Make sure books made of single images can be processed, which also means changing our Pager Plugin in Strawberry_runners.

I'm mentioning you both because I may need help figuring out, testing and implementing some of these things. Should I open individual issues and then make this a macro one linked to those?

Asking for a friend

giancarlobi commented 3 years ago

@DiegoPino I was offline for a couple of days to solve hardware issues and close some reports. I will start reading all you have done and answer asap. Take care, friend
