esmero / format_strawberryfield

Set of Display formatters, extractors and utils to make Strawberry field data glow
GNU Lesser General Public License v3.0

Explore and decide on an OCR storage / indexing / retrieval strategy #11

Closed. DiegoPino closed this issue 2 months ago.

DiegoPino commented 5 years ago

USE CASE

With @giancarlobi we have been exploring the most efficient way of dealing with OCR and HOCR (or ALTO) in an Archipelago/strawberryfield environment. This issue is here to explore all the options, weigh the pros and cons of each approach, and also discuss how this affects daily operations like editing/re-OCR-ing and access.

Functional requirements for a valid solution are:

In general, what we are looking for is a combination/weighted approach that balances all these needs. We won't have a perfect solution that answers 100% of the requirements, but at least we can weigh them and decide which ones to prioritize.

Please feel free to add comments and ideas, questions and other functional requirements. Thanks!

DiegoPino commented 5 years ago

I will start by pasting some things to add to the discussion that I found while reading (by email) Giancarlo's great ideas and possible JSON formats:

  1. https://github.com/dbmdz/solr-ocrhighlighting This goes into the devops part a bit. It's a special Solr (7 and 8) plugin that optimizes hOCR search. It depends on a Solr document per page but requires little processing on the Archipelago side.
  2. https://www.drupal.org/project/search_api_attachments. This one is interesting, either to try out or to extend/replicate for a smaller set of use cases. It exposes attached files to the Search API by providing a new processor plugin (and that is where we would need to extend strawberryfield to expose referenced files as attached!).

Keep the links and ideas coming!

giancarlobi commented 5 years ago

A first JSON draft with hOCR, or better, word coordinates:

"hOCR": {
    "ocr_page": [
        "0 0 2481 3508 1",
        "0 0 2481 3508 2",
        "0 0 2481 3508 3",
        ...
    ],
    "word": [
        "RESEARCH 610 924 913 972 1",
        "INSTITUTE 937 924 1221 972 1",
        "ON 1245 924 1324 972 1",
        "SUSTAINABLE 1349 924 1740 972 1",
        "ECONOMIC 1762 924 2065 972 1",
        ...
        "Direttore 471 535 626 564 2",
        "Secondo 791 534 932 564 2",
        "Rolfo 944 534 1037 564 2",
        "Direzione 462 640 626 672 2",
        "CNR-IRCRES 791 639 1033 669 2",
        "Istituto 789 688 907 717 2",
        ...

        "Byterfly 1214 2738 1366 2781 3",
        "and 1388 2738 1453 2771 3",
        "enjoy 1474 2738 1575 2781 3",
        "open 1596 2748 1685 2780 3",
        "knowledge, 1705 2738 1915 2781 3",
        ...
    ]
}

ocr_page lists, for each page: x0, y0, xmax, ymax, page sequence number. This could be useful when we need to resize/reindex word coordinates. word lists, for each OCRed word: word, x1, y1, x2, y2, page sequence number. All "word" elements could be retrieved into an array to manage and pass to BookReader. Solr: we need to index the "word" strings to store coordinates, plus a full-text version of the DO for searching and highlighting. The JSON above could be generated from hOCR, PDF (via the DjVu tools) or DJVU. A very basic PHP script to convert a multi-page hOCR file into the JSON above:

<?php
// Very basic hOCR (XHTML) to JSON converter; see the draft structure above.
$xml = simplexml_load_file("ptot.html");

$hOCR = array();
$page_seq = 0;

foreach ($xml->body->children() as $page) {

  $page_seq += 1;
  // Page title looks like 'bbox x0 y0 xmax ymax'; substr() drops the 'bbox ' prefix.
  $hOCR['ocr_page'][] = substr($page['title'], 5) . " " . $page_seq;

  foreach ($page->children() as $line) {
    foreach ($line->children() as $word) {
      // Word title looks like 'bbox x1 y1 x2 y2'; append the page sequence number.
      $hOCR['word'][] = $word . " " . substr($word['title'], 5) . " " . $page_seq;
    }
  }
}

$hOCR_json = json_encode($hOCR, JSON_PRETTY_PRINT);
print_r($hOCR_json);
?>
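For reference, since the input file name is hardcoded, the script above could be saved as e.g. hocr2json.php (the name is just an example) next to ptot.html and run with:

php hocr2json.php > ptot.json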
DiegoPino commented 5 years ago

@giancarlobi thinking loud here and bringing our today's call to this issue

Using the reference numbers you shared with me: if a single page has 300 words and a huge book can have 3,000 pages, that leads to 900,000 entries counting all JSON "hOCR.*.word" entries. That is way too much for Solr to handle inside a single document as multivalued field values. Too much memory.

So we do need a sub-entity to keep track of each page, and that entity needs to be bound to the primary node. Books without OCR won't need it, so I would keep the list of images inside the main JSON anyway. There are other use cases, like TECHMD (again, the idea of datastreams permeates), where that could be something we need too.

What I fear is the burden of managing 3,000 extra nodes for a single book node. Nodes are an expensive entity if we are really just using them to push to Solr and never really making any other use of them. Our access to HOCR only really happens at ingest/index and edit time, never during normal access.

And yes, we need to handle small books and large books in the same way.

Some ideas i already shared with you:

I'm willing to test an alternative approach that deals just with external data (we ingest it, but it still has no node in Drupal) that connects back to a real node as a reference. It will take me some time to build some demo plugins and test.

That deals with referenced files. I mean, files are entities and we keep track of their usage. So indexing a file's text content based on mixed values from the referencing node (using strawberryfield data like the page number) could work.

That way a strawberryfield JSON could have this structure

"hocr": [
    {
    "url": "s3://hash/hashhash-hocr-page1.html",
        "sequence": 1,
        "lang": "it"
        },
       {
    "url": "s3://hash2/hashhash-hocr-page2.html",
        "sequence": 2,
        "lang": "it"
        }
]

The url could also be another node; I don't see why not. It's just more code.

That way we could even go further and A) use an external HTTP source URL for the HOCR (imagine a 7.x repo), or B) ZIP everything and then reference the internal name via a ZIP streamwrapper (zip://). I would need to come up with a JSON structure that is small but flexible enough for all the use cases.

This also leads to a new 💡 idea I just had while writing this (not about whether we need more than a node, but about how we describe the HOCR in the main node).

Let's say we have a special key named "service", like the one used in our strawberryfield `as:generator`, that says:

"ap:hocr": {            
           "url": "http:\/\/localhost:8001\/hocrfromzip\/hocrfornode1",
           "name": "hocr endpoint",
           "source_url": "s3://allmyhocr-hash-for-node1.zip",
           "type": "Service"
}

and that endpoint returns a list of files, or HOCR elements, that we can then push to Solr? (The "ap:" here is to avoid mixing user-generated metadata with system-generated metadata; for this custom work we could namespace to IIIF, to "ap" for Archipelago, or even to "dr" for Drupal.)

This could be an alternative for many needs. If we have a service endpoint, that service URL is responsible for extracting from the ZIP, or processing HTML into JSON (like the manifest thing), or complementing the data with parent info, who knows what else? We can allow plugins there so other folks can extend it. To be honest I have not fully explored this idea, but I prefer to paste it here before it gets lost, and my brain seems to be happy with its logic 😁. IIIF uses services a lot. Maybe it makes sense for us too, because then that service can provide a ZIP download for people (from source_url), but for us a full list of pages and words... same metadata, smaller JSON? Portability becomes an issue, but we can always default to source_url when importing into a new system.
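To make the service idea a bit more concrete, a purely hypothetical response from such an endpoint (none of these key names are agreed on; it is only an illustration of "a full list of pages" derived from the source ZIP) could look something like:

{
  "type": "Service",
  "source_url": "s3://allmyhocr-hash-for-node1.zip",
  "pages": [
    {"sequence": 1, "file": "hashhash-hocr-page1.html", "lang": "it"},
    {"sequence": 2, "file": "hashhash-hocr-page2.html", "lang": "it"}
  ]
}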

Just too many ideas. I will start coding tomorrow. Thanks for reading this far. This is a lot!

giancarlobi commented 5 years ago

@DiegoPino I love your last but not least idea; I feel it could be the right way to deal with this and a lot more issues. I'll try to add more brainstorming to this.

Well @DiegoPino, I think we are on the right track!!

giancarlobi commented 5 years ago

Thinking forward, how do we have to build the ZIP?

DiegoPino commented 5 years ago

@giancarlobi thanks for your feedback; all your points are true. Still, there will be some exploring and a lot of documenting. Services will have to be exposed as such, and we will have to have a way of letting JSON/webforms, etc. know they are there for this =)

Yes to 1 file per page, as many files as pages, any file name (see my answer to that further down). I do still like your HOCR-as-JSON instead of the original HTML file, but we can allow both if we describe things correctly. The benefit of HOCR directly is that the user needs less processing; the benefit of the JSON is that it's slimmer and we can consume it right away in Archipelago =)

ZIP file as the only format? I thought about it because it feels like the most compact way of storing something we don't need to access all the time. Data will be pushed into Solr (somehow, still need to solve that!), but yes, we need to define what is inside, how to access it, etc. in a simple (again, simple is so complicated sometimes) way.

I was thinking about creating a type of manifest JSON file defining what files are inside the ZIP, etc. That manifest would need to exist in each ZIP file. That manifest, in some format we decide on, can tell the service what to expect, what is in there and how to use it (maybe, maybe that is too much). It also allows us to add some basic preservation metadata there that we could bring back (for performance needs) into the main strawberryfield JSON.

Services could even refer to other nodes or lists of files if needed, but for HOCR the ZIP idea seems simple enough.

One standard I really like is this one (and by adopting it we can integrate with other communities):

https://frictionlessdata.io/docs/data-package/

We probably don't need the full spec here because our need is pretty concise, but since those nice people have https://github.com/frictionlessdata/datapackage-php we can even consume those packages. So the manifest could indeed be a data package =)

This also opens the door for something else! Datasets from science... =)

What do you think? If you like the idea (or have any other ideas) let me know. Maybe we should start with (pseudo plan):

  1. Define what data we need to describe a Service in JSON, relative to what a service could need to work properly. We can start with the basics and with the hocr service; no need to define a full REST endpoint, I think. Should we borrow some knowledge from IIIF for that? We know it needs at least the current node, a ZIP file (source), a URL (the service endpoint) and a name, but more is needed for sure.
  2. Build a simple service that just returns something based on that data (proof of concept).
  3. Let strawberryfield know that the JSON can contain services, and if so, that it should call their endpoints to fetch the data or do something (I can do this; I need to research some tiny little things there).
  4. Make a prototype ZIP with a datapackage.json inside (if you like it; if not we can do something simpler).
  5. See if we can read from the ZIP, list all files and do a dummy processing run on ingest, or when adding a ZIP file to the SBF JSON (is SBF a good way to refer to strawberryfield?), hitting the service URL.

It's a lot! But it seems so fun and useful. I'm really happy you liked the idea!

giancarlobi commented 5 years ago

I'll study this and answer you later, great!!!

giancarlobi commented 5 years ago

Yes to 1 file per page, as many files as pages, any file name (see my answer to that further down). I do still like your HOCR-as-JSON instead of the original HTML file, but we can allow both if we describe things correctly. The benefit of HOCR directly is that the user needs less processing; the benefit of the JSON is that it's slimmer and we can consume it right away in Archipelago =)

I think less user processing is better; we can use hOCR2JSON as the JSON returned by the service, or something like this.

ZIP file as the only format? I thought about it because it feels like the most compact way of storing something we don't need to access all the time. Data will be pushed into Solr (somehow, still need to solve that!), but yes, we need to define what is inside, how to access it, etc. in a simple (again, simple is so complicated sometimes) way.

I was thinking about creating a type of manifest JSON file defining what files are inside the ZIP, etc. That manifest would need to exist in each ZIP file. That manifest, in some format we decide on, can tell the service what to expect, what is in there and how to use it (maybe, maybe that is too much). It also allows us to add some basic preservation metadata there that we could bring back (for performance needs) into the main strawberryfield JSON.

Great idea. We can include a copy of the main SBF JSON into datapackage.json ("...a descriptor MAY include any number of properties in addition to those described as required and optional properties...") alongside the hOCR files listed in resources.

Services could even refer to other nodes or lists of files if needed, but for HOCR the ZIP idea seems simple enough.

So, the idea is a ZIP file, and within it a datapackage.json file and as many hOCR files as pages.

One standard I really like is this one (and by adopting it we can integrate with other communities):

https://frictionlessdata.io/docs/data-package/

We probably don't need the full spec here because our need is pretty concise, but since those nice people have https://github.com/frictionlessdata/datapackage-php we can even consume those packages. So the manifest could indeed be a data package =)

I fully agree with this!

This also opens the door for something else! Datasets from science... =)

Now I'll study the pseudo-draft-plan.

giancarlobi commented 5 years ago

I reordered the pseudo-plan a little bit, following my thoughts:

1) ZIP file format: a datapackage.json file and a folder (hOCR?) with the hOCR files, sorted by file name == page sequence. A folder could be useful to manage the hOCR and to leave space for other folders (i.e. the SBF JSON as backup).
NB1: while writing, I was thinking about the plain text of the pages, which is needed for indexing into Solr. Where do we put it? It could go into the ZIP, in a folder (TXT?), with the same logic as hOCR (file name order == page sequence) and the file names added to datapackage.json.
NB2: datapackage.json must also include the DO ID of the parent the pages refer to.
2) service:

We have to design these three points before starting with a prototype, IMHO. I'll try to add something more useful in the next comment.

giancarlobi commented 5 years ago

ZIP file and datapackage.json

We also have to deal with TXT files, so I tried to draft a minimal datapackage.json compliant with the spec; the path syntax is probably not the right one:

{
  "parentID": 123,
  "resources": [
    {
      "sequence": 1,
      "format": "html",
      "path": "ZIP://hOCR/myhocr-hash-for-page1.html"
    },
    {
      "sequence": 2,
      "format": "html",
      "path": "ZIP://hOCR/myhocr-hash-for-page2.html"
    },
    {
      "sequence": 3,
      "format": "html",
      "path": "ZIP://hOCR/myhocr-hash-for-page3.html"
    },
    {
      "sequence": 1,
      "format": "txt",
      "path": "ZIP://TXT/mytxt-hash-for-page1.txt"
    },
    {
      "sequence": 2,
      "format": "txt",
      "path": "ZIP://TXT/mytxt-hash-for-page2.txt"
    },
    {
      "sequence": 3,
      "format": "txt",
      "path": "ZIP://TXT/mytxt-hash-for-page3.txt"
    }
  ]
}

This could be for a three-page book, with the ZIP file having this content:

myZipFile.zip
|---- datapackage.json
|---- hOCR
      |---- myhocr-hash-for-page1.html
      |---- myhocr-hash-for-page2.html
      |---- myhocr-hash-for-page3.html
|---- TXT
      |---- mytxt-hash-for-page1.txt
      |---- mytxt-hash-for-page2.txt
      |---- mytxt-hash-for-page3.txt
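As a very rough sketch (not part of any agreed plan) of how a service could open such a package with plain PHP, using only core ZipArchive/JSON calls and the layout above:

<?php
// Minimal sketch: open the package, read the manifest and list its hOCR resources.
$zip = new ZipArchive();
if ($zip->open('myZipFile.zip') !== TRUE) {
  die("Cannot open package\n");
}
// The manifest sits at the root of the ZIP.
$manifest = json_decode($zip->getFromName('datapackage.json'), TRUE);
foreach ($manifest['resources'] as $resource) {
  if ($resource['format'] !== 'html') {
    continue;
  }
  // Strip the draft "ZIP://" prefix to get the path inside the archive.
  $path = preg_replace('#^ZIP://#', '', $resource['path']);
  $hocr = $zip->getFromName($path);
  // ...hand $hocr over to the hOCR-to-JSON conversion / Solr indexing.
  echo $resource['sequence'] . ': ' . strlen($hocr) . " bytes\n";
}
$zip->close();
?>
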
giancarlobi commented 5 years ago

JSON response for Solr indexing

We no longer need the page sequence number, and we need the plain text, so a draft of a basic response useful for per-page Solr indexing could be:

{
  "hOCRTXT": {
    "parentID": 123,
    "fulltext": "Here the full plain text of this page...",
    "ocr_page": "0 0 2481 3508",
    "word": [
      "RESEARCH 610 924 913 972",
      "INSTITUTE 937 924 1221 972",
      "ON 1245 924 1324 972",
      "SUSTAINABLE 1349 924 1740 972",
      "ECONOMIC 1762 924 2065 972"
    ]
  }
}
giancarlobi commented 5 years ago
{
  "hOCRTXT": {
    "parentID": 123,
    "fulltext": "Here the full plain text of this page...",
    "ocr_page": "0 0 2481 3508",
    "word": [
      "RESEARCH 610 924 913 972",
      "INSTITUTE 937 924 1221 972",
      "ON 1245 924 1324 972",
      "SUSTAINABLE 1349 924 1740 972",
      "ECONOMIC 1762 924 2065 972"
    ]
  }
}

Here we also need an ID to assign to each page indexed in Solr. Can we use something simple like parentID+number, or is a hash value better?

DiegoPino commented 5 years ago

@giancarlobi so much good stuff here. I'm still busy with Thursday duties (normally my very own doomsday every week) but I have some ideas to expand the discussion.

we start with a ReadOnly service, right? Write/edit function will be added later, how and where have to be discussed

Read only. I have another idea for putting files; will discuss this later. I really don't want to expose security concerns via services exposed in the metadata. Those services are like an embedded JSON graph in that position; the only difference is that the graph is not "rendered" immediately 😄

service response hOCR2JSON don't have to include page sequence number while we need plaintext to have all data needed for Solr indexing

I think having the page number can help. We can still attach it to the Solr document and use it for ordering. Does that make sense? I need to explain myself better here... (after my calls!)

SBF JSON: we need a very simple Json piece, I think we don't need DO ID, so service url, ZIP file name/place/url and name and type (as into @DiegoPino mail).

Yes, true! Looking at https://www.w3.org/TR/activitystreams-vocabulary/#dfn-object (and see services there too, since the object properties apply to services) we just need a few properties.

About the ZIP itself: I think whatever goes into the data package manifest needs to be minimal and mostly about the files inside the ZIP only, kind of standalone and self-sustainable. We could push the metadata about the node there too (the full SBF JSON), but without Drupal connecting things it will make little sense for anyone just looking at the ZIP and the JSON. And since we want to allow people to add a ZIP file without knowing which node UUID it will end up in, we would be responsible for adding that data ourselves after ingest or on edits!

I still feel the idea of a star applies here: the main SBF JSON is the center, everything else is referenced from it but does not reference back. That way our ingest workflows, and also the duties needed to keep consistency (imagine moving or migrating binaries), can be simplified. Not saying I'm closed to the idea; I just feel the simpler the ZIP is, or said in a different way, the simpler things are at this stage of development, the easier and more decoupled our solution can be.

About

We also have to deal with TXT files, so I tried to draft a minimal datapackage.json compliant with the spec; the path syntax is probably not the right one:

Do you think converting the JSON into text on index could help us avoid adding the TXT? It would basically be iterating over 300-400 words. That would also let people (I know it's hard, but we can invent a way) fix their HOCR via the UI in the future.

That is all for now. Thanks so much, and sorry for the unordered response; I have to get into another call 😅

giancarlobi commented 5 years ago

About Solr and page indexing

service response hOCR2JSON don't have to include page sequence number while we need plaintext to have all data needed for Solr indexing

I think having the page number can help. We can still attach it to the Solr document and use it for ordering. Does that make sense? I need to explain myself better here... (after my calls!)

IMHO, as we have a Solr document for each page, why do we need the page sequence here? In addition, if the page sequence changes, would we then need to change each "word" line too? Or maybe I was not clear: I meant we don't need the page sequence in the "word" strings, while we do need the page sequence number in datapackage.json to order the pages.

We also have to deal with TXT files, so I tried to draft a minimal datapackage.json compliant with the spec; the path syntax is probably not the right one:

Do you think converting the JSON into text on index could help us avoid adding the TXT? It would basically be iterating over 300-400 words. That would also let people (I know it's hard, but we can invent a way) fix their HOCR via the UI in the future.

Well, I think it could be a really good idea. The plain text is needed for search and for highlighting; if we can rebuild the plain text by joining the single "word" entries from the JSON, I think we get the same result, or almost the same, minus blank lines, which are probably not that important. We have to check it.
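A quick sketch of that rebuild, assuming the per-page JSON drafted above (where each entry is "word x1 y1 x2 y2") sits in a file named page.json (name is just an example), could be:

<?php
// Rebuild the plain text of a page by dropping the coordinates from each "word" entry.
$page = json_decode(file_get_contents('page.json'), TRUE);
$words = array();
foreach ($page['hOCRTXT']['word'] as $entry) {
  $parts = explode(' ', $entry);
  // Keep everything except the last four tokens (x1 y1 x2 y2).
  $words[] = implode(' ', array_slice($parts, 0, count($parts) - 4));
}
$fulltext = implode(' ', $words);
echo $fulltext . "\n";
?>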

giancarlobi commented 5 years ago

About the ZIP itself: I think whatever goes into the data package manifest needs to be minimal and mostly about the files inside the ZIP only, kind of standalone and self-sustainable. We could push the metadata about the node there too (the full SBF JSON), but without Drupal connecting things it will make little sense for anyone just looking at the ZIP and the JSON. And since we want to allow people to add a ZIP file without knowing which node UUID it will end up in, we would be responsible for adding that data ourselves after ingest or on edits!

I fully agree, and the previous assumption (no more TXT in the ZIP) makes it simpler.

I still feel the idea of a star applies here: the main SBF JSON is the center, everything else is referenced from it but does not reference back. That way our ingest workflows, and also the duties needed to keep consistency (imagine moving or migrating binaries), can be simplified. Not saying I'm closed to the idea; I just feel the simpler the ZIP is, or said in a different way, the simpler things are at this stage of development, the easier and more decoupled our solution can be.

If I understand correctly, there is no "parent ID" in datapackage.json because the parent SBF JSON already includes the ZIP reference, so datapackage.json at this stage could be only a list of hOCR "resources" (and without TXT), like:

{
  "resources": [
    {
      "sequence": 1,
      "format": "html",
      "path": "ZIP://hOCR/myhocr-hash-for-page1.html"
    },
    {
      "sequence": 2,
      "format": "html",
      "path": "ZIP://hOCR/myhocr-hash-for-page2.html"
    },
    {
      "sequence": 3,
      "format": "html",
      "path": "ZIP://hOCR/myhocr-hash-for-page3.html"
    }
  ]
}
DiegoPino commented 5 years ago

Hola Giancarlo, I think I missed reading something; sorry, my fault.

IMHO, as we have a Solr document for each page, why do we need the page sequence here? In addition, if the page sequence changes, would we then need to change each "word" line too? Or maybe I was not clear: I meant we don't need the page sequence in the "word" strings, while we do need the page sequence number in datapackage.json to order the pages.

Yes, you are right, I wrote it incorrectly! We don't need the page in each word. Totally true; sorry, I was distracted. I was trying to say we just need a sequence, a single sequence in each Solr document, in case we need to order them.

If I understand correctly, there is no "parent ID" in datapackage.json because the parent SBF JSON already includes the ZIP reference, so datapackage.json at this stage could be only a list of hOCR "resources" (and without TXT), like:

Yes, that is what I mean: make moving the ZIP file easier, for migrating or reusing somewhere else. We could keep the book title? Or some metadata, if people want to understand what it refers to? What do you think?

giancarlobi commented 5 years ago

Hola Giancarlo, I think I missed reading something; sorry, my fault.

IMHO, as we have a Solr document for each page, why do we need the page sequence here? In addition, if the page sequence changes, would we then need to change each "word" line too? Or maybe I was not clear: I meant we don't need the page sequence in the "word" strings, while we do need the page sequence number in datapackage.json to order the pages.

Yes, you are right, I wrote it incorrectly! We don't need the page in each word. Totally true; sorry, I was distracted. I was trying to say we just need a sequence, a single sequence in each Solr document, in case we need to order them.

Don't worry, it's also my fault; a lot of words and my English is not so good... now it's all clear and I fully agree.

If I understand correctly, there is no "parent ID" in datapackage.json because the parent SBF JSON already includes the ZIP reference, so datapackage.json at this stage could be only a list of hOCR "resources" (and without TXT), like:

Yes, that is what I mean: make moving the ZIP file easier, for migrating or reusing somewhere else. We could keep the book title? Or some metadata, if people want to understand what it refers to? What do you think?

Well, the book title could be useful for users to have some type of "human" reference. I think the code should add the title to datapackage.json (like the other entries) to make the user's life simpler.

DiegoPino commented 5 years ago

Quick update on some research: I am planning on using these plugins (https://www.drupal.org/docs/8/modules/search-api/developer-documentation/available-plugin-types#datasources) to expose the HOCR, TechMD, etc. directly to Solr and bind them to the nodes. @giancarlobi 👀 the email with some links and ideas went that way. This should lower the number of lines of code and the need for more entities in our system by a LOT.

DiegoPino commented 5 years ago

@giancarlobi I will try to put today's agreement on how we will proceed with this in simple words, based on your research and current work. Please correct me if something is missing or wrong.

  1. We will create a Seasoner (Embedded JSON based Service) D8 Service that will deal with:
    • Defining how a service will be written (shape and content) into the SBF JSON (Open API)
    • Managing all possible Seasoners as custom plugins.
    • Reading/extracting data from the attached ZIP files or from any other source each one exposes
    • Exposing discovery functionality as an internal API endpoint for other services to consume
    • The Seasoner will also cache information to avoid double processing and accessing ZIP files many times. As a side note related to this point, the JSON-embedded Seasoner will keep track of its binary/ZIP source hash to avoid reprocessing data if the file has not changed.
  2. We will create (and are already creating!) a Search API DataSource plugin that will provide these extra Solr documents (example: a book of 100 pages = for 1 node, 100 extra ones with HOCR and TECHMD for each page). This DataSource plugin will invoke, for each node that contains an SBF, the Seasoner Service, which in turn will tell how many documents (if any) there are and provide the data needed for creating each Search API item. We will name this type of document "Flavors". A rough skeleton of such a datasource plugin is sketched after this list.
  3. We will create a Tracker class for these datasources and also give them a config form.
  4. We will create one Search API processor plugin for each type of Seasoner (which I guess will also be a custom plugin type) that will process/invoke the Seasoner, get the data, extract/massage it and add it as a field to each Flavor. The processors will do the automatic generation of Solr fields based on what each Seasoner can provide.
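As a rough, non-final skeleton of point 2 (module namespace, plugin id and class name here are placeholders; the actual property definitions and the Seasoner wiring are still to be designed), a Search API datasource plugin would look roughly like this:

<?php

namespace Drupal\my_module\Plugin\search_api\datasource;

use Drupal\search_api\Datasource\DatasourcePluginBase;

/**
 * Exposes per-page "Flavor" documents (HOCR, TechMD, ...) to Search API.
 *
 * @SearchApiDatasource(
 *   id = "strawberryfield_flavor",
 *   label = @Translation("Strawberryfield Flavors")
 * )
 */
class StrawberryfieldFlavorDatasource extends DatasourcePluginBase {

  /**
   * {@inheritdoc}
   */
  public function getPropertyDefinitions() {
    // One property per thing a Seasoner can provide (fulltext, sequence,
    // parent node id, ...); definitions are omitted in this sketch.
    return [];
  }

  /**
   * {@inheritdoc}
   */
  public function getItemIds($page = NULL) {
    // Ask the Seasoner service how many Flavor documents each SBF-bearing
    // node provides and return ids such as "nodeUuid:pageSequence".
    return [];
  }

  /**
   * {@inheritdoc}
   */
  public function loadMultiple(array $ids) {
    // Resolve each id back to the Seasoner output for that page so the
    // processors can turn it into Solr fields.
    return [];
  }

}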

We will then have some coffee (ristretto & americano) and test this whole thing until it's performant, fast and awesome.

I already marked some of the checkboxes at the top, since this agreed-on architecture covers those already.

I will work on the Seasoner services and a demo processor while you figure out the Datasource (definition and actual indexing into Solr)

This is great! Thanks so much

giancarlobi commented 3 years ago

@DiegoPino I think this is the right place to (re)start talking about hOCR. I installed this plugin (https://dbmdz.github.io/solr-ocrhighlighting/) into my Solr 8.7 and it seems to work. I inserted a doc from an hOCR file and the query looks good. A first note: the actual hOCR files I have on I7 need a makeup, because the plugin wants the ocr_page tag to have an id or a ppageno property to work. I checked that Tesseract adds id and ppageno to ocr_page while djvu2hocr doesn't. Well, I'll continue testing this; I'd like to understand whether one doc per page or multiple pages per doc is best (for Archipelago).

DiegoPino commented 3 years ago

@giancarlobi thanks. Please let me know how this moves forward and how your experiments go. If the HOCR still needs to be updated, then I will at least advance on my side with the frictionless data package/SBF option. HOCR is normally small in size, but we may need to have full control. Thanks!

giancarlobi commented 3 years ago

@DiegoPino The hOCR could be managed with xmlstarlet, i.e. with this:

xmlstarlet ed -u '//_:div[@class="ocr_page"]/@title' -v "$(xmlstarlet sel -t -v '//_:div[@class="ocr_page"]/@title' p1.html); ppageno 0" p1.html | xmlstarlet ed -a '//_:div[@class="ocr_page"]' -t attr -n id -v page_001 > p1.xml

you add id and ppageno to p1.html and output to p1.xml. I also see that the plugin uses ppageno instead of id if both are present, so we only need one of ppageno or id. Also, ppageno must be 0 for the front cover (see http://kba.cloud/hocr-spec/1.2/#ppageno). Now (or better, tomorrow) more checks with this.
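For reference, after that makeup the ocr_page element ends up looking like this (coordinates from the earlier example; the id value is whatever we pass to xmlstarlet):

<div class="ocr_page" id="page_001" title="bbox 0 0 2481 3508; ppageno 0">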

giancarlobi commented 3 years ago

@DiegoPino for your info, a query to Solr like this: http://solr.server.url/solr/archipelago/select?hl.ocr.fl=ocr_text&hl=true&q=ocr_text%3Aissn&hl.ocr.absoluteHighlights=on produces this:

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"ocrdoc-1",
        "ocr_text":"/mnt/p1.xml",
        "timestamp":"2020-11-15T21:59:51.401Z",
        "_version_":1683465270885089280}]
  },
  "highlighting":{
    "ocrdoc-1":{}},
  "ocrHighlighting":{
    "ocrdoc-1":{
      "ocr_text":{
        "snippets":[{
            "text":"and enjoy open knowledge GIANCARLO BIRELLO, ANNA PERIN <em>ISSN</em> (print): 2421-5783 <em>ISSN</em> (on line): 2421-5562 Rapporto Tecnico",
            "score":1461.3793,
            "pages":[{
                "id":"0",
                "width":2481,
                "height":3508}],
            "regions":[{
                "ulx":1461,
                "uly":2757,
                "lrx":2295,
                "lry":2959,
                "text":"and enjoy open knowledge GIANCARLO BIRELLO, ANNA PERIN",
                "pageIdx":0},
              {
                "ulx":740,
                "uly":171,
                "lrx":2344,
                "lry":1836,
                "text":"<em>ISSN</em> (print): 2421-5783 <em>ISSN</em> (on line): 2421-5562 Rapporto Tecnico",
                "pageIdx":0}],
            "highlights":[[{
                  "ulx":1788,
                  "uly":174,
                  "lrx":1895,
                  "lry":213,
                  "text":"ISSN",
                  "parentRegionIdx":1}],
              [{
                  "ulx":1741,
                  "uly":244,
                  "lrx":1848,
                  "lry":283,
                  "text":"ISSN",
                  "parentRegionIdx":1}]]}],
        "numTotal":1}}},
  "highlighting":{}}
DiegoPino commented 3 years ago

Looks good. Still, I do not see yet how we could allow Drupal to find the values / connect to the original entity (no datasource). So this would require a custom Solr query to be executed outside of the normal Drupal way, right? Or we would need to add an ID that lets us manually build the logic to connect? Good tests!

giancarlobi commented 3 years ago

We can use your logic for the doc ID ingested into Solr, as for the PDF test runners. In addition to the file path, we can add everything we need. Is this the answer to your question?

giancarlobi commented 3 years ago

I ingested into Solr using this JSON in a POST:

{
  "id": "ocrdoc-1",
  "ocr_text": "/mnt/p1.xml"
}
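For reference, from the command line the same ingest can be done against the core used above (host and core name as in the earlier query), e.g.:

curl -X POST -H 'Content-Type: application/json' \
  'http://solr.server.url/solr/archipelago/update?commit=true' \
  -d '[{"id": "ocrdoc-1", "ocr_text": "/mnt/p1.xml"}]'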

giancarlobi commented 3 years ago

Also, for info, to install the plugin I copied the downloaded solr-ocrhighlighting-0.5.0.jar into /opt/solr/contrib/ocrsearch/lib/ and then modified the conf files (starting from your latest Solr 8 conf):

diff solrconfig.xml solrconfig.xml.ORI
90,93d89
<   <lib dir="${solr.install.dir:../../../..}/contrib/ocrsearch/lib" regex=".*\.jar" />
<
<
<
536,540d531
<
<
<   <!-- Add a new named search component that takes care of highlighting OCR field values. -->
<   <searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent" name="ocrHighlight" />
<

diff schema_extra_fields.xml schema_extra_fields.xml.ORI
91,92d90
<
< <field name="ocr_text" type="text_ocr" multiValued="false" indexed="true" stored="true" />

diff schema_extra_types.xml schema_extra_types.xml.ORI
217,235d216
< <!--
<   ocrHighlight
<   0.5.0
< -->
< <fieldtype name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
<   <analyzer type="index">
<     <charFilter class="de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory" />
<     <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
<     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
<     <filter class="solr.LowerCaseFilterFactory"/>
<     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
<   </analyzer>
<   <analyzer type="query">
<     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
<     <filter class="solr.LowerCaseFilterFactory"/>
<     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
<   </analyzer>
< </fieldtype>
<
DiegoPino commented 3 years ago

I like the POST ingest. So this means once ingested I do not need the file anymore? Can I remove it, or does the plugin still need access to it via local storage? It also means that, if this is a POST, I can simply put all the other data I need from the datasource in there...

giancarlobi commented 3 years ago

The plugin needs to access the file via local storage.

DiegoPino commented 3 years ago

Ok. So this won't work for S3 storage then. Still good to explore for people who, like you, have the resources to run everything locally, or for someone with a mixed use case where we can have at least the OCR locally available... (I wonder how much space 1,000,000 HOCR files would take... maybe it's not much.)

I will also ask Johannes if there is some remote fetch planned? Maybe the file can be a URL in the future.

giancarlobi commented 3 years ago

But the space used in Solr is really reduced compared to indexing the complete hOCR with every word + coordinates.

giancarlobi commented 3 years ago

Really S3 doesn't allow that? opsssssss

DiegoPino commented 3 years ago

Yes, I agree. It is one thing or the other: either Solr takes the space or the filesystem does. Makes sense.

DiegoPino commented 3 years ago

Not from the OS directly. In PHP it is treated as local (in the sense that it allows file operations, stats, cp, etc., because it is a streamwrapper), but Linux needs drivers for that (s3fs): https://www.nakivo.com/blog/mount-amazon-s3-as-a-drive-how-to-guide/ Which, yes, could help.

giancarlobi commented 3 years ago

Well, we don't have to use this plugin; we can take ideas from it. Better if we can talk about this in a call next week, amigo.

DiegoPino commented 3 years ago

I like the plugin! It feels like one less thing to code, and I like that. Yes, let's test more. I can see this as a better option in many cases.

giancarlobi commented 3 years ago

Ok, more tests and we'll discuss it in our next talk.

giancarlobi commented 3 years ago

@DiegoPino a few more steps:

1) I mounted the data shared folder on the Solr VM using sshfs, as for the Archipelago VM.
2) I wrote a bash script to create an hOCR file (with page id and ppageno) for each page of a PDF file (it could be better; in the meantime it works; an invocation example is at the end of this comment):

#!/bin/bash
# Create one hOCR file (with page id and ppageno) per page of the input PDF.

NPAGES=$(qpdf --show-npages $1)
echo "Pages number: "$NPAGES

# Convert the whole PDF to DjVu first, so djvu2hocr can extract the text layer.
pdf2djvu --no-metadata -j0 --guess-dpi -o full.djv $1
echo "DJVU file created"

PAGE=1

while [ $PAGE -le $NPAGES ]
do
   echo "Page: "$PAGE
   # ppageno is zero-based (0 for the front cover).
   PPAGENO=$(( $PAGE - 1))
   # Add the id attribute and append "; ppageno N" to the ocr_page title, as the Solr plugin expects.
   djvu2hocr -p $PAGE full.djv | xmlstarlet fo -D | xmlstarlet ed -a '//_:div[@class="ocr_page"]' -t attr -n id -v 'page_'$PAGE | xmlstarlet ed -u '//_:div[@class="ocr_page"]/@title' -v "$(djvu2hocr -p "$PAGE" full.djv | xmlstarlet fo -D | xmlstarlet sel -t -v '//_:div[@class="ocr_page"]/@title'); ppageno "$PPAGENO > page_$PAGE.xml
   PAGE=$(( $PAGE + 1 ))
done

3) I checked a Solr upload with multiple pages as:

{
    "id": "ocrdoc-1",
    "ocr_text": "/mnt/archicantadata/hocrtest/page_1.xml+/mnt/archicantadata/hocrtest/page_2.xml+/mnt/archicantadata/hocrtest/page_3.xml"
}

but I'm not sure this is a good solution: fewer docs in Solr, but with 1,000 pages how long does that line get, and does it overflow??

Well, more tests now!
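For completeness, assuming the script from point 2 above is saved as pdf2hocr.sh (the name is just an example), it would be invoked as:

./pdf2hocr.sh mydocument.pdf

which leaves full.djv and one page_N.xml per page in the current directory.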

DiegoPino commented 3 years ago

Great stuff. Yes, I wonder how much gain it is; probably more tests on your side will help. I will think about this. Also, I do not think I have djvu2hocr and pdf2djvu in my Docker container; I should add them just in case!

giancarlobi commented 3 years ago

@DiegoPino I started playing with MiniOCR. Here is a first pretty simple script to convert from hOCR (djvu output) to MiniOCR; the next step will be to check updating Solr with this as an inline doc.

<?php
// Convert an hOCR file (djvu2hocr output) to the plugin's MiniOCR format.
// -i: input hOCR file, -p: page id to use in the output.
$val = getopt("i:p:");

$xml = simplexml_load_file($val['i']);

echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page) {
  // Page title is 'bbox x0 y0 xmax ymax'; use xmax/ymax as the page width/height.
  $coos = explode(" ", substr($page['title'], 5));
  echo '<p id="' . $val['p'] . '" wh="' . $coos[2] . " " . $coos[3] . '">' . "\n";
  echo '<b>' . "\n";
  foreach ($page->children() as $line) {
    echo '<l>';
    foreach ($line->children() as $word) {
      // Word title is 'bbox x1 y1 x2 y2'.
      $wcoos = explode(" ", $word['title']);
      echo '<w x="' . $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4] . '">' . $word . '</w> ';
    }
    echo '</l>' . "\n";
  }
  echo '</b>' . "\n";
  echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?>
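Assuming the script is saved as hocr2miniocr.php (the name is just an example), converting one of the pages generated by the bash script above would look like:

php hocr2miniocr.php -i page_1.xml -p page_1 > page_1_mini.xml

producing MiniOCR along the lines of <p id="page_1" wh="2481 3508"><b><l><w x="610 924 913 972">RESEARCH</w> ...</l></b></p>.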
DiegoPino commented 3 years ago

Great. I think this is something we will have to concede: I liked our JSON representation better, but this one works with the plugin, so let's go with it!

DiegoPino commented 2 months ago

Closing as resolved!