Closed DiegoPino closed 2 months ago
I will start by pasting some things I found while reading (via email) Giancarlo's great ideas and possible JSON formats, to add to the discussion.
Keep the links and ideas coming!
A first JSON draft with hOCR or better word coordinates:
"hOCR": {
"ocr_page": [
"0 0 2481 3508 1",
"0 0 2481 3508 2",
"0 0 2481 3508 3",
...
],
"word": [
"RESEARCH 610 924 913 972 1",
"INSTITUTE 937 924 1221 972 1",
"ON 1245 924 1324 972 1",
"SUSTAINABLE 1349 924 1740 972 1",
"ECONOMIC 1762 924 2065 972 1",
...
"Direttore 471 535 626 564 2",
"Secondo 791 534 932 564 2",
"Rolfo 944 534 1037 564 2",
"Direzione 462 640 626 672 2",
"CNR-IRCRES 791 639 1033 669 2",
"Istituto 789 688 907 717 2",
...
"Byterfly 1214 2738 1366 2781 3",
"and 1388 2738 1453 2771 3",
"enjoy 1474 2738 1575 2781 3",
"open 1596 2748 1685 2780 3",
"knowledge, 1705 2738 1915 2781 3",
...
]
}
`ocr_page` lists, for each page: x0, y0, xmax, ymax, page sequence number. This could be useful when we need to resize/reindex word coordinates.
`word` lists, for each OCR'd word: word, x1, y1, x2, y2, page sequence number. All "word" elements could be retrieved into an array to manage and pass to BookReader.
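Since the draft above stores coordinates relative to the original page size, resizing means rescaling every bbox. A minimal sketch of that reindexing (the helper name and the rounding choice are ours, not part of any agreed format):

```php
<?php
// Hypothetical helper: rescale one "word x1 y1 x2 y2 pageseq" entry by
// horizontal/vertical factors derived from the old and new page sizes,
// e.g. $sx = $new_width / $old_width taken from the ocr_page entry.
function scale_word(string $entry, float $sx, float $sy): string {
    $parts = explode(' ', $entry);
    $page = array_pop($parts);          // trailing page sequence number
    $coords = array_splice($parts, -4); // x1 y1 x2 y2
    $scaled = [
        (int) round($coords[0] * $sx),
        (int) round($coords[1] * $sy),
        (int) round($coords[2] * $sx),
        (int) round($coords[3] * $sy),
    ];
    return implode(' ', array_merge($parts, $scaled, [$page]));
}
```

Applied to every "word" entry (and to the matching "ocr_page" bbox) this keeps the JSON consistent after a resize.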
Solr: we need to index the "word" strings to store coordinates, plus a full-text version of the DO for searching and highlighting.
The JSON above could be generated from hOCR, PDF (by Djvu tool) or DJVU.
A very basic PHP script to convert a multi-page hOCR file into the JSON above:
<?php
// Convert a multi-page hOCR file into the JSON structure above.
// Assumes every title attribute starts with "bbox " (5 chars), as
// Tesseract and djvu2hocr emit.
$xml = simplexml_load_file("ptot.html");
$hOCR = array();
$page_seq = 0;
foreach ($xml->body->children() as $page) {
    $page_seq += 1;
    // "bbox x0 y0 xmax ymax" -> "x0 y0 xmax ymax <page_seq>"
    $hOCR['ocr_page'][] = substr($page['title'], 5) . " " . $page_seq;
    foreach ($page->children() as $line) {
        foreach ($line->children() as $word) {
            // "<word> x1 y1 x2 y2 <page_seq>"
            $hOCR['word'][] = $word . " " . substr($word['title'], 5) . " " . $page_seq;
        }
    }
}
$hOCR_json = json_encode($hOCR, JSON_PRETTY_PRINT);
echo $hOCR_json;
?>
@giancarlobi thinking out loud here and bringing today's call into this issue.
Using the reference numbers you shared with me: if a single page has 300 words, and a huge book can have 3,000 pages, that leads to 900,000 entries counting all the JSON "hOCR.*.word" entries. That is way too much for Solr to handle inside a single document as multivalued field values. Too much memory.
So we do need a sub-entity to keep track of each page, and that entity needs to be bound to the primary node. Books without OCR won't need it, so I would keep the list of images inside the main JSON anyway. There are other use cases, like TECHMD (again, the idea of datastreams permeates), where that could be something we need too.
What I fear is the burden of managing 3,000 extra nodes for a single book node. Nodes are an expensive entity if we are really just using them to push to Solr and never really making any other use of them. Our access to HOCR really only happens at ingest/index and edit time, never during normal access.
And yes, we need to handle small books and large books in the same way.
Some ideas i already shared with you:
class SolrDocument extends DatasourcePluginBase implements PluginFormInterface {
I'm willing to test an alternative approach that deals just with external data (we ingest it, but it still has no NODE in drupal) that connects back to a real node as a reference. It will take me some time to build some demo plugins and test.
class ContentEntity extends DatasourcePluginBase implements EntityDatasourceInterface, PluginFormInterface {
That deals with referenced files. I mean, files are entities and we keep track of their usage. So indexing a file's text content based on mixed values from the referencing node (using strawberryfield data like the page number) could work.
That way a strawberryfield JSON could have this structure
"hocr": [
{
"url": "s3://hash/hashhash-hocr-page1.html",
"sequence": 1,
"lang": "it"
},
{
"url": "s3://hash2/hashhash-hocr-page2.html",
"sequence": 2,
"lang": "it"
}
]
The url could also be another Node. I don't see why not. It's just more code.
That way we could even go further and A) use an external HTTP source url for the HOCR (imagine a 7.x repo) or B) ZIP all of them and then reference the internal name via a ZIP stream wrapper (zip://). I would need to come up with a solution that is small but flexible enough in that JSON structure for all the use cases.
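For the zip:// option, PHP's zip stream wrapper addresses an entry inside an archive as `zip://<archive-path>#<entry-name>`, so a single page's HOCR can be read without extracting the whole bundle. A sketch (the paths and helper name are illustrative only):

```php
<?php
// Build a zip:// stream URI for one entry inside an archive.
// PHP's wrapper syntax is zip://<path-to-zip>#<entry-name>.
function zip_entry_uri(string $zipPath, string $entry): string {
    return 'zip://' . $zipPath . '#' . $entry;
}

// Illustrative read; returns false if the archive or entry is missing.
$hocr = @file_get_contents(
    zip_entry_uri('/data/allmyhocr-hash-for-node1.zip', 'hashhash-hocr-page1.html')
);
```

The same URI shape works with `simplexml_load_file()`, so the hOCR-to-JSON converter above could consume pages straight from the ZIP.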
This also leads to a new 💡 idea i just had while writing this (not about if we need more than a node, but about how we describe in the main node the HOCR)
Let's say we have a special key named "service", like the one used in our strawberryfield `as:generator`, that says:
"ap:hocr": {
"url": "http:\/\/localhost:8001\/hocrfromzip\/hocrfornode1",
"name": "hocr endpoint",
"source_url": "s3://allmyhocr-hash-for-node1.zip",
"type": "Service"
}
and that endpoint returns a list of files, or HOCR elements that we can then push to Solr?
(the `ap:` here is to avoid mixing user generated metadata with system generated metadata, but for this custom work we could either namespace to IIIF, or to `ap`, the archipelago one, or even `dr` for drupal.)
This could be an alternative to many needs. If we have a `service` endpoint, that service URL is responsible for dealing with extracting from the ZIP, or processing HTML into JSON (like the manifest thing), or complementing the data with parent info, who knows what else? We can allow plugins there and other folks can extend. To be honest I have not fully explored this idea, but I prefer to paste it here before it gets lost, and my brain seems to be happy with its logic 😁. IIIF uses `service` a lot. Maybe it makes sense for us too, because then that service can provide a ZIP download for people (from `source_url`), but for us a full list of pages and words... same metadata, smaller JSON? Portability becomes an issue, but we can always default to `source_url` when importing into a new system.
Just too many ideas. I will start coding tomorrow. Thanks for reading until this point. This is a lot!
@DiegoPino I love your last but not least idea, I feel it could be the right way to deal with this and a lot more issues. I'll try to add more brainstorming to this.
`service` endpoints are a really fast, simple and elegant way to manage these kinds of needs. `ap:hocr` belongs to Descriptive Metadata, so the JSON philosophy is safe. Well @DiegoPino, I think we are on the right way!!
Thinking forward, how do we have to build the ZIP?
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
</head>
<body>
<div class="ocr_page" title="bbox 0 0 2481 3508">
<span class="ocrx_line" title="bbox 521 853 2349 901">
<span class="ocrx_word" title="bbox 521 853 766 901">ISTITUTO</span>
<span class="ocrx_word" title="bbox 787 853 840 900">DI</span>
<span class="ocrx_word" title="bbox 864 853 1109 901">RICERCA</span>
...
</span>
<span class="ocrx_line" title="bbox 610 924 2346 972">
<span class="ocrx_word" title="bbox 610 924 913 972">RESEARCH</span>
<span class="ocrx_word" title="bbox 937 924 1221 972">INSTITUTE</span>
<span class="ocrx_word" title="bbox 1245 924 1324 972">ON</span>
...
</span>
...
</div>
</body>
</html>
@giancarlobi thanks for your feedback, all your points are true, still there will be some exploring and a lot of documenting. Services will have to be exposed as such and we will have to have a way of letting json/webforms, etc know they are there for this =)
Yes to 1 file per page, as many files as pages, any file name (see my answer to that further down). I do still like your HOCR as JSON instead of the original HTML file, but we can allow both if we describe things correctly. The benefit of HOCR directly is that the user needs less processing; the benefit of the JSON is that it's slimmer and we can, at archipelago, consume it right away =)
ZIP file as the only format? I thought about it because it feels like the most compact way of storing something that we don't need to access all the time. Data will be pushed into Solr (somehow, still need to solve it!) but yes, we need to define what is inside, how to access it, etc. in a simple (again, simple is so complicated sometimes) way.
I was thinking about creating a type of manifest JSON file for defining which files are inside the ZIP, etc. That manifest would need to exist in each ZIP file. That manifest, with some format we decide, can tell the service what to expect, what is in there and how to use it (maybe, maybe that is too much). That also allows us to add some basic preservation metadata there that we could also bring back (for performance needs) into the main strawberryfield JSON.
Services could even refer to other nodes or lists of files if it is needed, but for HOCR the ZIP idea seems simple enough.
One standard way i really like is this one (and by doing so we can integrate other communities)
https://frictionlessdata.io/docs/data-package/
We probably don't need the full spec here because our need is pretty concise but since those nice people have this https://github.com/frictionlessdata/datapackage-php we can even consume those packages. So the manifest could indeed be a data package =)
This also opens the door for something else! Datasets from science... =)
What do you think? If you like the idea (or have any other ideas) let me know. Maybe we should start by (pseudo plan)
It's a lot! But it seems so fun and useful. I'm really happy you liked the idea!
I'll study this and answer you later, great!!!
> Yes to 1 file per page, as many files as pages, any file name (see my answer to that later down), i do still like your HOCR as JSON instead of the original HTML file, but we can allow both if we describe things correctly. Benefit of HOCR directly is that the user needs less processing, the benefit of the JSON is that its slimmer and we can, at archipelago, consume it right away =)

I think less user processing is better; we can use hOCR2JSON as the JSON returned by the service, or something like this.
> ZIP file as only format?. I thought about it because it feels its the more compact way of storing something that we don't need to access the whole time. Data will be pushed into Solr (somehow, still need to solve it!) but yes, we need to define what is inside, how to access it, etc in a simple (again, simple is so complicated sometimes) way.
> I was thinking about a creating a type of manifest json file for defining what files are inside the json, etc. That manifest would need to exist in each ZIP file. That manifest, with some format we decide can tell the service what to expect, what is in there and how to use? (maybe, maybe that is too much) . That also allows us to add some basic preservation metadata there that we could also bring back (for performance needs) into the main strawberryfield json.

Great idea, we can include a copy of the main SBF JSON into datapackage.json ("...a descriptor MAY include any number of properties in addition to those described as required and optional properties...") besides the hOCR files in resources.
> Services could even refer to other nodes or lists of files if it is needed, but for HOCR the ZIP idea seems simple enough.

So, the idea is: a ZIP file, within it a datapackage.json file and as many hOCR files as pages.
> One standard way i really like is this one (and by doing so we can integrate other communities)
> https://frictionlessdata.io/docs/data-package/
> We probably don't need the full spec here because our need is pretty concise but since those nice people have this https://github.com/frictionlessdata/datapackage-php we can even consume those packages. So the manifest could indeed be a data package =)

I fully agree with this!
> This also opens the door for something else! Datasets from science... =)

Now I'll study the pseudo-draft-plan.
I reordered the pseudo-plan a little bit, following my thoughts:
1) ZIP file format: a datapackage.json file and a folder (hOCR?) with the hOCR files, sorted by file name == page sequence. A folder could be useful to manage the hOCR and to leave space for other folders (e.g. SBF JSON as backup).
NB1: while writing, I was thinking about the plain text of the pages, which is needed to index into Solr. Where to put it? It could go into the ZIP, in a folder (TXT?) with the same logic as hOCR (file name order == page sequence), with the file names added to datapackage.json.
NB2: datapackage.json must also include the DO ID of the parent the pages refer to.
2) service:
We have to design these three points before starting with a prototype, IMHO. I'll try to add something more useful in my next comment.
ZIP file and datapackage.json: we also have to deal with the TXT files, so I'll try to draft a minimal datapackage.json compliant with the spec; the path syntax probably is not the right one:
{
"parentID": 123,
"resources": [
{
"sequence": 1,
"format": "html",
"path": "ZIP://hOCR/myhocr-hash-for-page1.html"
},
{
"sequence": 2,
"format": "html",
"path": "ZIP://hOCR/myhocr-hash-for-page2.html"
},
{
"sequence": 3,
"format": "html",
"path": "ZIP://hOCR/myhocr-hash-for-page3.html"
},
{
"sequence": 1,
"format": "txt",
"path": "ZIP://TXT/mytxt-hash-for-page1.txt"
},
{
"sequence": 2,
"format": "txt",
"path": "ZIP://TXT/mytxt-hash-for-page2.txt"
},
{
"sequence": 3,
"format": "txt",
"path": "ZIP://TXT/mytxt-hash-for-page3.txt"
}
]
}
This could be relative to a three-page book, where the zip file has this content:
myZipFile.zip
|---- datapackage.json
|---- hOCR
|    |---- myhocr-hash-for-page1.html
|    |---- myhocr-hash-for-page2.html
|    |---- myhocr-hash-for-page3.html
|---- TXT
|    |---- mytxt-hash-for-page1.txt
|    |---- mytxt-hash-for-page2.txt
|    |---- mytxt-hash-for-page3.txt
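Reading the manifest back, a service would decode datapackage.json and walk the `resources` array in `sequence` order. A sketch against the draft manifest above (the function name is ours, not an agreed API):

```php
<?php
// Given the decoded datapackage.json draft above, return the hOCR entry
// paths ordered by page sequence (TXT entries are filtered out).
function hocr_paths_in_order(array $datapackage): array {
    $hocr = array_filter($datapackage['resources'], function (array $r): bool {
        return $r['format'] === 'html';
    });
    usort($hocr, function (array $a, array $b): int {
        return $a['sequence'] <=> $b['sequence'];
    });
    return array_column($hocr, 'path');
}
```

Swapping `'html'` for `'txt'` gives the plain-text files in the same page order.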
JSON response for Solr indexing: we no longer need the page sequence number, and we do need plain text, so a draft of a basic response useful for per-page Solr indexing could be:
{
"hOCRTXT": {
"parentID": 123,
"fulltext": "Here the full plain text of this page...",
"ocr_page": "0 0 2481 3508",
"word": [
"RESEARCH 610 924 913 972",
"INSTITUTE 937 924 1221 972",
"ON 1245 924 1324 972",
"SUSTAINABLE 1349 924 1740 972",
"ECONOMIC 1762 924 2065 972"
]
}
}
Here we also need an ID to assign to each page we index in Solr. Can we use something like a simple parentID+number? Or is a hash value better?
@giancarlobi so much good stuff here. I'm still busy with Thursday duties (normally my very own doomsday every week) but I have some ideas to expand the discussion.
> we start with a ReadOnly service, right? Write/edit function will be added later, how and where have to be discussed

Read only. I have another idea for putting files; will discuss this later. I really don't want to expose security concerns via services exposed in the metadata. Those services are like an embedded JSON graph in that position; the only difference is the graph is not "rendered" immediately 😄

> service response hOCR2JSON don't have to include page sequence number while we need plaintext to have all data needed for Solr indexing

I think having the page number can help. We can still attach it to the Solr document and use it for ordering. Does that make sense? I need to explain myself better here.. (after my calls!)
> SBF JSON: we need a very simple Json piece, I think we don't need DO ID, so service url, ZIP file name/place/url and name and type (as into @DiegoPino mail).

Yes, true! Looking at https://www.w3.org/TR/activitystreams-vocabulary/#dfn-object (and see services there too, since the object properties apply to services) we just need a few properties.
About the ZIP itself: I think whatever goes into the data package manifest needs to be minimal and mostly about the files inside the ZIP only, kind of standalone and self-sustainable. We could push the metadata about the node there too (the full SBF JSON), but without Drupal connecting things it will make little sense for anyone just looking at the ZIP and the JSON, and since we want to allow people to add a ZIP file without knowing in which NODE UUID it will end up, we would be responsible for adding that data ourselves after ingest or on edits!
I still feel the idea of a star applies here: the main SBF JSON is the center, everything else is referenced from it but does not reference back. That way our ingest workflows, and also the duties needed to keep consistency (imagine moving, migrating, etc. of binaries), can be simplified. Not saying I'm closed to the idea; I just feel the simpler the ZIP (or, said in a different way, the simpler at this stage of development), the easier and more decoupled our solution can be.
About:
> We have to deal also with TXT files so I try to draft a minimal datapackage.json compliant to spec, the path syntax probably is not the right one:

Do you think converting the JSON into text at index time could help us avoid adding the TXT? We would basically be iterating over 300-400 words. That would also allow us (I know it's hard, but we can invent a way) to let people fix their HOCR via the UI in the future.
That is all for now, thanks so much, and sorry for the unordered response, I have to get into another call 😅
About Solr and page indexing
> service response hOCR2JSON don't have to include page sequence number while we need plaintext to have all data needed for Solr indexing
> I think having the page number can help. We can still attach it to the Solr document and use if for ordering. Does that make sense? I need to explain myself better here.. (after my calls!)

IMHO, as we have a Solr document for each page, why do we need the page sequence here? In addition, if the page sequence changes, do we then need to change each "word" line too? Or maybe I was not clear: I meant we don't need the page sequence inside the "word" strings, while we do need the page sequence number in datapackage.json to order the pages.
> We have to deal also with TXT files so I try to draft a minimal datapackage.json compliant to spec, the path syntax probably is not the right one:
> Do you think converting JSON into text on index could help us avoid adding the TXT? I would be basically iterating over 300-400 words? That would allow us to allow people (i know its hard, but we can invent a way) to fix their HOCR via UI in the future

Well, I think it could be a really good idea. The plain text is needed for searching and highlighting; if we can rebuild the plain text by joining the single "word" entries from the JSON, I think we get the same result, or almost, just without blank lines, which probably are not so important. We have to check it.
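Rebuilding plain text from the per-page JSON can be sketched by stripping the four trailing coordinate tokens from each "word" entry and joining with spaces (the helper name is ours):

```php
<?php
// Rebuild page plain text from "word x1 y1 x2 y2" entries, the per-page
// format drafted above (no page sequence number in each entry).
function words_to_fulltext(array $words): string {
    $tokens = array_map(function (string $entry): string {
        $parts = explode(' ', $entry);
        // Drop the trailing four coordinate tokens, keep the word.
        return implode(' ', array_slice($parts, 0, count($parts) - 4));
    }, $words);
    return implode(' ', $tokens);
}
```

The result could populate the "fulltext" key of the hOCRTXT draft directly at index time, so no TXT file is needed in the ZIP.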
> About the ZIP itself. I think whatever goes into the data package manifest needs to be minimal and mostly about the files inside the ZIP only, kinda standalone and self sustainable. We could push the metadata about the node (full STB json there too) but without Drupal connecting things will make little sense for anyone just looking at the ZIP and the JSON and since we want to allow people to add a ZIP file without knowing in what NODE UUID, it will end, we would be responsible of adding that data ourselves after ingest or on edits!

I fully agree, and the previous assumption (no more TXT in the ZIP) makes it simpler.
> I still feel the idea of a star applies here. main STB JSON is the center, everything else is referenced from it, but does not references back. That way our ingest workflows and also duties needed to keep consistency (imagine moving, migrating, etc binaries) can be simplified. Not saying i'm closed to the idea, i just feel the simpler the ZIP, or said in a different way, the simpler in this stage of development, the easier and decoupled our solution can be.

If I understand correctly: no "parent ID" in datapackage.json, because the parent SBF JSON already includes the ZIP reference, so datapackage.json at this stage could be only a list of hOCR "resources" (and without TXT), like:
{
"resources": [
{
"sequence": 1,
"format": "html",
"path": "ZIP://hOCR/myhocr-hash-for-page1.html"
},
{
"sequence": 2,
"format": "html",
"path": "ZIP://hOCR/myhocr-hash-for-page2.html"
},
{
"sequence": 3,
"format": "html",
"path": "ZIP://hOCR/myhocr-hash-for-page3.html"
}
]
}
Hola Giancarlo, I think I missed reading something, sorry, my fault.
> IMHO as we have a Solr document for each page, why we need page sequence here? in addition, if page sequence change then we need to change also each "word" lines? or I was not clear, I meant we don't need page sequence into "word" string, at the end, while we need sequence page number into datapackage.json to order pages.

Yes, you are right, I wrote it incorrectly! We don't need the page in each word. Totally true. Sorry, I was distracted. I was trying to say we just need a sequence, a single sequence in each Solr document, in case we need to order them.
> If I correctly understand, no "parent ID" into datapackage.json because parent STB JSON already includes ZIP reference, so datapackage.json at this stage could be only a list of "resources" hOCR (and without TXT) like:

Yes, that is what I mean: make moving the ZIP file easier, and migrating or reusing it somewhere else. We could maintain the book title? Or some metadata if people want to understand what it refers to? What do you think?
> Hola Giancarlo, i think i missed reading something, sorry, my fault.
> IMHO as we have a Solr document for each page, why we need page sequence here? in addition, if page sequence change then we need to change also each "word" lines? or I was not clear, I meant we don't need page sequence into "word" string, at the end, while we need sequence page number into datapackage.json to order pages.
> Yes, you are right, i wrote it incorrectly! We don't need the page in each word. Totally true. Sorry i was distracted. I was trying to say we just need a sequence, a single sequence in each Solr document in case we need to order them.

Don't worry, also my fault, a lot of words and my English is not so good... now it's all clear and I fully agree.
> If I correctly understand, no "parent ID" into datapackage.json because parent STB JSON already includes ZIP reference, so datapackage.json at this stage could be only a list of "resources" hOCR (and without TXT) like:
> Yes, that is what i mean. Make moving the ZIP file easier, migrating or reusing somewhere else. We could maintain the book title? Or some metadata if people want to understand to what it refers? What do you think?

Well, the book title could be useful for users to have some type of "human" reference. I think the code should add the title to datapackage.json (as other entries) to make the user's life simpler.
Quick update on some research. Planning on using these plugins (https://www.drupal.org/docs/8/modules/search-api/developer-documentation/available-plugin-types#datasources) to expose the HOCR, TechMD, etc. directly to Solr and bind them to the nodes. @giancarlobi 👀 the email with some links and ideas went that way. This should lower the number of lines of code, and the need for more entities in our system, by a LOT.
@giancarlobi i will try putting our today's agreement on how we will proceed with this in simple words, based on your research and current work. Please correct me if something is missing or wrong
We will then have some coffee (ristretto & americano) and test this whole thing until it's performant, fast and awesome.
I already marked some of the checkboxes on top, since this agreed-on architecture covers those already.
I will work on the Seasoner services and a demo processor while you figure out the Datasource (definition and actual indexing into Solr)
This is great! Thanks so much
@DiegoPino I think this is the right place to (re)start talking about hOCR. I installed this plugin (https://dbmdz.github.io/solr-ocrhighlighting/) into my Solr 8.7 and it seems to work. I inserted a doc from an hOCR file and the query results look good. A first note: the actual hOCR files I have on I7 need a makeup, because the plugin wants the ocr_page tag to have an id or a ppageno property to work. I checked that Tesseract adds id and ppageno to ocr_page while djvu2hocr doesn't. Well, I'll continue testing this; I'd like to understand whether one doc per page or multiple pages per doc is best (for Archipelago).
@giancarlobi thanks. Please let me know how this moves forward and how your experiments go. If the HOCR still needs to be updated then I will at least advance on my side with the frictionless data package/SBF option. HOCR is normally small in size, but we may need to have full control. Thanks!
@DiegoPino The hOCR could be managed with xmlstarlet, e.g. with this:
xmlstarlet ed -u '//_:div[@class="ocr_page"]/@title' -v "$(xmlstarlet sel -t -v '//_:div[@class="ocr_page"]/@title' p1.html); ppageno 0" p1.html | xmlstarlet ed -a '//_:div[@class="ocr_page"]' -t attr -n id -v page_001 > p1.xml
you add id and ppageno to p1.html and output to p1.xml.
Also, I see the plugin uses ppageno instead of id if both are present, so we need only one of ppageno or id.
Also, ppageno must be 0 for front cover (see http://kba.cloud/hocr-spec/1.2/#ppageno).
Now (or better tomorrow) more check with this.
@DiegoPino for your info, a query to solr like this:
http://solr.server.url/solr/archipelago/select?hl.ocr.fl=ocr_text&hl=true&q=ocr_text%3Aissn&hl.ocr.absoluteHighlights=on
produces this:
{
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"ocrdoc-1",
"ocr_text":"/mnt/p1.xml",
"timestamp":"2020-11-15T21:59:51.401Z",
"_version_":1683465270885089280}]
},
"highlighting":{
"ocrdoc-1":{}},
"ocrHighlighting":{
"ocrdoc-1":{
"ocr_text":{
"snippets":[{
"text":"and enjoy open knowledge GIANCARLO BIRELLO, ANNA PERIN <em>ISSN</em> (print): 2421-5783 <em>ISSN</em> (on line): 2421-5562 Rapporto Tecnico",
"score":1461.3793,
"pages":[{
"id":"0",
"width":2481,
"height":3508}],
"regions":[{
"ulx":1461,
"uly":2757,
"lrx":2295,
"lry":2959,
"text":"and enjoy open knowledge GIANCARLO BIRELLO, ANNA PERIN",
"pageIdx":0},
{
"ulx":740,
"uly":171,
"lrx":2344,
"lry":1836,
"text":"<em>ISSN</em> (print): 2421-5783 <em>ISSN</em> (on line): 2421-5562 Rapporto Tecnico",
"pageIdx":0}],
"highlights":[[{
"ulx":1788,
"uly":174,
"lrx":1895,
"lry":213,
"text":"ISSN",
"parentRegionIdx":1}],
[{
"ulx":1741,
"uly":244,
"lrx":1848,
"lry":283,
"text":"ISSN",
"parentRegionIdx":1}]]}],
"numTotal":1}}},
"highlighting":{}}
Looks good. Still, I do not (yet) see how we could allow Drupal to find the values/connect to the original Entity (no data source). So this would require a custom Solr query to be executed outside of the normal Drupal way, right? Or would we need to add an ID that allows us to manually make the logic to connect? Good tests!
We can use your logic as the doc ID ingested into Solr, as for the PDF test runners. In addition to the file path, we can add everything we need. Is this the answer to your question?
I ingested into Solr using this json in POST: { "id": "ocrdoc-1", "ocr_text": "/mnt/p1.xml" }
Also, for info, to install the plugin I copied the downloaded solr-ocrhighlighting-0.5.0.jar into /opt/solr/contrib/ocrsearch/lib/, then modified the conf files (starting from your latest Solr 8 conf):
diff solrconfig.xml solrconfig.xml.ORI
90,93d89
< <lib dir="${solr.install.dir:../../../..}/contrib/ocrsearch/lib" regex=".*\.jar" />
<
<
<
536,540d531
<
<
< <!-- Add a new named search component that takes care of highlighting OCR field values. -->
< <searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent" name="ocrHighlight" />
<
diff schema_extra_fields.xml schema_extra_fields.xml.ORI
91,92d90
<
< <field name="ocr_text" type="text_ocr" multiValued="false" indexed="true" stored="true" />
diff schema_extra_types.xml schema_extra_types.xml.ORI
217,235d216
< <!--
< ocrHighlight
< 0.5.0
< -->
< <fieldtype name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
< <analyzer type="index">
< <charFilter class="de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory" />
< <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
< <tokenizer class="solr.WhitespaceTokenizerFactory"/>
< <filter class="solr.LowerCaseFilterFactory"/>
< <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
< </analyzer>
< <analyzer type="query">
< <tokenizer class="solr.WhitespaceTokenizerFactory"/>
< <filter class="solr.LowerCaseFilterFactory"/>
< <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
< </analyzer>
< </fieldtype>
<
I like the POST ingest. So this means once ingested I do not need the file anymore? Can I remove it, or does the plugin still need access to it via local storage? Also, if this is a POST, it means I can simply put all the other data I need from the datasource into it...
The plugin needs to access the file via local storage.
Ok. So this won't work for S3 storage then. Still good to explore for people, like you, with the resources to run it all locally, or for someone with a mixed use case where we can have at least the OCR locally available... (I wonder how much space 1,000,000 HOCR files would take.. maybe it's not much)
I will also ask Johannes if there is some planned remote fetch; maybe the file can, in the future, be a URL.
But the space used in Solr is really reduced compared to indexing the complete hOCR with every word + coordinates.
Really, S3 doesn't allow that? opsssssss
Yes, I agree. It is one thing or the other: either Solr takes the space or the filesystem does. Makes sense.
Not from the OS directly. In PHP it is treated as local (in the sense that it allows file operations, stats, cp, etc., because it is a stream wrapper) but Linux needs drivers for that (s3fs): https://www.nakivo.com/blog/mount-amazon-s3-as-a-drive-how-to-guide/ which, yes, could help.
Well, we don't have to use this plugin, we can take ideas from it. Better if we can talk about this in a conf call next week, amigo.
I like the plugin! It feels like one thing less to code, and I like that. Yes, let's test more. I can see this as a better option in many cases.
Ok, more test and we discuss it next talk.
@DiegoPino a few more steps. 1) I mounted the data shared folder on the Solr VM using sshfs, as for the Archipelago VM. 2) Wrote a bash script to create an hOCR file (with page ID and ppageno) for each page of a PDF file (could be better; in the meantime, it works):
#!/bin/bash
NPAGES=$(qpdf --show-npages $1)
echo "Pages number: "$NPAGES
pdf2djvu --no-metadata -j0 --guess-dpi -o full.djv $1
echo "DJVU file created"
PAGE=1
while [ $PAGE -le $NPAGES ]
do
echo "Page: "$PAGE
PPAGENO=$(( $PAGE - 1))
djvu2hocr -p $PAGE full.djv | xmlstarlet fo -D | xmlstarlet ed -a '//_:div[@class="ocr_page"]' -t attr -n id -v 'page_'$PAGE | xmlstarlet ed -u '//_:div[@class="ocr_page"]/@title' -v "$(djvu2hocr -p "$PAGE" full.djv | xmlstarlet fo -D | xmlstarlet sel -t -v '//_:div[@class="ocr_page"]/@title'); ppageno "$PPAGENO > page_$PAGE.xml
PAGE=$(( $PAGE + 1 ))
done
3) Checked a Solr upload with multiple pages as:
{
"id": "ocrdoc-1",
"ocr_text": "/mnt/archicantadata/hocrtest/page_1.xml+/mnt/archicantadata/hocrtest/page_2.xml+/mnt/archicantadata/hocrtest/page_3.xml"
}
but I'm not sure this is a good solution: fewer docs in Solr, but with 1000 pages, how long does that line get, and could it overflow??
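For reference, the '+'-joined value for N pages can be built like this (directory and naming follow the test above); the field value grows linearly with the page count, which is exactly the overflow concern:

```php
<?php
// Join per-page hOCR file paths with '+' to form the plugin's multi-page
// ocr_text value, as in the three-page example above.
function multi_page_ocr_text(string $dir, int $npages): string {
    $paths = [];
    for ($p = 1; $p <= $npages; $p++) {
        $paths[] = $dir . '/page_' . $p . '.xml';
    }
    return implode('+', $paths);
}
```

At roughly 40 bytes per path, 1000 pages is on the order of 40 KB of field value, so the string length itself is modest; whether the plugin handles that many concatenated sources well is the thing to test.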
Well more test now!
Great stuff. Yes, I wonder how much of a gain it is. Probably more tests on your side will help. I will think about this. Also, I do not think I have djvu2hocr and pdf2djvu in my docker container; I should add them just in case!
@DiegoPino I started playing with MiniOCR. Here is a first pretty simple script to convert from hOCR (djvu output) to MiniOCR; the next step will be to check updating Solr with this as an inline doc.
<?php
// Convert one hOCR file (djvu2hocr output) to MiniOCR.
// -i: input hOCR file, -p: page id to emit.
$val = getopt("i:p:");
$xml = simplexml_load_file($val['i']);
echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page) {
    // Page title is "bbox x0 y0 xmax ymax"; assumes x0/y0 are 0,
    // so xmax/ymax are the page width/height.
    $coos = explode(" ", substr($page['title'], 5));
    echo '<p id="' . $val['p'] . '" wh="' . $coos[2] . " " . $coos[3] . '">' . "\n";
    echo '<b>' . "\n";
    foreach ($page->children() as $line) {
        echo '<l>';
        foreach ($line->children() as $word) {
            // Word title is "bbox x1 y1 x2 y2".
            $wcoos = explode(" ", $word['title']);
            echo '<w x="' . $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4] . '">' . $word . '</w> ';
        }
        echo '</l>' . "\n";
    }
    echo '</b>' . "\n";
    echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?>
Great. I think this is something we will have to concede. I liked our JSON representation better, but this one works with the plugin, so let's go with it!
Closing as resolved!
USE CASE
With @giancarlobi we have been exploring the most efficient way of dealing with OCR and HOCR (or ALTO) in an Archipelago/strawberryfield environment. This issue is here to explore all the options, weigh the pros and cons of each approach, and also discuss how this affects daily operations like editing/re-OCR-ing and access.
Functional requirements for a valid solution are:
In general, what we are looking for is a combination/weighted approach that balances all these needs. We won't have a perfect solution that answers 100% of the requirements, but at least we can weigh them and discern which to prioritize.
Please feel free to add comments and ideas, questions and other functional requirements. Thanks!