crkn-rcdr / cihm-metadatabus

Documentation and Docker build environment for key portions of the metadata bus.
BSD 2-Clause "Simplified" License

Tool to import OCR data into Access from a directory created by wip-tdrexport #38

Closed RussellMcOrmond closed 2 years ago

RussellMcOrmond commented 2 years ago

We have a set of OCR files (covering tens of thousands of images) that were produced by exporting a SIP from the repository and running ABBYY on its images.

We need to create a tool (based on "importocr", or added to it as a feature) that will associate those files with the correct Canvases and initiate cache generation for all Manifests that contain those Canvases.

RussellMcOrmond commented 2 years ago

Next step is for @JLoitzenbauer-CRKN to let me know where the current OCR files are.

JLoitzenbauer-CRKN commented 2 years ago

WIP: _OCR - To do/_Russell_Ingest

This just has the Heritage reels. There are some other ones, too, but this should keep you busy for now.

RussellMcOrmond commented 2 years ago

I did a quick look at "eclipse:/media/crkn-nas-wip/_OCR - To do/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15869"

Wow -- March.

Starting work on a tool to scan the directories. The first check will be that there is a matching slug, since adding the OCR data will only work if the "Import into Access" has already been done for the matching images.
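A minimal sketch of that first check, not the actual tool. It assumes a reel directory is named exactly like its identifier, that the `oocihm` prefix seen in the tool output later in this thread applies to every reel, and that `known_slugs` is a hypothetical set standing in for whatever slug lookup the Access platform provides.

```python
import re
from pathlib import Path

# Assumption: a usable reel directory is named exactly like its identifier
# (e.g. "lac_reel_t15869"); names carrying extra notes are skipped.
REEL_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def candidate_ids(root, prefix="oocihm"):
    """Yield (directory, candidate ID) for reel directories two levels below root."""
    for batch in sorted(Path(root).iterdir()):
        if not batch.is_dir():
            continue
        for reel in sorted(batch.iterdir()):
            if reel.is_dir() and REEL_NAME.match(reel.name):
                yield reel, f"{prefix}.{reel.name}"


def split_by_slug(pairs, known_slugs):
    """Separate directories whose ID is a known slug from those that are not."""
    ready, missing = [], []
    for directory, ident in pairs:
        (ready if ident in known_slugs else missing).append((directory, ident))
    return ready, missing
```

Directories that land in `missing` would be left alone until the matching images have been imported.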

RussellMcOrmond commented 2 years ago

@JLoitzenbauer-CRKN,

Should I ignore the directory named "lac_reel_t15801 - missing many OCR files"?

russell@eclipse:/media/crkn-nas-wip/_OCR - To do/_Russell_Ingest$ find -type d 
.
./Batch 500 - Heritage - completed
./Batch 500 - Heritage - completed/lac_reel_t15869
./Batch 500 - Heritage - completed/lac_reel_t15870
./Batch 500 - Heritage - completed/lac_reel_t15867
./Batch 500 - Heritage - completed/lac_reel_t15868
./Batch 475 - heritage - done - verified
./Batch 475 - heritage - done - verified/lac_reel_t15796
./Batch 475 - heritage - done - verified/lac_reel_t15797
./Batch 475 - heritage - done - verified/lac_reel_t15795
./Batch 477 - Heritage - completed
./Batch 477 - Heritage - completed/lac_reel_t15596
./Batch 477 - Heritage - completed - verified
./Batch 477 - Heritage - completed - verified/lac_reel_t15596
./Batch 478 - Heritage - completed
./Batch 478 - Heritage - completed/lac_reel_t15801 - missing many OCR files
./Batch 480 - Heritage - completed
./Batch 480 - Heritage - completed/lac_reel_t15808
./Batch 480 - Heritage - completed/lac_reel_t15809
./Batch 480 - Heritage - completed/lac_reel_t15807
./Batch 480 - Heritage - completed/lac_reel_t15663
./Batch 482 - Heritage - completed
./Batch 482 - Heritage - completed/lac_reel_t15814
./Batch 482 - Heritage - completed/lac_reel_t15815
./Batch 482 - Heritage - completed/lac_reel_t15813
./Batch 488 - Heritage - completed - verified
./Batch 488 - Heritage - completed - verified/lac_reel_t15832
./Batch 488 - Heritage - completed - verified/lac_reel_t15833
./Batch 488 - Heritage - completed - verified/lac_reel_t15831
./Batch 490 - Heritage - completed - verified
./Batch 490 - Heritage - completed - verified/lac_reel_t15838
./Batch 490 - Heritage - completed - verified/lac_reel_t15839
./Batch 490 - Heritage - completed - verified/lac_reel_t15837
./Batch 486 - Heritage
./Batch 486 - Heritage/lac_reel_t15826
./Batch 486 - Heritage/lac_reel_t15827
./Batch 486 - Heritage/lac_reel_t15825
./Batch 487 - Heritage
./Batch 487 - Heritage/lac_reel_t15829
./Batch 487 - Heritage/lac_reel_t15830
./Batch 487 - Heritage/lac_reel_t15828
russell@eclipse:/media/crkn-nas-wip/_OCR - To do/_Russell_Ingest$ 

RussellMcOrmond commented 2 years ago

The tool now scans the directories; the output of a run is pasted below.

The next step is to actually load the OCR data into the Access platform.

tdr@6b67f2c709d7:~$ ocrload
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15869  (ID=oocihm.lac_reel_t15869)
Slugs of Manifests: oocihm.lac_reel_t15869
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15870  (ID=oocihm.lac_reel_t15870)
Slugs of Manifests: oocihm.lac_reel_t15870
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15867  (ID=oocihm.lac_reel_t15867)
Slugs of Manifests: oocihm.lac_reel_t15867
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15868  (ID=oocihm.lac_reel_t15868)
Slugs of Manifests: oocihm.lac_reel_t15868
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 475 - heritage - done - verified/lac_reel_t15796  (ID=oocihm.lac_reel_t15796)
Slugs of Manifests: oocihm.lac_reel_t15796
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 475 - heritage - done - verified/lac_reel_t15797  (ID=oocihm.lac_reel_t15797)
Slugs of Manifests: oocihm.lac_reel_t15797
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 475 - heritage - done - verified/lac_reel_t15795  (ID=oocihm.lac_reel_t15795)
Slugs of Manifests: oocihm.lac_reel_t15795
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 477 - Heritage - completed/lac_reel_t15596  (ID=oocihm.lac_reel_t15596)
No canvases found -- are the images in the directory?
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 477 - Heritage - completed - verified/lac_reel_t15596  (ID=oocihm.lac_reel_t15596)
Slugs of Manifests: oocihm.lac_reel_t15596
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15808  (ID=oocihm.lac_reel_t15808)
Slugs of Manifests: oocihm.lac_reel_t15808
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15809  (ID=oocihm.lac_reel_t15809)
Slugs of Manifests: oocihm.lac_reel_t15809
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15807  (ID=oocihm.lac_reel_t15807)
Slugs of Manifests: oocihm.lac_reel_t15807
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15663  (ID=oocihm.lac_reel_t15663)
Slugs of Manifests: oocihm.lac_reel_t15663
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 482 - Heritage - completed/lac_reel_t15815  (ID=oocihm.lac_reel_t15815)
Slugs of Manifests: oocihm.lac_reel_t15815
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15832  (ID=oocihm.lac_reel_t15832)
Slugs of Manifests: oocihm.lac_reel_t15832
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15833  (ID=oocihm.lac_reel_t15833)
Slugs of Manifests: oocihm.lac_reel_t15833
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15831  (ID=oocihm.lac_reel_t15831)
Slugs of Manifests: oocihm.lac_reel_t15831
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 490 - Heritage - completed - verified/lac_reel_t15838  (ID=oocihm.lac_reel_t15838)
Slugs of Manifests: oocihm.lac_reel_t15838
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 490 - Heritage - completed - verified/lac_reel_t15839  (ID=oocihm.lac_reel_t15839)
Slugs of Manifests: oocihm.lac_reel_t15839
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 490 - Heritage - completed - verified/lac_reel_t15837  (ID=oocihm.lac_reel_t15837)
Slugs of Manifests: oocihm.lac_reel_t15837
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 486 - Heritage/lac_reel_t15826  (ID=oocihm.lac_reel_t15826)
Slugs of Manifests: oocihm.lac_reel_t15826
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 486 - Heritage/lac_reel_t15827  (ID=oocihm.lac_reel_t15827)
Slugs of Manifests: oocihm.lac_reel_t15827
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 486 - Heritage/lac_reel_t15825  (ID=oocihm.lac_reel_t15825)
Slugs of Manifests: oocihm.lac_reel_t15825
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 487 - Heritage/lac_reel_t15829  (ID=oocihm.lac_reel_t15829)
Slugs of Manifests: oocihm.lac_reel_t15829
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 487 - Heritage/lac_reel_t15830  (ID=oocihm.lac_reel_t15830)
Slugs of Manifests: oocihm.lac_reel_t15830
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 487 - Heritage/lac_reel_t15828  (ID=oocihm.lac_reel_t15828)
Slugs of Manifests: oocihm.lac_reel_t15828
tdr@6b67f2c709d7:~$ 

RussellMcOrmond commented 2 years ago

Code merged.

The tool is processing the last two directories:

/home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15833
/home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15831

I would like feedback from @JLoitzenbauer-CRKN on whether this specific batch can be declared done, and whether more batches are coming.

RussellMcOrmond commented 2 years ago

The load is complete. I am re-running the tool, which will verify that the MD5s of the stored OCR data match the files on the filesystem. I'm pretty certain everything is correct, but enhancements were made to the tool while the load was in progress.
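For illustration only, this kind of verification pass could look roughly like the sketch below. It is not the tool's code: `stored_md5` is a hypothetical mapping from file name to the MD5 hex digest recorded for the stored OCR data, and how the real tool obtains those digests is not shown in this thread.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, stored_md5):
    """Return names of files whose on-disk MD5 differs from, or is missing in, stored_md5."""
    mismatches = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and stored_md5.get(path.name) != file_md5(path):
            mismatches.append(path.name)
    return mismatches
```

An empty result means every file in the directory matches its stored digest; any name returned would need to be reloaded or investigated.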

Noticing:

Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 477 - Heritage - completed/lac_reel_t15596  (ID=oocihm.lac_reel_t15596)
No canvases found -- are the images in the directory?
Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 477 - Heritage - completed - verified/lac_reel_t15596  (ID=oocihm.lac_reel_t15596)
Slugs of Manifests: oocihm.lac_reel_t15596

Both directories map to the same AIP ID: one had all the canvases and the other did not. The one that did not can be ignored, and I believe that AIP ID can be declared complete.
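A hedged sketch of how such duplicates could be resolved automatically. The function name and the `(aip_id, directory, matched_canvases)` tuple shape are hypothetical, not part of the tool; a directory that produced "No canvases found" is represented here with a count of 0.

```python
from collections import defaultdict


def resolve_duplicates(scan_results):
    """Pick one directory per AIP ID from (aip_id, directory, matched_canvases) tuples.

    Returns (usable, needs_review): usable maps each ID to the single directory
    that matched canvases; needs_review lists the directories of every ID with
    zero or several matching directories.
    """
    by_id = defaultdict(list)
    for aip_id, directory, matched in scan_results:
        by_id[aip_id].append((directory, matched))

    usable, needs_review = {}, {}
    for aip_id, entries in by_id.items():
        candidates = [directory for directory, matched in entries if matched > 0]
        if len(candidates) == 1:
            usable[aip_id] = candidates[0]
        else:
            needs_review[aip_id] = [directory for directory, _ in entries]
    return usable, needs_review
```

With the two `lac_reel_t15596` directories as input, only the "completed - verified" one would be selected, and the other would be skipped without manual intervention.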

oocihm.lac_reel_t15596: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 477 - Heritage - completed - verified/lac_reel_t15596
oocihm.lac_reel_t15596: Updating 3711 canvases.
oocihm.lac_reel_t15596: Slugs of Manifests: oocihm.lac_reel_t15596

RussellMcOrmond commented 2 years ago

oocihm.lac_reel_t15795: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 475 - heritage - done - verified/lac_reel_t15795
oocihm.lac_reel_t15795: Updating 5567 canvases.
oocihm.lac_reel_t15795: Slugs of Manifests: oocihm.lac_reel_t15795
oocihm.lac_reel_t15796: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 475 - heritage - done - verified/lac_reel_t15796
oocihm.lac_reel_t15796: Updating 2247 canvases.
oocihm.lac_reel_t15796: Slugs of Manifests: oocihm.lac_reel_t15796
oocihm.lac_reel_t15797: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 475 - heritage - done - verified/lac_reel_t15797
oocihm.lac_reel_t15797: Updating 5181 canvases.
oocihm.lac_reel_t15797: Slugs of Manifests: oocihm.lac_reel_t15797
oocihm.lac_reel_t15596: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 477 - Heritage - completed - verified/lac_reel_t15596
oocihm.lac_reel_t15596: Updating 3711 canvases.
oocihm.lac_reel_t15596: Slugs of Manifests: oocihm.lac_reel_t15596
oocihm.lac_reel_t15663: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15663
oocihm.lac_reel_t15663: Updating 5712 canvases.
oocihm.lac_reel_t15663: Slugs of Manifests: oocihm.lac_reel_t15663
oocihm.lac_reel_t15807: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15807
oocihm.lac_reel_t15807: Updating 4920 canvases.
oocihm.lac_reel_t15807: Slugs of Manifests: oocihm.lac_reel_t15807
oocihm.lac_reel_t15808: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15808
oocihm.lac_reel_t15808: Updating 4856 canvases.
oocihm.lac_reel_t15808: Slugs of Manifests: oocihm.lac_reel_t15808
oocihm.lac_reel_t15809: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 480 - Heritage - completed/lac_reel_t15809
oocihm.lac_reel_t15809: Updating 5111 canvases.
oocihm.lac_reel_t15809: Slugs of Manifests: oocihm.lac_reel_t15809
oocihm.lac_reel_t15815: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 482 - Heritage - completed/lac_reel_t15815
oocihm.lac_reel_t15815: Updating 5224 canvases.
oocihm.lac_reel_t15815: Slugs of Manifests: oocihm.lac_reel_t15815
oocihm.lac_reel_t15825: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 486 - Heritage/lac_reel_t15825
oocihm.lac_reel_t15825: Updating 5063 canvases.
oocihm.lac_reel_t15825: Slugs of Manifests: oocihm.lac_reel_t15825
oocihm.lac_reel_t15826: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 486 - Heritage/lac_reel_t15826
oocihm.lac_reel_t15826: Updating 5075 canvases.
oocihm.lac_reel_t15826: Slugs of Manifests: oocihm.lac_reel_t15826
oocihm.lac_reel_t15827: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 486 - Heritage/lac_reel_t15827
oocihm.lac_reel_t15827: Updating 4563 canvases.
oocihm.lac_reel_t15827: Slugs of Manifests: oocihm.lac_reel_t15827
oocihm.lac_reel_t15828: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 487 - Heritage/lac_reel_t15828
oocihm.lac_reel_t15828: Updating 5190 canvases.
oocihm.lac_reel_t15828: Slugs of Manifests: oocihm.lac_reel_t15828
oocihm.lac_reel_t15829: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 487 - Heritage/lac_reel_t15829
oocihm.lac_reel_t15829: Updating 5264 canvases.
oocihm.lac_reel_t15829: Slugs of Manifests: oocihm.lac_reel_t15829
oocihm.lac_reel_t15830: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 487 - Heritage/lac_reel_t15830
oocihm.lac_reel_t15830: Updating 5271 canvases.
oocihm.lac_reel_t15830: Slugs of Manifests: oocihm.lac_reel_t15830
oocihm.lac_reel_t15831: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15831
oocihm.lac_reel_t15831: Updating 5272 canvases.
oocihm.lac_reel_t15831: Slugs of Manifests: oocihm.lac_reel_t15831
oocihm.lac_reel_t15832: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15832
oocihm.lac_reel_t15832: Updating 5197 canvases.
oocihm.lac_reel_t15832: Slugs of Manifests: oocihm.lac_reel_t15832
oocihm.lac_reel_t15833: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 488 - Heritage - completed - verified/lac_reel_t15833
oocihm.lac_reel_t15833: Updating 5182 canvases.
oocihm.lac_reel_t15833: Slugs of Manifests: oocihm.lac_reel_t15833
oocihm.lac_reel_t15837: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 490 - Heritage - completed - verified/lac_reel_t15837
oocihm.lac_reel_t15837: Updating 5089 canvases.
oocihm.lac_reel_t15837: Slugs of Manifests: oocihm.lac_reel_t15837
oocihm.lac_reel_t15838: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 490 - Heritage - completed - verified/lac_reel_t15838
oocihm.lac_reel_t15838: Updating 5066 canvases.
oocihm.lac_reel_t15838: Slugs of Manifests: oocihm.lac_reel_t15838
oocihm.lac_reel_t15839: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 490 - Heritage - completed - verified/lac_reel_t15839
oocihm.lac_reel_t15839: Updating 5078 canvases.
oocihm.lac_reel_t15839: Slugs of Manifests: oocihm.lac_reel_t15839
oocihm.lac_reel_t15867: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15867
oocihm.lac_reel_t15867: Updating 7589 canvases.
oocihm.lac_reel_t15867: Slugs of Manifests: oocihm.lac_reel_t15867
oocihm.lac_reel_t15868: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15868
oocihm.lac_reel_t15868: Updating 7664 canvases.
oocihm.lac_reel_t15868: Slugs of Manifests: oocihm.lac_reel_t15868
oocihm.lac_reel_t15869: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15869
oocihm.lac_reel_t15869: Updating 7658 canvases.
oocihm.lac_reel_t15869: Slugs of Manifests: oocihm.lac_reel_t15869
oocihm.lac_reel_t15870: Processing: /home/tdr/ocr-todo/_Russell_Ingest/Batch 500 - Heritage - completed/lac_reel_t15870
oocihm.lac_reel_t15870: Updating 3981 canvases.
oocihm.lac_reel_t15870: Slugs of Manifests: oocihm.lac_reel_t15870

RussellMcOrmond commented 2 years ago

All working well. The tool will be decommissioned later, once it is known whether all outstanding directories have been completed.