MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Allow inclusion of OCR files in newspaper issue directories #383

Closed (mjordan closed this issue 7 years ago)

mjordan commented 7 years ago

Currently, input for the CSV Newspaper toolchain only allows master page image files:

├── 1900-01-01
│   ├── 1900-01-01-01.tif
│   ├── 1900-01-01-02.tif
│   └── 1900-01-01-03.tif
└── 1900-01-02
    ├── 1900-01-02-01.tif
    ├── 1900-01-02-02.tif
    └── 1900-01-02-03.tif

We should allow the inclusion of OCR .txt files to accommodate users who prefer to generate OCR outside of Islandora:

├── 1900-01-01
│   ├── 1900-01-01-01.tif
│   ├── 1900-01-01-01.txt
│   ├── 1900-01-01-02.tif
│   ├── 1900-01-01-02.txt
│   ├── 1900-01-01-03.tif
│   └── 1900-01-01-03.txt
└── 1900-01-02
    ├── 1900-01-02-01.tif
    ├── 1900-01-02-01.txt
    ├── 1900-01-02-02.tif
    ├── 1900-01-02-02.txt
    ├── 1900-01-02-03.tif
    └── 1900-01-02-03.txt

The *.txt files will be copied into the ingest packages as OCR.txt datastream files.
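
To make the proposal concrete, here is a rough sketch of the pairing and copying step. It is written in Python for illustration only and is not MIK's PHP code. The function names `pair_pages_with_ocr` and `write_page_packages`, the one-directory-per-page output layout, and the `OBJ` file name for the page image are all assumptions made for this example. The only behaviour taken from the description above is that a `.txt` file sharing a page image's base name ends up in the package as `OCR.txt`.

```python
import shutil
from pathlib import Path


def pair_pages_with_ocr(issue_dir):
    """Return a list of (tif_path, txt_path_or_None), one entry per page image."""
    pairs = []
    for tif in sorted(Path(issue_dir).glob("*.tif")):
        # The OCR file is assumed to share the image's base name.
        txt = tif.with_suffix(".txt")
        pairs.append((tif, txt if txt.is_file() else None))
    return pairs


def write_page_packages(issue_dir, output_dir):
    """Copy each page into its own directory (hypothetical layout).

    The image is copied as OBJ.tif (an assumed name) and the OCR file,
    when one exists, is copied as OCR.txt.
    """
    issue_dir = Path(issue_dir)
    pairs = pair_pages_with_ocr(issue_dir)
    for page_number, (tif, txt) in enumerate(pairs, start=1):
        page_dir = Path(output_dir) / issue_dir.name / str(page_number)
        page_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(tif, page_dir / ("OBJ" + tif.suffix))
        if txt is not None:
            shutil.copy(txt, page_dir / "OCR.txt")
```

Pages without a matching `.txt` file are still packaged, just without an `OCR.txt`, so issue directories that contain only images would keep working in this sketch.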

MarcusBarnes commented 7 years ago

Addressed in pull request https://github.com/MarcusBarnes/mik/pull/387 (merged with commit https://github.com/MarcusBarnes/mik/commit/0d64935d9d51df4220a60fc6fd8a5552b24758fb).