MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Issue 383 #387

Closed mjordan closed 7 years ago

mjordan commented 7 years ago

GitHub issues: #383, #316

What does this Pull Request do?

Makes MIK optionally copy OCR (.txt) files into page-level directories in the CSV Newspapers toolchain.

This PR also includes some cleanup of the CSV newspaper writer class as described in #316.

What's new?

If the input directory for jobs using the MIK CSV Newspapers toolchain contains .txt files corresponding to the page master images, like this:

1910-01-01
  page-01.tif
  page-01.txt
  page-02.tif
  page-02.txt

the .txt files will be copied into the newspaper page-level Islandora ingest packages, like this:

├── TT0002
│   ├── 1
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   ├── 2
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   ├── 3
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   ├── 4
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   └── MODS.xml
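
As a rough illustration of the mapping from input files to page-level packages, here is a minimal Python sketch. It is not MIK's PHP implementation: the function name `copy_page_ocr` is hypothetical, and the rule that page numbers follow the sorted order of the .tif filenames is an assumption made for the example.

```python
import shutil
from pathlib import Path


def copy_page_ocr(issue_dir: Path, package_dir: Path) -> list:
    """Copy each page image and its .txt file into <package_dir>/<page number>/.

    Returns the names of page images that have no matching .txt file.
    Hypothetical helper, not part of MIK.
    """
    missing = []
    # Assumption: sorted order gives page-01.tif -> page 1, page-02.tif -> page 2.
    for page_number, tif in enumerate(sorted(issue_dir.glob("*.tif")), start=1):
        page_dir = package_dir / str(page_number)
        page_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(tif, page_dir / "OBJ.tif")
        # The OCR file shares the image's base name: page-01.tif -> page-01.txt.
        ocr = tif.with_suffix(".txt")
        if ocr.is_file():
            shutil.copyfile(ocr, page_dir / "OCR.txt")
        else:
            missing.append(tif.name)
    return missing
```

Calling `copy_page_ocr(Path("1910-01-01"), Path("output/TT0002"))` would produce `1/OBJ.tif`, `1/OCR.txt`, and so on, and return the images whose .txt file was not found.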

Because this feature introduces a new optional entry in the [WRITER]datastreams[] list, we need to provide a new configuration option to indicate that MIK should log missing OCR files when the datastreams list is empty:

[WRITER]
; Default is FALSE
log_missing_ocr_files = TRUE

In addition, when [WRITER]log_missing_ocr_files is TRUE, the CSV Newspapers input validator checks for the existence of the .txt files and, if any are not found, logs an input validation error.
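
The kind of check the input validator performs can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not MIK's PHP validator: the function name `find_missing_ocr` and its return shape are hypothetical, and it assumes each issue is a subdirectory of the input directory containing .tif page images.

```python
from pathlib import Path


def find_missing_ocr(input_dir: Path, log_missing_ocr_files: bool = False) -> dict:
    """Map each issue directory name to the page images that lack a .txt file.

    Hypothetical helper, not part of MIK.
    """
    problems = {}
    if not log_missing_ocr_files:
        # The check is skipped unless enabled, mirroring the option's FALSE default.
        return problems
    for issue_dir in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        missing = [
            tif.name
            for tif in sorted(issue_dir.glob("*.tif"))
            if not tif.with_suffix(".txt").is_file()
        ]
        if missing:
            problems[issue_dir.name] = missing
    return problems
```

An empty result means every page image has a matching .txt file; any non-empty entry corresponds to an issue that would be reported as an input validation error.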

This PR includes PHPUnit tests for both the CSV Newspapers toolchain and for the CSV Newspapers input validator.

How should this be tested?

Run PHPUnit tests:

phpunit --exclude-group inputvalidators --bootstrap vendor/autoload.php tests

should result in "(46 tests, 66 assertions)"

phpunit --group inputvalidators --bootstrap vendor/autoload.php tests 

should result in "(4 tests, 17 assertions)"

Smoketest:

Using configuration and data files in the attached .zip, do the following:

  1. Check out the issue-383 branch
  2. Assuming you have unzipped the file within your mik directory, run ./mik -c issue-383/issue-383.ini -cc all and then, if there are no problems, ./mik -c issue-383/issue-383.ini

Your output directory should look like this:

issue_383_output/
├── input_validator.log
├── mik.log
├── problem_records.log
├── TT0002
│   ├── 1
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   ├── 2
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   ├── 3
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   ├── 4
│   │   ├── MODS.xml
│   │   ├── OBJ.tif
│   │   └── OCR.txt
│   └── MODS.xml
└── TT0003
    ├── 1
    │   ├── MODS.xml
    │   ├── OBJ.tif
    │   └── OCR.txt
    ├── 2
    │   ├── MODS.xml
    │   ├── OBJ.tif
    │   └── OCR.txt
    ├── 3
    │   ├── MODS.xml
    │   ├── OBJ.tif
    │   └── OCR.txt
    ├── 4
    │   ├── MODS.xml
    │   ├── OBJ.tif
    │   └── OCR.txt
    └── MODS.xml

MIK generated only two packages because one (TT001) has a missing .txt file. You can verify this by looking at the input validator and problem records log files.

To test that this new feature has no side effects in jobs that do not include page-level .txt files, change the input directory to "issue-383/files_no_text" and comment out the [WRITER]log_missing_ocr_files config option.

issue-383.zip

MarcusBarnes commented 7 years ago

PHPUnit tests passed as described.

MarcusBarnes commented 7 years ago

The smoke test worked as expected. Beautiful work. Thank you @mjordan.

mjordan commented 7 years ago

Awesome, thanks for testing @MarcusBarnes. I'll update the CSV Newspaper toolchain wiki page.

mjordan commented 7 years ago

Can I close #316?