MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Issue 421 #459

Closed mjordan closed 6 years ago

mjordan commented 6 years ago

GitHub issue: #421

What does this Pull Request do?

Adds the ability to include page-level OCR for book pages in the MIK CSV Books input. The use case is OCR that is generated outside Islandora. The CSV Newspapers toolchain already has this capability.

What's new?

Changes to the CSV Books Writer and CSV Books Input Validator class files that parallel those in the CSV Newspapers classes.

How should this be tested?

This change does not include PHPUnit tests, so it needs to be smoke tested. To test:

  1. Check out the issue-421 branch.
  2. Unzip the file attached to this PR, which includes test configuration and data.
  3. Adjust the paths in issue-421.ini to suit your system.
  4. Test the new functionality by running php mik -c issue-421.ini. The book ingest packages that are created by MIK should have OCR.txt files in all page-level directories.
  5. Test that the input validator works by deleting the output and temp directories, setting input_directory = "issue-421/files_no_text" in your .ini file, and rerunning MIK. Since these input directories do not contain OCR files, MIK will not produce any ingest packages and the input_validator log will indicate that there are missing OCR files.
  6. Test that this PR works when there are no OCR files in the input directories by deleting the output and temp directories, changing log_missing_ocr_files in your .ini file to FALSE, and rerunning MIK. The output should be ingest packages that do not contain OCR.txt files. No problems should appear in any of the log files. log_missing_ocr_files has a default value of 'FALSE', so not including it in your .ini file will preserve the same behaviour that existed prior to this PR.
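
To make the checks in steps 4 and 6 quicker, something like the shell function below could list page-level directories that lack an OCR.txt file. This is only a sketch and is not part of MIK: the function name pages_missing_ocr is hypothetical, and it assumes the book ingest packages are laid out as <output directory>/<book>/<page>/, which may need adjusting for your output.

```shell
# Hypothetical helper, not part of MIK.
# Prints every page-level directory under the given output directory
# that does not contain an OCR.txt file.
# ASSUMPTION: packages are laid out as <output_dir>/<book>/<page>/.
pages_missing_ocr() {
    output_dir="$1"
    for page_dir in "$output_dir"/*/*/; do
        # If the glob matched nothing, the literal pattern is not a directory.
        [ -d "$page_dir" ] || continue
        # Drop the trailing slash left by the glob.
        page_dir="${page_dir%/}"
        [ -f "$page_dir/OCR.txt" ] || printf '%s\n' "$page_dir"
    done
}

# Usage (the path is a placeholder for your configured output directory):
#   pages_missing_ocr /path/to/output
```

Under those assumptions, the function should print nothing after step 4 and should list every page directory after step 6.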

issue-421.zip

Additional Notes

Yes. The section in the CSV Newspapers toolchain wiki page "Including page-level OCR files" can be adapted for the CSV Books wiki page.

No.

Yes, but test data demonstrates that it doesn't.

Interested parties

@bondjimbond @MarcusBarnes

mjordan commented 6 years ago

PHPUnit tests are failing. I'll look into it.

mjordan commented 6 years ago

Closing this PR until I can figure out what's going on with the PHPUnit tests.

bondjimbond commented 6 years ago

I was just about to test this one today, based on your comments in #457. Will wait, then.

mjordan commented 6 years ago

I seem to have fixed the issue; will reopen in a little while.