MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
GNU General Public License v3.0
34 stars 11 forks source link

Issue 421 - allow page-level OCR files in MIK input for CSV Books #460

Closed mjordan closed 6 years ago

mjordan commented 6 years ago

Github issue: (#421)

What does this Pull Request do?

Adds the ability to include page-level OCR for book pages in the MIK CSV Book input. Use case here is that the OCR is generated outside Islandora. The CSV Newspapers toolchain already has this capability.

What's new?

Changes to the CSV Books Writer class and CSV Books Input Validator class files that parallel those in the CSV newspaper classes.

How should this be tested?

This change does not include PHPUnit tests, so it needs to be smoke tested. However, PHPUnit tests continue to pass.

To smoke test:

  1. Check out the issue-421 branch.
  2. Unzip the file attached to this PR, which includes test configuration and data.
  3. Adjust the paths in issue-421.ini to suit your system
  4. Test the new functionality by running php mik -c issue-421.ini. The book ingest packages that are created by MIK should have OCR.txt files in all page-level directories.
  5. Test that the input validator works by deleting the output and temp directories and changing the input_directory value to input_directory = "issue-421/files_no_text" in your .ini file and rerunning MIK. Since the input directories do not contain OCR files, MIK will not produce any ingest packages and the input_validator log will indicate that there are missing OCR files.
  6. Test that this PR works when there are not OCR files in the input directories, by deleting the output and temp directories and changing changing log_missing_ocr_files in your .ini file to FALSE and rerunning MIK. The output should be ingest packages that do not contain OCR.txt files. No problems should appear in any of the log files. log_missing_ocr_files has a default value of 'FALSE', so not including it in your .ini file will preserve the same behaviour that existed prior to this PR.

issue-421.zip

Additional Notes

Any additional information that you think would be helpful when reviewing this PR.

Example:

Yes. The section in the CSV Newspapers toolchain wiki page "Including page-level OCR files" can be adapted for the CSV Books wiki page.

No.

Yes, but test data demonstrates that it doesn't.

Interested parties

@bondjimbond @MarcusBarnes

bondjimbond commented 6 years ago

Have not yet tested the main functionality of this PR. I can however report that this does not fix #457 - I'm still getting failures because of the missing output directory. If I create the directory, it's fine.

mjordan commented 6 years ago

OK, thanks. I'd be curious to know if you see the same behavior while testing with the data I include in the .zip.

bondjimbond commented 6 years ago

Since page-02.tif in the first directory is missing its TXT file, it opened up an extra level of testing.

Appears to work as intended. Creating book 1 failed because one TIFF page did not have a companion TXT file. Is this always desired behaviour? What about books where some pages are illustrated and have no text at all?

If setting log_missing_ocr_files = FALSE, the no_text files generate just fine. If you try it on standard files, then you get an error: "Some files in the book directory have invalid extensions."

It looks like there's no option to have books where some pages have OCR and some don't. Can that be addressed?

mjordan commented 6 years ago

@bondjimbond when you say "failed" I assume you mean it didn't pass input validation. If that's the case, can you add:

[FILE_GETTER]
validate_input = false

to you .ini and try again? That option will skip validation and let the books that only has some page OCR files pass. See https://github.com/MarcusBarnes/mik/wiki/Cookbook:-Validate-MIK's-input-files for more info.

bondjimbond commented 6 years ago

Ah, that did the job! Thanks. I'd bet this would be kind of a stumbling block, though; could a note on the use of validate_input = false to allow for a mix of ocr/non-ocr be added to the toolchain's documentation?

bondjimbond commented 6 years ago

Also: This zip had no trouble creating the output directory. I don't know why mine won't. Some difference in our .ini files?

doh_books.ini.zip

mjordan commented 6 years ago

@bondjimbond good idea on the note. Adding input validation (which is a good thing) has introduced a lot of complexity.

WRT the problem with the output directory, I notice that yours has a few levels of subdirectories. Do all but the last already exist? I think that PHP (like Linux/OSX's mkdir command) will not create parent directories. For example with your path, /Volumes/Arca/doh/output, does 'doh' already exist, or are you expecting MIK to create doh/output, or Arca/doh/output? Both of those will fail.

bondjimbond commented 6 years ago

@mjordan

For example with your path, /Volumes/Arca/doh/output, does 'doh' already exist, or are you expecting MIK to create doh/output, or Arca/doh/output?

Only the last directory needs to be created. The rest of the path is already there.

MarcusBarnes commented 6 years ago

@mjordan @bondjimbond A quick thought - double check the directory permissions to make sure that MIK would be able to create the directory or directories?

mjordan commented 6 years ago

@MarcusBarnes good point.

@bondjimbond since #457 is a separate issue, let's follow up on it separately from this PR. If you're happy with the testing of this PR, can we merge it, and then I can updated the documentation as necessary?

mjordan commented 6 years ago

Saweet, thanks. I've got some OCRed books waiting in a queue!