MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

CsvBooks and CsvNewspapers file getters include files from unrelated input directories #466

Closed mjordan closed 6 years ago

mjordan commented 6 years ago

The CsvBooks filegetter will select child (page) files that do not belong to the parent book, because it treats the book's input path as a plain string prefix. Here's the faulty code:

foreach ($this->OBJFilePaths as $path) {
    if (strpos($path, $book_input_path) === 0) {
        $page_paths[] = $path;
    }
}

The following list of "pages" for "letter1" is an example of what happens:

"M:\\input\\tiffs\\input\\new_narratives\\letter1\\P0-03.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter1\\P0-04.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter1\\P0-05.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter10\\P0-53.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter10\\P0-54.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter10\\P0-56.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter11\\P0-57.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter11\\P0-58.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter11\\P0-59.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter11\\P0-60.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter12\\P0-61.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter12\\P0-62.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter12\\P0-63.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter12\\P0-64.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter12\\P0-65.tif",
"M:\\input\\tiffs\\input\\new_narratives\\letter13\\P0-66.tif",

We need a more precise match so that "letter10", "letter11", "letter12", and other directories whose paths merely begin with the desired string do not get into the "pages" list.
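One way to make the match more precise is to require a directory separator immediately after the book's input path, so that "letter1" no longer matches "letter10". The sketch below is only an illustration of that idea and is not necessarily the fix that was merged into MIK; the function name `path_is_in_directory` and the sample variables are hypothetical. It accepts both `/` and `\` as separators so that it behaves the same on Windows-style paths like the ones above.

```php
<?php
// Hypothetical helper (not part of MIK): returns true only if $path lies
// inside $dir, i.e. $dir is followed directly by a directory separator.
function path_is_in_directory(string $path, string $dir): bool
{
    // Drop any trailing separators so "letter1" and "letter1\" behave alike.
    $dir = rtrim($dir, '/\\');
    $len = strlen($dir);

    // The path must start with the directory string...
    if (strncmp($path, $dir, $len) !== 0) {
        return false;
    }

    // ...and the very next character must be a separator. This is what
    // rejects "letter10\P0-53.tif" when $dir ends in "letter1".
    $next = substr($path, $len, 1);
    return $next === '/' || $next === '\\';
}

// Sample data modelled on the listing in this issue.
$book_input_path = 'M:\\input\\tiffs\\input\\new_narratives\\letter1';
$obj_file_paths = [
    'M:\\input\\tiffs\\input\\new_narratives\\letter1\\P0-03.tif',
    'M:\\input\\tiffs\\input\\new_narratives\\letter10\\P0-53.tif',
    'M:\\input\\tiffs\\input\\new_narratives\\letter11\\P0-57.tif',
];

$page_paths = [];
foreach ($obj_file_paths as $path) {
    if (path_is_in_directory($path, $book_input_path)) {
        $page_paths[] = $path;
    }
}
// $page_paths now holds only the letter1 file.
```

Note that, like the original prefix test, this sketch still accepts files in subdirectories of the book directory; comparing `dirname($path)` with the book path would be a stricter alternative if pages are always direct children.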

The CsvNewspapers filegetter is also susceptible to this. CsvCompound appears not to be.

mjordan commented 6 years ago

OK, fix is in place. Now the list of pages for "letter1" contains only the files in the intended input directory:

array(3) {
  [0]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-03.tif"
  [1]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-04.tif"
  [2]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-05.tif"
}

Will open PR.

MarcusBarnes commented 6 years ago

Addressed in pull-request https://github.com/MarcusBarnes/mik/pull/467 (merged with commit https://github.com/MarcusBarnes/mik/commit/ff5788da76ed5fd930a67694a91216258268f787). Thank you @mjordan for improving the path checking so that MIK can handle the situation described in this issue for CSV newspapers and books.