MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
GNU General Public License v3.0
34 stars 11 forks source link

Work on #466. #467

Closed mjordan closed 6 years ago

mjordan commented 6 years ago

Github issue: #466

What does this Pull Request do?

Fixes a bug where files under a path like /foo/bar are the intended "children" of a book or newspaper issue, but files under paths that begin with the intended path, such as /foo/barfly are erroneously included.

What's new?

Instead of relying on a simple substring match such as if (strpos($path, $book_input_path) === 0), the CsvBooks and CsvNewspapers filegetters now use more explicit directory-matching code, if ($current_book_dirname === $this->input_directory . DIRECTORY_SEPARATOR . $book_directory).

How should this be tested?

Running PHPUnit tests shows that this change does not introduce any regression errors, and that the toolchain tests for CsvBooks and CsvNewspapers work as before the change.

An example of the list of pages in a book that illustrates the bug prior to the change is:

array(51) {
  [0]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-03.tif"
  [1]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-04.tif"
  [2]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-05.tif"
 // Everything from here on down should not be in the list, and is included
 // because the directory they are in starts with the string "letter1".
  [3]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter10\P0-53.tif"
  [4]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter10\P0-54.tif"
  [5]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter10\P0-55.tif"
  [6]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter10\P0-56.tif"
  [7]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter11\P0-57.tif"
  [8]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter11\P0-58.tif"
  [9]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter11\P0-59.tif"
  [10]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter11\P0-60.tif"
  [11]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter12\P0-61.tif"
  [12]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter12\P0-62.tif"
  [13]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter12\P0-63.tif"
  [14]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter12\P0-64.tif"
  [15]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter12\P0-65.tif"
  [16]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter13\P0-66.tif"
  [17]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter13\P0-67.tif"
  [18]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter13\P0-68.tif"
  [19]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter13\P0-69.tif"
  [20]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter13\P0-70.tif"
  [21]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-71.tif"
  [22]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-72.tif"
  [23]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-73.tif"
  [24]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-74.tif"
  [25]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-75.tif"
  [26]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-76.tif"
  [27]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-77.tif"
  [28]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-78.tif"
  [29]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-79.tif"
  [30]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter14\P0-80.tif"
  [31]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter15\P0-80.tif"
  [32]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter15\P0-81.tif"
  [33]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter15\P0-82.tif"
  [34]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter15\P0-83.tif"
  [35]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter15\P0-84.tif"
  [36]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter16\P0-85.tif"
  [37]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter16\P0-86.tif"
  [38]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter16\P0-87.tif"
  [39]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter16\P0-88.tif"
  [40]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter17\P0-89.tif"
  [41]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter17\P0-90.tif"
  [42]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter17\P0-91.tif"
  [43]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter17\P0-92.tif"
  [44]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter18\P0-92.tif"
  [45]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter18\P0-93.tif"
  [46]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter18\P0-94.tif"
  [47]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter19\P0-95.tif"
  [48]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter19\P0-96.tif"
  [49]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter19\P0-97.tif"
  [50]=>
  string(54) "M:\input\tiffs\input\new_narratives\letter19\P0-98.tif"
}

After the fix, the list of pages for the book only contains the files in the "letter1" directory:

array(3) {
  [0]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-03.tif"
  [1]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-04.tif"
  [2]=>
  string(53) "M:\input\tiffs\input\new_narratives\letter1\P0-05.tif"
}

Providing sample data to capture this bug, across two toolchains, is complex. I hope the PHPUnit tests and the var_dump()ed output showing the list of pages detected for the "letter1" object are sufficient evidence that the fix works.

Interested parties

@MarcusBarnes @bondjimbond

mjordan commented 6 years ago

Bump... as this is a genuine bug I'd like to have it merged. I can provide test data if you feel it's necessary, but I'd also be happy to provide extra PHPUnit tests specifically for this bug.

mjordan commented 6 years ago

I've provided a new PHPUnit test for this issue, testWritePackagesIssue466(), and it passes.

WRT newspapers, I can't think of a situation where this problem would occur since "each issue's page files must be in a directory named with the date of the issue in yyyy-mm-dd format." Therefore, even though I modified the CsvNewspaper filegetter to include the fix, I probably didn't need to. The existing toolchain tests for newspapers passes with the fix.

bondjimbond commented 6 years ago

Hi Mark. I'm not really familiar with PHPUnit testing, so I can't really comment on that. If @MarcusBarnes is ready to merge based on the above, then great. If you have sample data that I can use to test, then I'd be happy to do so.

MarcusBarnes commented 6 years ago

@mjordan @bondjimbond I've just reviewed the changes and it looks good. Give me a bit and I'll run the unit tests, and if it still looks good, I'll merge the changes.

mjordan commented 6 years ago

Thanks guys. I did add test data for the new PHPUnit test (for books only), but as I say, I don't really think anyone will encounter this problem in the wild with newspapers if they have well-formed input (yyyy-mm-dd issue folders). Running the input validation will detect those.

With the CSV toolchains, we should really have PHPUnit tests for everthing, since all the necessary input data can be included in our repo. For Cdm and OAI, and whatever else we come up with that consumes external data, we are a bit more constrained.

mjordan commented 6 years ago

@MarcusBarnes I'd like to push one more commit, to improve the clarity of my new test. At https://github.com/MarcusBarnes/mik/blob/issue-466/tests/toolchain/CsvBookToolchainTest.php#L242 I use the variable $pages but its value doesn't hold the number of pages, it holds the number of things in the book directory, which include a MODS.xml file for the book and . and ... This inconsistency could cause someone (like me, or anyone else) in the future undue investigation while debugging this test or using it as the basis for another one.

MarcusBarnes commented 6 years ago

@mjordan Please push the modification and I'll review today. Thanks!

MarcusBarnes commented 6 years ago

@mjordan @bondjimbond I've reviewed and run the unit tests. Let's go ahead and merge!

mjordan commented 6 years ago

Thanks!