MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
GNU General Public License v3.0
34 stars 11 forks source link

CsvBooks doesn't like periods in filenames #498

Open bondjimbond opened 5 years ago

bondjimbond commented 5 years ago

I got a bunch of files with names like this: 1987.0019.0039.0001.TIF

I set page_sequence_separator = .

Result:

[2019-02-13 18:44:34] input validator.ERROR: Input validation failed {"record ID":"1","book object directory":"/Volumes/Barkerville/RG136/books/1","error":"Some files in the book object directory have invalid extensions"} []

If I change the filename: 1987.0019.0039_0001.tif and set page_sequence_separator = _, it works.

Hypothesis: using a period as a page sequence separator doesn't work because MIK reads ".tif" as the page rather than the extension.

Alternate hypothesis: MIK is reading .0019.0039.0001.tif as a file extension rather than as part of a filename followed by a page number followed by extension.

Is it possible to fix this?

mjordan commented 5 years ago

I think this is more to do with using TIF as an extension that using a period. The validation error is coming from https://github.com/MarcusBarnes/mik/blob/master/src/inputvalidators/CsvBooks.php#L140. Coincidentally I opened a PR (#496) two weeks ago that will let you indicate TIF as a valid file extension.

Just to be sure, can you change the filenames back to using periods as separators but leave the (currently invalid until we merge #496) TIF extension as is and see what happens?

mjordan commented 5 years ago

Sorry, my logic is wrong. Change the extension to tif and try using periods as separators.

bondjimbond commented 5 years ago

OK, a couple of tests:

Separator _, extension tif = success Separator _, extension TIF = fail Separator ., extension TIF = fail Separator ., extension tif = success

So it looks like the uppercase/lowercase extension is the culprit.

But interestingly, it behaves very unpredictably in my set -- because the first record is not .0001.tif but instead .0000.tif.

The presence of this file results in weird behaviours. Typically directory 0 just isn't generated, but it can also cause MIK to skip, say, directory 2, or more directories.

mjordan commented 5 years ago

@bondjimbond when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?

bondjimbond commented 5 years ago

when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?

Yes... I was running with a test directory of just four images (0000.tif through 0003.tif).

The first few runs, it produced directories 1, 2, and 3.

Later runs, just 1 and 3.

Another later run, just 3.

After removing 0000.tif, it consistently produced 1, 2, and 3.

mjordan commented 5 years ago

Could you zip up all your data and config files and send them to me so I can try to replicate that?

bondjimbond commented 5 years ago

Sure, here's my test directory and ini file: https://vault.sfu.ca/index.php/s/ChGaez7NLOygY3w

mjordan commented 5 years ago

OK, got it, I'll give it a try this evening.

mjordan commented 5 years ago

Can you send me your mappings file?

bondjimbond commented 5 years ago

Ack, of course, sorry barkerville_mapping.txt

mjordan commented 5 years ago

@bondjimbond Strangely, I can't replicate the behavior you are seeing. I ran MIK about 10 times and always got the same thing: page objects for pages 1-3 and an error indicating a problem with the 0000 file ([...] added by me):

"message":"mkdir(): File exists" [...] "filename_segments":["1987","0019","0039","0000"],"page_number":""

The problem is coming from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvBooks.php#L132-L134: since we trim all left padding 0s, we need something other than a 0000 as the page number. I'm not sure a fix to allow 0000 as a page number would be trivial.

mjordan commented 5 years ago

Although a check to see if $page_number is an empty string, and if it is, assign it a value of 0 to create a 0 directory, might be a simple fix. But do you want 0 to be the first page number instead of 1?

Something like:

$page_number = ltrim(end($filename_segments), '0');
if (strlen($page_number) === 0) {
  $page_number = '0';
}
$page_level_output_dir = $book_level_output_dir . DIRECTORY_SEPARATOR . $page_number;
mkdir($page_level_output_dir);
mjordan commented 5 years ago

Just tried that, it worked:

/tmp/brandon_books/
└── 1987
    ├── 0
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 1
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 2
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 3
    │   ├── MODS.xml
    │   └── OBJ.tif
    └── MODS.xml

MODS.xml for page 0 is:

  <titleInfo>
    <title>This is a title, page 0</title>
  </titleInfo>
</mods>

MODS.xml for page 1 is:

  <titleInfo>
    <title>This is a title, page 1</title>
  </titleInfo>
</mods>
bondjimbond commented 5 years ago

That's exactly what I need! :)

mjordan commented 5 years ago

OK, I can open a PR for this if you want.

bondjimbond commented 5 years ago

Please do!

mjordan commented 5 years ago

I've made the same change to the CsvNewspapers writer and pushed up the issue-498 branch. I'll need to assemble some test data later but once I do that I'll open a PR.