MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0
Ignore/skip Thumbs.db files in CsvBooks toolchain #449

Status: Open. Opened by bondjimbond 6 years ago

bondjimbond commented 6 years ago

When packages are created in Windows, you get a lot of directories with unwanted files called Thumbs.db. MIK does not like these; you end up with errors like this:

[2018-02-07 21:31:03] input validator.ERROR: Input validation failed {"record ID":"16","book object directory":"/Volumes/Arca/DOH_FILES/arms_cheerio_pt1/ARMS_04_1943","error":"Book object input directory contains unwanted files"} []

In a large package of directories it's not easy to delete these all manually. It would be nice if MIK could skip them.

bondjimbond commented 6 years ago

Note that on a Mac at least, it actually is easy enough to remove all these files with a command:

find /path/to/tree -name 'Thumbs.db' -delete

But this requires that you know they're there, that you know they're the problem, and that you know how to do it. Since this is likely going to be a very common problem with any image set created in Windows, it's best if MIK knows how to deal with it.
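For anyone without `find` available (for example on Windows), the same search can be approximated in PHP. This is only a minimal sketch: `findUnwantedFiles()` is a hypothetical name, not an existing MIK function or the project's helper script, and it only lists matches so they can be reviewed before anything is deleted.

```php
<?php
// Hypothetical helper (not part of MIK): recursively list every file under
// $root whose base name appears in $unwanted.
function findUnwantedFiles(string $root, array $unwanted = ['Thumbs.db']): array
{
    $matches = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // $file is an SplFileInfo; compare its base name exactly.
        if ($file->isFile() && in_array($file->getFilename(), $unwanted, true)) {
            $matches[] = $file->getPathname();
        }
    }
    sort($matches); // deterministic order for review
    return $matches;
}
```

Passing each returned path to `unlink()` would then mirror the `-delete` flag of the `find` command above.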

MarcusBarnes commented 6 years ago

There is a helper script for removing unwanted files such as Thumbs.db from an input directory: https://github.com/MarcusBarnes/mik/blob/master/extras/scripts/remove_files.php I haven't had a chance to test it on Windows.

mjordan commented 6 years ago

Agreed. There's a cookbook entry for dealing with this, and the iipqa can detect them, but MIK should skip them. There are equivalent unwanted files on Macs, so we should include those as well:

.Thumbs.db, Thumbs.db, .DS_Store, DS_Store

We would only need to add this list to the base filegetter class, I think, and then reference the list in each filegetter. Thumbs.db can appear in any directory that contains image or video (maybe other) files, and .DS_Store can appear anywhere, I think, so this applies not just to Books.
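The idea of a shared list on the base class might look roughly like the sketch below. The class names `SketchFileGetter` and `SketchBookFileGetter` and their methods are hypothetical stand-ins, not MIK's actual filegetter classes, and the exact-match comparison is an assumption.

```php
<?php
// Hypothetical stand-in for a base filegetter; MIK's real class is not shown
// in this thread.
abstract class SketchFileGetter
{
    /** @var string[] File names proposed in this issue as always skippable. */
    protected $unwantedFiles = ['.Thumbs.db', 'Thumbs.db', '.DS_Store', 'DS_Store'];

    /** Return $paths without any entry whose base name is in the skip list. */
    protected function removeUnwantedFiles(array $paths): array
    {
        $keep = array_filter($paths, function ($path) {
            return !in_array(basename($path), $this->unwantedFiles, true);
        });
        return array_values($keep); // reindex from 0
    }
}

// Hypothetical concrete filegetter that references the shared list.
class SketchBookFileGetter extends SketchFileGetter
{
    /** List the files directly inside $dir, minus the unwanted ones. */
    public function getFiles(string $dir): array
    {
        $paths = [];
        foreach (scandir($dir) as $entry) {
            $path = $dir . DIRECTORY_SEPARATOR . $entry;
            if (is_file($path)) { // also drops "." and ".."
                $paths[] = $path;
            }
        }
        return $this->removeUnwantedFiles($paths);
    }
}
```

Keeping the list in one place means other filegetters (or validators) could call the same method instead of repeating the names.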

mjordan commented 6 years ago

The REST Ingester skips these: https://github.com/mjordan/islandora_rest_ingester/blob/master/includes/Ingester.php#L30

bondjimbond commented 6 years ago

MIK does already appear to skip .DS_Store automatically; I've always had those in my packages without problems.

mjordan commented 6 years ago

We should test whether they show up in Windows; e.g., the files are written by OS X, but if you run MIK on Windows using that input, they might show up. I might be mistaken. Wouldn't hurt to have a list and just get the filegetters (or any other component that needs to) to skip every file in that list.

bondjimbond commented 6 years ago

> Wouldn't hurt to have a list and just get the filegetters (or any other component that needs to) to skip every file in that list.

Agreed. I don't think we need to bother testing what Windows does, in that case; just tell it to skip that list of garbage files every time.

mjordan commented 6 years ago

We should be able to write PHPUnit tests for this feature pretty easily.