MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Mysterious failures in the CSV Newspaper toolchain #519

Open bondjimbond opened 1 year ago

bondjimbond commented 1 year ago

I'm hitting errors that I can't figure out with some objects I'm processing.

Five of the Newspaper Issues in this set error out: record numbers 34, 37, 71, 72, and 76.

Their metadata is no different from the pages that process without issue: [screenshot: Screen Shot 2022-07-19 at 11 17 34 AM]

No unusual characters that aren't present in other records.

And the file names all seem to be correct and appropriate: [screenshot: Screen Shot 2022-07-19 at 11 20 56 AM]

For the issues with errors, the entire issue fails to appear.

The log doesn't tell me much. Some of the "problem records" don't seem to actually show errors in the log at all (e.g. records 34 and 37). This bit for record 71 shows problems, but it's hard to understand. I do see some strange things here:

"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB

I checked, and we do have hidden files called ._2002-11-01-003.tif etc. This could be part of the problem, except that such files do not appear in the other directories. The other directories seem to be OK.

Is there anything obvious I'm missing?

[2022-07-19 15:07:50] ErrorException.ERROR: ErrorException {"message":"mkdir(): File exists","code":{"metadata":"<?xml version=\"1.0\"?>\n<mods xmlns=\"http://www.loc.gov/mods/v3\" xmlns:mods=\"http://www.loc.gov/mods/v3\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n  <titleInfo>\n      <title>\n\tLE BASTION DE NANAIMO, 2002\n  </title>\n    \n  </titleInfo>\n  <typeOfResource>text</typeOfResource>\n  <language>\n    <languageTerm type=\"text\">French</languageTerm>\n  </language>\n  <physicalDescription>\n\t\t\t<extent>nov.-d&#xE9;c. ; pp. 12</extent>\n\t\t\t\t</physicalDescription>\n  <location>\n\t\t\t</location>\n  <originInfo>\n    \t<dateIssued keyDate=\"yes\">2002-11-01</dateIssued>\n    \t<publisher>Association des Francophones de Nanaimo</publisher>\n    \t</originInfo>\n  <genre authority=\"marcgt\">newspaper</genre>\n  <subject>\n  <hierarchicalGeographic>\n                        </hierarchicalGeographic>\n</subject>\n  <part>\n  </part>\n</mods>\n\n","pages":["/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-004.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-005.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-006.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB 
Newspapers/shfcb_37/2002-11-01/2002-11-01-007.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-008.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-009.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-010.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-011.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-012.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-013.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-014.tif"],"record_id":"71","no_datastreams_setting_flag":true,"file_name_field":"Directory","record":"[object] (stdClass: {\"key\":\"71\",\"Directory\":\"2002-11-01\",\"IssueTitle\":\"LE BASTION DE NANAIMO, 2002\",\"Type\":\"text\",\"Genre\":\"newspaper\",\"Date_Issued\":\"2002-11-01\",\"Language\":\"French\",\"localIdentifier\":\"\",\"Publisher_Place\":\"\",\"Publisher\":\"Association des Francophones de Nanaimo\",\"PhysicalLocation\":\"\",\"Extent\":\"nov.-déc. ; pp. 
12\",\"note\":\"\",\"rightsstatement\":\"\"})","issue_level_output_dir":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71","issue_level_input_dir":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01","MODS_expected":false,"metadata_file_path":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/MODS.xml","page_path":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","pathinfo":{"dirname":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01","basename":"._2002-11-01-001.tif","extension":"tif","filename":"._2002-11-01-001"},"filename_segments":["._2002","11","01","001"],"page_number":"1","page_level_output_dir":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/1","OBJ_expected":false,"extension":"tif","page_output_path":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/1/OBJ.tif","OCR_expected":false,"ocr_input_path":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.txt","ocr_output_path":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/1/OCR.txt"},"severity":2,"file":"/Users/brandon/sfuvault/mik/src/writers/CsvNewspapers.php","line":134} []
[2022-07-19 15:07:50] ErrorException.ERROR: ErrorException {"message":"problem writing package","record_key":"71","details":"[object] (mik\\exceptions\\MikErrorException(code: 0):  at /Users/brandon/sfuvault/mik/mik:105)"} []
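
One possible reading of the log entry above: the hidden file ._2002-11-01-001.tif yields the same "page_number" ("1") as the real 2002-11-01-001.tif, so the page-level output directory already exists when mkdir() is called a second time. The sketch below only illustrates that collision. The helper page_number_from_path() and the /input paths are hypothetical; the logic is inferred from the logged "filename_segments" and "page_number" values and is not MIK's actual code.

```php
<?php
// Hypothetical helper (an assumption, not taken from MIK): split the file
// name on "-" and use the last segment, minus leading zeros, as the page
// number. This mirrors the "filename_segments" and "page_number" values
// that appear in the log.
function page_number_from_path(string $path): string
{
    $filename = pathinfo($path, PATHINFO_FILENAME);
    $segments = explode('-', $filename);
    return ltrim(end($segments), '0');
}

// Illustrative paths only; the directory is made up.
$pages = [
    '/input/2002-11-01/2002-11-01-001.tif',
    '/input/2002-11-01/._2002-11-01-001.tif',
];

// Report any two files that would be written to the same page directory.
$seen = [];
foreach ($pages as $page) {
    $number = page_number_from_path($page);
    if (isset($seen[$number])) {
        echo "Collision: {$page} maps to page {$number}, already used by {$seen[$number]}\n";
        continue;
    }
    $seen[$number] = $page;
}
```

If this reading is right, records whose directories contain no ._ files would not hit the error, which matches the observation that the other directories seem to be OK.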

Am I missing anything obvious that might account for the problem?

problem_records.log mik.log shfcb_37.csv bchdp_news.txt

bondjimbond commented 1 year ago

I have traced these failures to several different factors:

We should of course be getting clean data from digitizers, but that clearly can't be relied upon. Can the toolchain be modified to avoid these things? For example, it shouldn't get tripped up by uppercase vs lowercase file extensions... we could account for that. And the hidden files should be avoidable, because it's only looking for filenames that match the directory name, right?

mjordan commented 1 year ago

Thumbs.db is skipped in some toolchains:

mark@user-ThinkPad-X1-Carbon-6th:/tmp/mik$ grep -ri thumbs *
extras/scripts/check_files.php:    // belong (I'm looking at you thumbs.db). Get a list of all files in the
extras/scripts/remove_files.php:    'Thumbs.db',
extras/scripts/shutdownhooks/create_structure_files.php:  $exclude_array = array('..', '.DS_Store', 'Thumbs.db', '.');
extras/scripts/shutdownhooks/create_structure_files.php:  $exclude_array = array('..', '.DS_Store', 'Thumbs.db', '.');
src/filegetters/CsvCompound.php:            if (preg_match('/thumbs\.db/i', $pathinfo['basename'])) {
src/inputvalidators/CsvBooks.php:            'Thumbs.db',
src/inputvalidators/CsvBooks.php:            '.Thumbs.db',
src/inputvalidators/CsvBooks.php:            // Book directory cannot contain Thumbs.db, etc.
src/inputvalidators/CsvCompound.php:            'Thumbs.db',
src/inputvalidators/CsvCompound.php:            '.Thumbs.db',
src/inputvalidators/CsvCompound.php:            // Compound directory cannot contain Thumbs.db, etc. The CsvCompound
src/inputvalidators/CsvCompound.php:            // because of the presence of Thumbs.db, etc.

Maybe try to run the remove_files.php script to get rid of them? I think the case of the extension doesn't matter, but I haven't checked the code to confirm that. And yes, we can modify the toolchain to skip them.
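
As a rough idea of what skipping these files could look like, here is a minimal sketch. The functions is_os_metadata_file() and filter_page_paths() are hypothetical names and do not exist in MIK; the file names being matched (Thumbs.db, .Thumbs.db, .DS_Store, and the ._ prefix) are the ones mentioned in this thread.

```php
<?php
// Hypothetical filter, not part of MIK: returns true for files that the
// operating system leaves behind and that are not real page images.
function is_os_metadata_file(string $path): bool
{
    $basename = basename($path);
    // macOS AppleDouble files start with "._"; Thumbs.db is matched
    // case-insensitively, with or without a leading dot.
    return strpos($basename, '._') === 0
        || $basename === '.DS_Store'
        || preg_match('/^\.?thumbs\.db$/i', $basename) === 1;
}

// Hypothetical wrapper: keep only the paths that are not OS metadata,
// reindexing the result so page order is preserved.
function filter_page_paths(array $paths): array
{
    return array_values(array_filter($paths, function ($path) {
        return !is_os_metadata_file($path);
    }));
}
```

A filter like this would presumably be applied to the list of page paths before any page directories are created, so the hidden files never reach the writer.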

bondjimbond commented 1 year ago

I've been going through my massive batch of files to process. MIK no longer shows any errors in the log now that I've removed all those hidden files, but there is still a set of records that isn't processing and that shows up in the problem_records file.

The only difference between these files and the others is the uppercase .TIF extension.

After changing the extension from uppercase to lowercase, it processes correctly.

mjordan commented 1 year ago

:+1: then let's make the file extension case irrelevant!

bondjimbond commented 1 year ago

Do you know offhand which file needs to be edited for this? I can give it a shot.

MarcusBarnes commented 1 year ago

@bondjimbond Here's where the tiff file extensions for newspapers are defined: https://github.com/MarcusBarnes/mik/blob/dd26fa8ff7892f1cca61c6e66e0fc90235fb23b1/src/filegetters/CdmNewspapers.php#L49

Probably the best approach is to apply strtolower to the file extension wherever $allowed_file_extensions_for_OBJ is used in a comparison. That would also handle other inconsistent spellings of the extension, like .Tif, .tiFF, etc.
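
A minimal sketch of that comparison, assuming the allowed list holds lowercase 'tif' and 'tiff' (check the linked file for the real values). The function has_allowed_extension() is a hypothetical name, not something that exists in MIK.

```php
<?php
// Assumed contents of the allowed list; the real values are defined in
// the file getter linked above.
$allowed_file_extensions_for_OBJ = ['tif', 'tiff'];

// Hypothetical helper: case-insensitive check of a path's extension
// against the allowed list.
function has_allowed_extension(string $path, array $allowed): bool
{
    // Lowercase both sides so .TIF, .Tif, and .tiFF all match.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($extension, array_map('strtolower', $allowed), true);
}
```

Lowercasing the allowed list as well as the file's extension means the check keeps working even if someone later adds an uppercase entry to the list.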

MarcusBarnes commented 1 year ago

The quicker fix is just to add TIFF and TIF to $allowed_file_extensions_for_OBJ.
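
Roughly, and assuming the list currently holds only the lowercase spellings (an assumption, check the linked file), the quick fix would look like this:

```php
<?php
// Assumed original entries 'tiff' and 'tif', plus the two uppercase
// spellings added by the quick fix.
$allowed_file_extensions_for_OBJ = ['tiff', 'tif', 'TIFF', 'TIF'];

// An exact-match lookup now accepts the all-uppercase form...
$uppercase_ok = in_array('TIF', $allowed_file_extensions_for_OBJ, true);

// ...but mixed-case spellings are still rejected, which is the trade-off
// compared with the strtolower approach.
$mixed_case_ok = in_array('Tif', $allowed_file_extensions_for_OBJ, true);
```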

bondjimbond commented 1 year ago

@MarcusBarnes Perfect, thanks!