MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Delimiter problem in CSV Newspapers toolchain? #512

Closed bondjimbond closed 2 years ago

bondjimbond commented 2 years ago

Strange error coming up when I try to process a Newspaper CSV. I can't understand what's wrong with my delimiters. It looks like there's some hidden issue in my CSV file perhaps. Any ideas?

[2021-08-16 15:29:55] ErrorException.ERROR: ErrorException {"message":"preg_match(): Delimiter must not be alphanumeric or backslash","code":{"record_key":"46","item_info":"[object] (stdClass: {\"key\":\"46\",\"Directory\":\"1946-11-20\",\"Identifier\":\"ASMN-001216\",\"Title\":\"Abbotsford Sumas & Matsqui News, November 20, 1946\",\"Date\":\"1946-11-20\"})","issue_directory":"1946-11-20","directory_regex":"\\#/1946\\-11\\-20/\\#","paths":["/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-001.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-002.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-003.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-004.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-005.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-006.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-007.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-008.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-009.tif","/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-010.tif"],"path":"/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-001.tif"},"severity":2,"file":"/Users/brandon/sfuvault/mik/src/filegetters/CsvNewspapers.php","line":161} []
[2021-08-16 15:29:55] ErrorException.ERROR: ErrorException {"message":"problem getting children of record","record_key":"46","details":"[object] (mik\\exceptions\\MikErrorException(code: 0):  at /Users/brandon/sfuvault/mik/mik:105)"} []

CSVs.zip newspapers.ini.zip

MarcusBarnes commented 2 years ago

To possibly assist with debugging, note that the error created by preg_match() happens here: https://github.com/MarcusBarnes/mik/blob/12cdadd9a6e334043679f064f32ecd6c146842b7/src/filegetters/CsvNewspapers.php#L161

bondjimbond commented 2 years ago

@MarcusBarnes do you think it's a code problem, or likely something I'm missing in the .ini file? I can't see anything wrong with my CSV.

MarcusBarnes commented 2 years ago

Not sure yet. I'll share any leads I find for you to follow up on.

bondjimbond commented 2 years ago

Is it applying the directory regex to the metadata for some reason?

\"key\":\"46\",\"Directory\":\"1946-11-20\",\"Identifier\":\"ASMN-001216\

It looks like it's inserting the DIRECTORY_SEPARATOR in between the metadata keys and values for reasons I can't grasp.

MarcusBarnes commented 2 years ago

@bondjimbond Based on the second error message, would you check that the filenames and access permissions under /Volumes/UFV_FILES/UFV-ASMN-1946/1946/ for the directory that corresponds to key 46 are all correct and that there's nothing suspicious there?

Another approach to debugging is to add some print_r statements for $path and $directory_regex just before https://github.com/MarcusBarnes/mik/blob/12cdadd9a6e334043679f064f32ecd6c146842b7/src/filegetters/CsvNewspapers.php#L161 to see if there's anything there that would cause problems. For $directory_regex, we use '#' as the regex delimiter. Is that interacting with the file path corresponding to key 46 (or key 45) in some way?

bondjimbond commented 2 years ago

Hm, I tried print_r both inside and outside of the loop, but nothing printed.

bondjimbond commented 2 years ago

File permissions all look OK: -rwxrwxrwx

MarcusBarnes commented 2 years ago

@bondjimbond Try using dump() instead? https://github.com/MarcusBarnes/mik/blob/master/src/utilities/Dumper.php

bondjimbond commented 2 years ago

How does that work? If I try dump($path) I get fatal error: Uncaught Error: Call to undefined function mik\filegetters\dump()

bondjimbond commented 2 years ago

OK, var_dump worked. Here are the results for $path and $directory_regex:

string(67) "/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-001.tif"
string(18) "\#/1946\-01\-09/\#"
string(67) "/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-001.tif"
string(18) "\#/1946\-01\-16/\#"
[etc]

bondjimbond commented 2 years ago

So based on this error: Delimiter must not be alphanumeric or backslash, is MIK interpreting the backslash that is added by the regex function as a delimiter rather than as an escape character?
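The question above can be checked in isolation. The following is a minimal sketch, not MIK code: it copies the $path and $directory_regex values from the var_dump output and hands them to preg_match() directly. PCRE treats the first character of the pattern string as the delimiter, so a pattern that begins with a backslash is rejected before any matching takes place.

```php
<?php
// Values copied from the var_dump output in this thread (illustrative only).
$path = '/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-001.tif';
$directory_regex = '\#/1946\-01\-09/\#'; // single quotes keep each backslash literal

// The first character of the pattern is what PCRE reads as the delimiter.
$first_char = $directory_regex[0];
var_dump($first_char); // string(1) "\"

// A backslash is not a legal delimiter, so preg_match() raises a warning
// and returns false. The @ only silences the warning for this demonstration.
$result = @preg_match($directory_regex, $path);
var_dump($result); // bool(false)
```

If this sketch reflects what happens at line 161, the backslash in front of the leading '#' is being read as the delimiter, which matches the wording of the error message.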

bondjimbond commented 2 years ago

The last time I used CsvNewspapers it worked -- wondering if the problem comes from a commit between then and now?

I see there's a commit that affects the directory path here: https://github.com/MarcusBarnes/mik/commit/47884d71ad16c44a4dbbbc98cf58edd0879cdc55#diff-35c8aa40361be605efe8d8fa1a65bf7affad5853bfa9e003c5d318196b7f4f8c

Wondering if that might be the cause.

bondjimbond commented 2 years ago

No -- tried reverting that file to the older state, continuing to get the same error.

I also tried rewriting the .ini file from scratch, no change.

bondjimbond commented 2 years ago

The problem has to be here, right? "directory_regex":"\\#1946\\-01\\-09\\#"

I added a before-and-after var_dump to see what happens to the $directory_regex variable here:

        $directory_regex = '#' . DIRECTORY_SEPARATOR . $issue_directory . DIRECTORY_SEPARATOR . '#';
        var_dump($directory_regex);
        $directory_regex = preg_quote($directory_regex);
        var_dump($directory_regex);

and this is what I see:

string(14) "#/1946-12-11/#"
string(18) "\#/1946\-12\-11/\#"
string(14) "#/1946-12-18/#"
string(18) "\#/1946\-12\-18/\#"

While in mik.log the backslash -- which is intended to escape the hyphen -- is doubled up. So it looks like the escaping happens twice somehow, which is probably the cause of this problem.
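One possible explanation for the doubled backslashes in mik.log is that they are not a second round of escaping at all. The sketch below assumes the logger serializes its context as JSON, which the format of the log entries suggests but which is not confirmed in this thread. JSON encoding writes every literal backslash as two backslashes, and decoding restores the original string.

```php
<?php
// Value reported by var_dump after preg_quote() in this thread.
$directory_regex = '\#/1946\-12\-11/\#';

// Assumption: the log context is JSON-encoded with slashes left unescaped,
// as the unescaped file paths in mik.log suggest.
$logged = json_encode(['directory_regex' => $directory_regex], JSON_UNESCAPED_SLASHES);
echo $logged, PHP_EOL; // each backslash appears doubled in the encoded text

// Decoding gives back the string with single backslashes.
$decoded = json_decode($logged, true);
var_dump($decoded['directory_regex'] === $directory_regex); // bool(true)
```

Under that assumption, the value passed to preg_match() is the single-backslash string shown by var_dump, and the doubling is only how the log displays it.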

bondjimbond commented 2 years ago

SUCCESS

I removed the preg_quote line, and now MIK runs successfully.
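Removing preg_quote() works for date-style directory names, but a directory name containing regex metacharacters such as '.' or '+' would then be interpreted as part of the pattern. A possible alternative, sketched below, quotes only the text between the delimiters and adds the '#' delimiters afterwards. The function name buildDirectoryRegex is hypothetical and not part of MIK. Since PHP 7.3, preg_quote() also escapes '#', which may explain why the same code worked on an earlier run.

```php
<?php
// Hypothetical helper, not part of MIK: quote the pattern body, then wrap it
// in '#' delimiters so the delimiters themselves are never escaped.
function buildDirectoryRegex(string $issue_directory): string
{
    $body = DIRECTORY_SEPARATOR . $issue_directory . DIRECTORY_SEPARATOR;
    // The second argument tells preg_quote() to escape '#' inside the body too.
    return '#' . preg_quote($body, '#') . '#';
}

// Path and directory name copied from this thread (illustrative only).
$path = '/Volumes/UFV_FILES/UFV-ASMN-1946/1946/1946-01-09/1946-01-09-001.tif';

$match = preg_match(buildDirectoryRegex('1946-01-09'), $path);
$no_match = preg_match(buildDirectoryRegex('1946-11-20'), $path);
var_dump($match);    // int(1): the path contains /1946-01-09/
var_dump($no_match); // int(0): the path does not contain /1946-11-20/
```

With this arrangement the pattern always starts with '#', so the delimiter error cannot occur, and the directory name is still matched literally.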