LibreCat / Catmandu-MARC

Catmandu modules for working with MARC data
https://metacpan.org/release/Catmandu-MARC

Catmandu::Fix::marc_map and Catmandu::Fix::Inline::marc_map substrings with split #32

Closed cKlee closed 8 years ago

cKlee commented 8 years ago

In the current Catmandu::Fix::Inline::marc_map, substrings are ignored when the split option is present:

marc_map('245/0-1', split:1)

I altered Catmandu::Fix::Inline::marc_map so that substrings are no longer ignored. I didn't open a PR because I don't know whether this behavior is wanted or whether it breaks something.

But I found an error in Catmandu::Fix::marc_map. With the above change I get:

Oops! Global symbol "$0" requires explicit package name at (eval 319) line 1.
Global symbol "$3" requires explicit package name at (eval 319) line 1.
Unmatched right curly bracket at (eval 319) line 1, at end of line
syntax error at (eval 319) line 1, near ";}"

phochste commented 8 years ago

It shouldn't throw an error of course, but I don't know what a combination of split and a substring means.

E.g.

# if
marc_map(245, split:1)     => [ "A nice","example","text" ]
# then what is
marc_map(245/0-3,split:1)  => [ "A ni" ] ?   Or => [ "A ni","exam","text" ] ?

phochste commented 8 years ago

It will be the first option then.

cKlee commented 8 years ago

You were quick to close this issue. I would have voted for the second option. Since '245' is the same as '245abcdefg...', '245/0-3' is the same as '245a/0-3b/0-3c/0-3d/0-3e/0-3f/0-3g/0-3...'.

phochste commented 8 years ago

Well, it depends on the order of operations. Following the same reasoning, 245/0-3 = 245abcdefg/0-3 could also be true. There are three operations in marc_map(245/0-3,split:1).

If it were the second option above, then marc_map(245/0-3, split:0) and marc_map(245/0-3, split:1) would apply the substring very differently. In the first case you would take the substring at the end of the process, in the second case in the middle of the process. That would lead to very different values in the mapping. It feels very weird that the order changes depending on whether you use split:0 or split:1.

It is also about the use case. "Give me the first 80 chars of 245" makes sense if it results in a string of at most 80 chars.
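The two candidate readings differ only in where the substring step sits in the pipeline. As a rough Python model (this is not Catmandu code; the `substring` helper and the inclusive from-to reading of `0-3` are assumptions made for illustration), using the example values from earlier in the thread:

```python
# Model of the two readings of marc_map(245/0-3, split:1).
# Illustration only, not Catmandu's implementation.

subfields = ["A nice", "example", "text"]  # values from the example above


def substring(value, start, end):
    """Return characters start..end inclusive (assumed from-to semantics)."""
    return value[start:end + 1]


# Reading 1: gather all subfields first, take the substring at the end.
substring_last = [substring("".join(subfields), 0, 3)]

# Reading 2: take the substring of each subfield, in the middle of the process.
substring_per_subfield = [substring(s, 0, 3) for s in subfields]

print(substring_last)          # ['A ni']
print(substring_per_subfield)  # ['A ni', 'exam', 'text']
```

Only the first reading guarantees a result of at most four characters in total, which is the property the "first 80 chars" use case relies on.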

cKlee commented 8 years ago

> It is also about the use case. "Give me the first 80 chars of 245" makes sense if it results in a string of at most 80 chars.

marc_map('245/0-79', my.firsteighty)

will do this. Right?

marc_map('245/0-79', my.firsteighty.$append)

same as above, but as an array.

What about this use case: give me the first char of every subfield? That is not possible then, right?

phochste commented 8 years ago

One has to choose. E.g. given this data:

245 $aAA$bBB$cCC
245 $dDD$eEE

Without any agreement on the ordering of calculations there are many possibilities for simple mappings. For instance for string operations:

marc_map(245,data)

results in

AABBCCDDEE

All the subfields are gathered; the 'data' field is by default treated as a string to which new data is appended. When a substring operation is added, there are three possible ways to calculate the result:

marc_map(245/0-0,data)

One, the substring can work on the final total string:

A

Two, the substring could work on the intermediate results for each 245 operation:

AD

Three, the substring could work on each subfield separately:

ABCDE

The current version of Catmandu-MARC does option Two.
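The three string outcomes can be reproduced with a small Python model. This is a sketch of the orderings only, not of Catmandu's internals; the `fields` list and the `first_char` helper are names invented for the illustration, and subfield codes are left out because they do not affect the result.

```python
# Two repeated 245 fields, each a list of subfield values (codes omitted).
fields = [["AA", "BB", "CC"], ["DD", "EE"]]


def first_char(value):
    """Keep only the first character, matching the outcomes listed above."""
    return value[:1]


# Intermediate result per 245 field: ['AABBCC', 'DDEE']
per_field = ["".join(field) for field in fields]

# One: substring of the final total string.
option_one = first_char("".join(per_field))

# Two: substring of each per-field intermediate result.
option_two = "".join(first_char(s) for s in per_field)

# Three: substring of each subfield separately.
option_three = "".join(first_char(sf) for field in fields for sf in field)

print(option_one, option_two, option_three)  # A AD ABCDE
```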

Adding a split function you can have six possible outcomes.

marc_map(245,data,split:1)

could generate, like for strings, an array of all subfields

[ AA, BB, CC, DD, EE ]

or do it for each field separately:

[ [ AA, BB, CC ] , [ DD, EE] ]

It was always the second option in Catmandu-MARC but I prefer the first way (treat arrays like in the string case).
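In the same hypothetical Python model (the `fields` list is an assumption standing in for the two 245 fields, not a Catmandu data structure), the two split shapes are a flattened list versus one list per field:

```python
# Two repeated 245 fields, each a list of subfield values (codes omitted).
fields = [["AA", "BB", "CC"], ["DD", "EE"]]

# First way: one flat array of all subfields, like the string case.
flat = [sf for field in fields for sf in field]

# Second way: one inner array per 245 field.
nested = [list(field) for field in fields]

print(flat)    # ['AA', 'BB', 'CC', 'DD', 'EE']
print(nested)  # [['AA', 'BB', 'CC'], ['DD', 'EE']]
```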

Adding a substring, you get six possibilities based on the data above:

marc_map(245/0-0,data,split:1)

One, substr in case 1 above on the total result:

[ A ]

Two, substr in case 1 above on all the separate array items:

[ A , B , C, D, E ]

Three, substr in case 1 above on the two separate fields:

[ A , D ]

Four, substr in case 2 on each of the array of array items:

[ [ A , B , C ] , [ D, E ] ]

Five, substr in case 2 on each separate part, treated as a whole:

[ [ A ] , [ D ] ]

Six, substr in case 2 on the array as a whole:

[ [A] ]

The current version of Catmandu-MARC does option Three. With or without split you get a similar result: without split you get the string "AD", with split you get the array [A,D].
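For reference, all six outcomes can be written out in a Python model of the data above. This is an illustration under assumptions, not Catmandu code: `fields`, `first_char`, and the `outcomes` dictionary are hypothetical names, and "case 1" and "case 2" refer to the flat and nested split results respectively.

```python
# Two repeated 245 fields, each a list of subfield values (codes omitted).
fields = [["AA", "BB", "CC"], ["DD", "EE"]]


def first_char(value):
    """Keep only the first character, matching the outcomes listed above."""
    return value[:1]


flat = [sf for field in fields for sf in field]

outcomes = {
    # Case 1 (flat array)
    "one": [first_char("".join(flat))],
    "two": [first_char(sf) for sf in flat],
    "three": [first_char("".join(field)) for field in fields],
    # Case 2 (array of arrays)
    "four": [[first_char(sf) for sf in field] for field in fields],
    "five": [[first_char("".join(field))] for field in fields],
    "six": [[first_char("".join(flat))]],
}

for name, value in outcomes.items():
    print(name, value)

# Option Three without split is the same values joined into one string.
print("".join(outcomes["three"]))  # AD
```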

I prefer to see MARC not as a data table where you can pick and choose fields, but as a markup language where you first gather text and do operations on it later.

If you want full control over how the substr is treated, then you need to execute more commands. I don't find it helpful to pack all the different orderings of commands into one mapping function. With three repeated fields the situation gets worse and worse. In cases like these you need to use the marc_each bind to loop over all the MARC fields to get better control over which (sub)fields you want to add to which array.
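The idea of looping over fields explicitly can be sketched in Python. This is only an analogy for a per-field loop, not the marc_each bind itself; the `fields` structure and the `first_char_by_code` function are hypothetical names chosen for the example. It covers the "first char of every subfield" use case by making the loop, and the target array for each subfield code, explicit.

```python
from collections import defaultdict

# Two repeated 245 fields as lists of (subfield code, value) pairs.
fields = [
    [("a", "AA"), ("b", "BB"), ("c", "CC")],
    [("d", "DD"), ("e", "EE")],
]


def first_char_by_code(fields):
    """Collect the first character of every subfield, grouped by code."""
    result = defaultdict(list)
    for field in fields:           # one iteration per repeated field
        for code, value in field:  # explicit choice of target array per code
            result[code].append(value[:1])
    return dict(result)


print(first_char_by_code(fields))
# {'a': ['A'], 'b': ['B'], 'c': ['C'], 'd': ['D'], 'e': ['E']}
```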

What I would like is for the common use cases to be easy to write. E.g.: give me a list of all the ISBN numbers in a record. Or all the 650-subjects.

cKlee commented 8 years ago

Ok then! Substrings are not ignored anymore when split is requested. This is fine, because I need this for fixed fields in marc_spec. How substrings are treated for variable fields and subfields isn't relevant for marc_spec. Issue is solved I think. Thanks!