LibreCat / Catmandu-MARC

Catmandu modules for working with MARC data
https://metacpan.org/release/Catmandu-MARC

$append isn't working in versions > 1.x #49

Closed jorol closed 7 years ago

jorol commented 7 years ago

Error:

$ catmandu convert MARC to YAML --fix 'marc_map(700a,my.authors.$append);remove_field(record)' < camel.mrc
---
_id: 'fol05865967 '
my:
  authors:
  - Christiansen, Tom.Orwant, Jon.
...

Result with version 0.219:

---
_id: 'fol05865967 '
my:
  authors:
  - Christiansen, Tom.
  - Orwant, Jon.
...

Please add a test for marc_map after fixing.

phochste commented 7 years ago

The handling of repeated fields changed between Catmandu::MARC v0.* and v1.*.

It is quite confusing: in MARC one has repeated fields and repeated subfields, in Catmandu one can create strings and arrays, and all of these can be combined.

In version 1.x these things are split up:

1) There is the world of Catmandu: after the MARC fields are processed, where do you want to store the data, in a string or in an array? For this you use:

marc_map(700a,my.authors)

or

marc_map(700a,my.authors.$append)

2) How do you want the MARC fields processed? Without any options, Catmandu::MARC tries as best it can to create one string out of the data found.

 marc_map(700a,my.authors)  == string
 marc_map(700a,my.authors.$append) == string appended to an array

With options it can try to make an array of repeated fields.

 marc_map(700a,my.authors, split:1) == list
 marc_map(700a,my.authors.$append, split:1) == list and put into an array

If you want not only the repeated fields but also the repeated subfields in an array, then you need to add the nested_arrays option:

 marc_map(700a,my.authors, split:1, nested_arrays:1 ) == list of lists
 marc_map(700a,my.authors.$append, split:1, nested_arrays:1) == list of lists put into an array

This was done to make life simpler for most cases, and still provide a way to get all the fields. E.g. extracting a list of all ISBN numbers from a MARC record should be easy in Catmandu without needing to parse lists within lists.

jorol commented 7 years ago

Thanks!

cKlee commented 7 years ago

@phochste Just one more question about this. Why did you introduce the append option, as in https://github.com/LibreCat/Catmandu-MARC/blob/master/lib/Catmandu/MARC.pm#L39 ? I thought the $append functionality coming from Catmandu::Util works correctly. And what about $prepend? I'm a bit confused.

phochste commented 7 years ago

I think I need to rename this option to get rid of the confusion. It is an internal parameter to make the JSONpath '$append' work as advertised.

When you have an $append in your JSON path, you want to have an ARRAY as output, whatever the MARC processor did.

This -append option would be better named -force_array or something similar. And you are right, I forgot to do this for JSON paths with $prepend as well: they also need a -force_array to really create arrays (whatever the output of the MARC processor).

cKlee commented 7 years ago

I'm sorry to stress this once again. I re-read the mapping rules in the wiki, and to me they look a little inconsistent. Maybe you could enlighten me as to why you chose this behaviour:

We have MARC data

245  $aTitle / $cName
500  $aA$aB$aC$xD
650  $aAlpha
650  $aBeta
650  $aGamma
999  $aX$aY
999  $aZ

OK, now the fix:

marc_map(500,note.$append)

outputs

note: [ "ABCD" ]

That's fine I think. When assuming that var note is already an array with an element like:

note: ["XXXXX"]

then the fix

marc_map(500,note.$append)

would result in

note: ["XXXXX",  "ABCD" ]

due to the $append, right?

Now the same with repeated fields. The mapping rules say

marc_map(650a,subject.$append)

must result in

subject: [ "Alpha", "Beta" , "Gamma" ]

And this is where I'm confused. If

marc_map(650a,subject)

outputs

subject: "AlphaBetaGamma"

and

marc_map(650a,subject, split:1)

outputs

subject: [ "Alpha", "Beta" , "Gamma" ]

shouldn't then

marc_map(650a,subject.$append)

result in

subject: ["AlphaBetaGamma"]   

???

As I understand the $append functionality, it does not control the data structure which is appended to the array. It only appends the data (independent of its structure) to the existing array (or creates a new array first).

phochste commented 7 years ago

Sent version 1.09 to CPAN.

cKlee commented 7 years ago

Does -force_array have any impact?

https://github.com/LibreCat/Catmandu-MARC/blob/master/lib/Catmandu/MARC.pm#L39

and then

https://github.com/LibreCat/Catmandu-MARC/blob/master/lib/Catmandu/MARC.pm#L146-L148

??

phochste commented 7 years ago

@cKlee When processing MARC you can have these types of situations:

These are the raw results of the parsing. The marc_path options split, join and nested_arrays decide how to process these arrays-of-arrays.

The results above are then sent to the JSONPath processing; you can see them as a temporary result. For instance, the 650a processing leads to this result (before JSONPath processing):

_tmp:
      - Alpha
      - Beta
      - Gamma

The JSONPath operation is nothing more than:

move_field(_tmp.*, subject.$append)
remove_field(_tmp)

subject:
     - Alpha
     - Beta
     - Gamma

But it would lead to strange results when you don't use the append option:

move_field(_tmp.*, subject)
remove_field(_tmp)

subject: Alpha

That is why, for a string target, the join function is applied to the subfields of each field separately and then across the fields, to get

subject: AlphaBetaGamma

phochste commented 7 years ago

@cKlee On the second question: yes, -force_array has an impact on inline Perl processing of MARC fixes, which can give two kinds of results depending on whether you want an array as output or not:

my @array = marc_map($data,"650a");
my $str      = marc_map($data,"650a");

The -force_array option stands in for the JSONPath language, which is absent in these inline fixes.

cKlee commented 7 years ago

@phochste What you wrote seems logical to me, but I really want to understand this:

For the marc path ...acx on data

245  $aTitle / $cName
500  $aA$aB$aC$xD
650  $aAlpha
650  $aBeta
650  $aGamma
999  $aX$aY
999  $aZ

what does the 'situation' look like? Is it this:

[
    ["Title", "Name"],    # field 245
    ["A", "B", "C", "D"], # field 500
    ["Alpha"],            # first field 650
    ["Beta"],             # second field 650
    ["Gamma"],            # last field 650
    ["X","Y"],            # first field 999
    ["Z"]                 # last field 999
]

I understand that the join operation is the default behavior. Therefore the fix

marc_map(...acx, data)

should become

data: "TitleNameABCDAlphaBetaGammaXYZ"

right? What about the fix

 marc_map(...acx, data.$append)

Should this become

data: ["TitleName","ABCD","Alpha","Beta","Gamma","XY","Z"]

or

data: ["TitleNameABCDAlphaBetaGammaXYZ"]

?

The latter seems the logical choice to me, because $append appends the joined string as an array element. But this conflicts with the mapping rules, as in this fix:

marc_map(650a,subject.$append) --> subject: [ "Alpha", "Beta" , "Gamma" ]

With the other possibility, $append is doing something unexpected with the data. So, where is the error in my reasoning?

phochste commented 7 years ago

Yes the join behavior is default. Starting with:

[
    ["Title", "Name"],    # field 245
    ["A", "B", "C", "D"], # field 500
    ["Alpha"],            # first field 650
    ["Beta"],             # second field 650
    ["Gamma"],            # last field 650
    ["X","Y"],            # first field 999
    ["Z"]                 # last field 999
]

the inner arrays will be joined with the default join:"" to:

[ "TitleName" ,
  "ABCD",
  "Alpha",
  "Beta",
  "Gamma",
  "XY",
  "Z"
]

Now the mapping should follow. When the mapping is to a string marc_map(...acx,data) this array is joined again with the default join:"" to:

 data: TitleNameABCDAlphaBetaGammaXYZ

If the data already contained data, it will be overwritten.

If the mapping is to an array marc_map(...acx,data.$append), like in the previous examples, the array values are copied and appended to the array:

data:
  - TitleName
  - ABCD
  - Alpha
  - Beta
  - Gamma
  - XY
  - Z

If data already contained array items, then this array will become bigger.
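
The two stages described above can be modelled in a few lines. This is only a minimal Python sketch of the logic as explained in this thread, not Catmandu::MARC's actual implementation; the names `join_fields` and `map_to_target` and the `append` flag (standing in for the `.$append` path suffix) are hypothetical.

```python
def join_fields(fields, join=""):
    """Stage 1: join the subfield values of each field separately."""
    return [join.join(subfields) for subfields in fields]


def map_to_target(record, key, fields, append=False, join=""):
    """Stage 2: store the per-field strings under record[key].

    append=False models marc_map(..., data): join again and overwrite.
    append=True  models marc_map(..., data.$append): extend the array
    (assumes record[key], if present, is already a list).
    """
    per_field = join_fields(fields, join)
    if append:
        record.setdefault(key, []).extend(per_field)
    else:
        record[key] = join.join(per_field)
    return record


# Parsed values for the path "...acx", as listed in this thread.
fields = [
    ["Title", "Name"],               # field 245
    ["A", "B", "C", "D"],            # field 500
    ["Alpha"], ["Beta"], ["Gamma"],  # fields 650
    ["X", "Y"], ["Z"],               # fields 999
]

as_string = map_to_target({}, "data", fields)
as_array = map_to_target({"data": ["XXXXX"]}, "data", fields, append=True)
print(as_string["data"])  # TitleNameABCDAlphaBetaGammaXYZ
print(as_array["data"])   # existing "XXXXX" followed by the per-field strings
```

In this model the same `join` separator is used at both stages, which matches the statement in this thread that the join option cannot distinguish between subfields and fields.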

The examples are all the same. They mimic a use case: making it easy to get, for instance, all the ISSN and ISBN numbers from a record into a list with:

marc_map(02.a,isxn.$append)

instead of doing complicated parsing of strings in arrays.

cKlee commented 7 years ago

@phochste Although something is happening with $append that I would not primarily expect, I'll follow you here. There is an inner logic in your approach, but it has to be made more explicit. Therefore I would like to extend the wiki page with some of the explanations you gave in this thread. And maybe the fix PODs should refer to the wiki mapping rules page.

Just to satisfy my inquisitiveness: what would be the fix for the 999 field, when I want to append the string "X; Y; Z" to an array?

arr: [ "X; Y; Z"]

Obviously I have to use the -join option with '; '. But I guess this could not be achieved with a single fix, right?

phochste commented 7 years ago

Good idea!

Currently you can't do this in one fix indeed. This has to do with the ambiguity of the join parameter. Does it work on the innermost array (the subfields) or the outermost array (the fields)? Answer: both; one can't differentiate between them.

I can imagine a future version of the fix that could get to your result with a combination of a split and an explicit join:

marc_map(999a,split:1, join:";")

and this is not that hard to create. E.g. by writing in marc_map instead of

    if ($split) {
        $vals = [ $vals ];
    }

this

    if ($split) {
        $vals = [ $vals ];
        $vals = [[ join($join_char, @{$vals->[0]}) ]] if $join_char;
    }

Or we can make it very explicit with inner-join and outer-join options. It is tricky to balance easy use cases against easy syntax.
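
To make the ambiguity concrete, here is a small Python sketch of what separate inner and outer separators could look like. The parameter names `inner_join` and `outer_join` are hypothetical (they are only floated as an idea above and are not existing marc_map options), and the code models the idea rather than Catmandu::MARC itself.

```python
def join_nested(fields, inner_join="", outer_join=None):
    """Join the subfields of each field with inner_join.

    If outer_join is given, the per-field strings are joined again into
    one string; otherwise the list of per-field strings is returned.
    """
    per_field = [inner_join.join(subfields) for subfields in fields]
    if outer_join is None:
        return per_field
    return outer_join.join(per_field)


# The two 999 fields of the sample record: $aX$aY and $aZ.
fields_999 = [["X", "Y"], ["Z"]]

arr = []
arr.append(join_nested(fields_999, inner_join="; ", outer_join="; "))
print(arr)                                       # ['X; Y; Z']
print(join_nested(fields_999, inner_join="; "))  # ['X; Y', 'Z']
```

With two separate parameters, the requested `arr: [ "X; Y; Z" ]` becomes a single call, at the cost of one more option to learn.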

cKlee commented 7 years ago

Hm, these are somehow too many options, I think. While thinking about this: what about the nested_arrays option? It only works in conjunction with the split option, right? So why do we need the split option set anyway? I mean, instead of

marc_map(999a,local, split:1, nested_arrays:1)

a simple

 marc_map(999a,local, nested_arrays:1)

or just

marc_map(999a,local, nested:1)

should be sufficient, right?

cKlee commented 7 years ago

I'll merge the wiki rules pages like this, if it's ok with you:

Single field, no subfield repetition

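| Fix | Result |
| --- | --- |
| marc_map(245, title) | title: "Title / Name" |
| marc_spec(245, title) | title: "Title / Name" |
| marc_map(245a, title) | title: "Title / " |
| marc_map(245$a, title) | title: "Title / " |
| marc_spec(245$a, title) | title: "Title / " |
| marc_map(245ac, title) | title: "Title / Name" |
| marc_map(245$a$c, title) | title: "Title / Name" |
| marc_spec(245$a$c, title) | title: "Title / Name" |
| marc_map(245ca, title) | title: "Title / Name" |
| marc_map(245$c$a, title) | title: "Title / Name" |
| marc_spec(245$c$a, title) | title: "Title / Name" |
| marc_map(245ca, title, pluck:1) | title: "NameTitle / " |
| marc_map(245$c$a, title, pluck:1) | title: "NameTitle / " |
| marc_spec(245$c$a, title, pluck:1) | title: "NameTitle / " |
| marc_map(245ca, title, pluck:1, join:" ") | title: "Name Title / " |
| marc_map(245$c$a, title, pluck:1, join:" ") | title: "Name Title / " |
| marc_spec(245$c$a, title, pluck:1, join:" ") | title: "Name Title / " |
| marc_spec(245$a, title, invert:1) | title: "Name" |

phochste commented 7 years ago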

Yes, nested_arrays:1 should work with or without a split. I think it is a bug that one also needs to write an explicit split:1 to make this work. As for the wiki, I'm all for making it more intuitive to read.

cKlee commented 7 years ago

While working on the wiki, I saw the rule

marc_map(...a,all.$append)  --> all: [ "Title / ABCAlphaBetaGammaXYZ" ]

But it should be

all:
  - Title / 
  - ABC
  - Alpha
  - Beta
  - Gamma
  - XY
  - Z

right?

cKlee commented 7 years ago

These are the rules I worked out, to be stated at the beginning of the wiki mapping rules page:

When mapping MARC data to fix variables, Catmandu::MARC will do this on the basis of two rules:

Rule 1

If option split or nested_arrays is not set to 1, Catmandu::MARC will join values at the innermost array(s).
Depending on the given data structure, the join will do something like

Rule 2

If option split or nested_arrays is set to 1 or if .$append or .$prepend is suffixed to the fix variable, Catmandu::MARC will always map data as an array. Otherwise data will become a joined string of all array elements.

@phochste Does that make sense?
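
Read as a decision procedure, the two proposed rules could be sketched like this. It is a Python model of the rules as worded above, with hypothetical function names, and it reads Rule 1 as "neither split nor nested_arrays is set"; it does not call Catmandu and makes no claim about how Catmandu::MARC implements them.

```python
def joins_innermost(split=False, nested_arrays=False):
    """Rule 1: innermost arrays are joined unless split or nested_arrays is set."""
    return not (split or nested_arrays)


def maps_to_array(path, split=False, nested_arrays=False):
    """Rule 2: the target becomes an array if split or nested_arrays is set,
    or if the path ends in .$append or .$prepend; otherwise a joined string."""
    forced_by_path = path.endswith((".$append", ".$prepend"))
    return split or nested_arrays or forced_by_path


for path, opts in [
    ("subject", {}),
    ("subject", {"split": True}),
    ("subject.$append", {}),
    ("local", {"nested_arrays": True}),
]:
    kind = "array" if maps_to_array(path, **opts) else "string"
    print(f"{path!r} {opts} -> {kind}, inner join: {joins_innermost(**opts)}")
```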

cKlee commented 7 years ago

@phochste I'm playing around with the mapping_rules tests to add some more unusual cases. I created this one to test whether $append works when the variable is already an array:

note 'marc_map(245,title.$append)     title: ["first", "Title / Name" ]';
{
    my $importer = Catmandu->importer(
        'MARC',
        file => \$mrc,
        type => 'XML',
        fix  => 'set_field(title.$first, "first"); marc_map(245,title.$append); retain_field(title)'
    );
    my $record = $importer->first;
    is_deeply $record->{title}, ['first', 'Title / Name'], 'marc_map(245,title.$append)';
}

but I got

not ok 2 - marc_map(245,title.$append)
#   Failed test 'marc_map(245,title.$append)'
#   at t/test_22-mapping_rules.t line 65.
#     Structures begin differing at:
#          $got->[1] = Does not exist
#     $expected->[1] = 'Title / Name'

DataDumper on $record gives me:

\ {
    title   [
        [0] "first"
    ]
}

So 'Title / Name' is lost. Or does my fix have an error?

Happy Weekend!

cKlee commented 7 years ago

@phochste Ah! My mistake. Instead of set_field it must be add_field. But why is set_field not working?

phochste commented 7 years ago

My laptop is in repair so I can't test a lot today. But set_field can only update an existing field: title.$first (aka title.0) doesn't exist and so cannot be set.

cKlee commented 7 years ago

@phochste Sorry, sorry, sorry, but ...

Should not

marc_map(500,note.$append, join_char:'#')

become

note: ['A#B#C#D']

and not

note: ['ABCD']

??

phochste commented 7 years ago

@cKlee Yes it does in the current version of Catmandu. Doesn't it at your end?

marc_map(500,note.$append,join:'#')

gives :

note:

cKlee commented 7 years ago

Nope. Catmandu v1.05 and Catmandu::MARC 1.09

note 'marc_map(500,note.$append, join_char:"#")  note: [ "A#B#C#D" ]';
{
    my $importer = Catmandu->importer(
        'MARC',
        file => \$mrc,
        type => 'XML',
        fix  => 'marc_map(500,note.$append, join_char:"#"); retain_field(note)'
    );
    my $record = $importer->first;
    is_deeply $record->{note}, ['A#B#C#D'], ' marc_map(500,note.$append, join_char:"#")';
}

gives me

# marc_map(500,note.$append, join_char:"#")  note: [ "A#B#C#D" ]
not ok 5 -  marc_map(500,note.$append, join_char:"\#")
#   Failed test ' marc_map(500,note.$append, join_char:"\#")'
#   at t/test_22-mapping_rules.t line 429.
#     Structures begin differing at:
#          $got->[0] = 'ABCD'
#     $expected->[0] = 'A#B#C#D'

phochste commented 7 years ago

@cKlee :) join_char:"#" --->join:"#"

cKlee commented 7 years ago

:flushed: sorry and thanks!