gbv / Catmandu-PICA

Catmandu modules for working with PICA+ data
https://metacpan.org/release/Catmandu-PICA

Double \x1D added when format is "normalized" #57

Closed: powerriegel closed this issue 6 years ago

powerriegel commented 6 years ago

Hello, I'm using your Perl module to convert MARC21 files into PICA+ files. We use the normalized format because it is directly supported by our library system.

#!/usr/bin/perl

use strict;
use utf8;
use warnings;

use PICA::Record;
use PICA::Writer;
use PICA::Field;

use Encode qw(encode decode);

my $writer = PICA::Writer->new('tests/out.pica', format => 'normalized');
my $field  = PICA::Field->new('021A');
my $record = PICA::Record->new();

$field->add('a', 'Foo');
$field->add('d', 'Bar');

$record->appendif($field);

$writer->write('', $record);
$writer->write('', $record);
$writer->write('', $record);

$writer->end();

print "Pica file written";
1;

Produces the output:


021A aFoodBar


021A aFoodBar


021A aFoodBar

These rectangles between the records are \x1D characters, which act as record separators here. CBS (our library system) has problems when two \x1D characters appear in a row, so we have to remove one of them.
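Until the writer is fixed, the doubled separator could be removed in a post-processing step. The following is only a minimal sketch: `collapse_gs` is a hypothetical helper name, and the sample string is an assumed byte layout for illustration, not the verified output of PICA::Writer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Collapse every run of two or more \x1D bytes into a single \x1D.
# (collapse_gs is a hypothetical helper, not part of any PICA module.)
sub collapse_gs {
    my ($data) = @_;
    $data =~ s/\x1D{2,}/\x1D/g;
    return $data;
}

# Assumed sample: three records, each followed by a doubled \x1D.
my $sample = join '', map { "021A \x1FaFoo\x1FdBar\x1E\x1D\x1D" } 1 .. 3;
my $clean  = collapse_gs($sample);

# Count the \x1D bytes before and after the cleanup.
my $before = ( $sample =~ tr/\x1D// );
my $after  = ( $clean  =~ tr/\x1D// );
print "$before separators before, $after after\n";
```

To apply this to a real file, read it in `:raw` mode, pass the content through the helper and write it back, so that no encoding layer touches the control bytes.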

jorol commented 6 years ago

I can reproduce your problem, but you are using PICA::Record, an outdated and deprecated module that is not supported by Catmandu. Please submit bug reports to https://github.com/gbv/PICA-Record. I recommend using PICA::Data (CPAN, GitHub) instead. It is actively maintained and developed, and it supports several PICA formats. A simple example for your use case:

#!/usr/bin/env perl

use strict;
use utf8;
use warnings;

use PICA::Writer::Plain;
use PICA::Writer::Plus;
use PICA::Writer::Binary;

my $record = {
    _id => '123',
    # a PICA record is an array of arrays
    # each PICA field array consists of a field tag, an occurrence 
    # and a sequence of subfield indicators and subfield values
    record => [
        [ '001U', '', '0', 'utf8' ],
        [ '021A', '', 'a', 'Foo', 'd', 'Bar' ]
    ]
};

my $writer_binary = PICA::Writer::Binary->new('out_binary.pica');

$writer_binary->write($record);
$writer_binary->write($record);
$writer_binary->write($record);

my $writer_plain = PICA::Writer::Plain->new('out_plain.pica');

$writer_plain->write($record);
$writer_plain->write($record);
$writer_plain->write($record);

my $writer_plus = PICA::Writer::Plus->new('out_plus.pica');

$writer_plus->write($record);
$writer_plus->write($record);
$writer_plus->write($record);

print "Pica files written\n";

powerriegel commented 6 years ago

Ok, I've tried it with your code example. out_plus.pica looks like this:

`001U 0utf8021A aFoodBar

001U 0utf8021A aFoodBar

001U 0utf8021A aFoodBar`

So, there are no set separators and no field separators.

The Binary file might be accepted by CBS but it's not human readable. Isn't there a way to add those fields in the plus format?

jorol commented 6 years ago

out_plus.pica contains unit separators (0x1F) and record separators (0x1E); you can see them in edit mode or with a hex viewer. The line feed is used as the group separator. The plus and binary formats are not designed for human readability; use plain for that. I will check whether we could implement a generic writer where everyone can set their own separators.
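To check for these invisible bytes without a hex viewer, the control characters can be replaced with visible labels. This is a rough sketch: `show_separators` is a hypothetical helper, and the sample record assumes the layout described above (0x1F before each subfield, 0x1E after each field, a line feed after each record).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# ASCII names of the control characters mentioned in this thread.
my %LABEL = (
    "\x1D" => '<GS>',    # group separator
    "\x1E" => '<RS>',    # record separator
    "\x1F" => '<US>',    # unit separator
    "\x0A" => "<LF>\n",  # line feed, kept as a real newline for readability
);

# Replace each separator byte with its visible label.
# (show_separators is a hypothetical helper, not part of PICA::Data.)
sub show_separators {
    my ($bytes) = @_;
    $bytes =~ s/([\x1D\x1E\x1F\x0A])/$LABEL{$1}/g;
    return $bytes;
}

# Assumed sample record in the layout described above.
my $record = "001U \x1F0utf8\x1E021A \x1FaFoo\x1FdBar\x1E\x0A";
print show_separators($record);
# prints: 001U <US>0utf8<RS>021A <US>aFoo<US>dBar<RS><LF>
```

For a real file such as out_plus.pica, open it with the `:raw` layer and pass each chunk through the helper.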