gbv / Catmandu-PICA

Catmandu modules for working with PICA+ data
https://metacpan.org/release/Catmandu-PICA

Module does not provide support for the normalized CBS pica format with delimiters #78

Closed powerriegel closed 1 year ago

powerriegel commented 2 years ago

Our CBS uses a PICA+ format with the delimiters \x1d, \x1e, \x1f that is still human-readable, i.e. with a line break after each field. Catmandu seems unable to produce this format. We have used this format for many years because it can be loaded directly into CBS but remains human-readable, which makes it easier to find errors.

Digi20-20220118.0.zip

jorol commented 2 years ago

The "Generic" exporter should support this. You can install it with:

$ cpanm Catmandu::PICA

Please be aware that we did a major release of PICA::Data with some breaking changes. Perhaps you should install older versions:

$ cpanm PICA::Data@1.35
$ cpanm Catmandu::PICA@1.07

You could also use sed to create such a format:

PICA Binary:

picabinary.txt

Add line breaks after '\x1d' and '\x1e' with sed:

$ sed -E 's/\x1e/\x1e\n/g;s/\x1d/\x1d\n/g' picabinary.txt > picabinary_with_linebreaks.txt

powerriegel commented 2 years ago

It's correct that the latest version is unable to create the CBS-compliant normalized PICA+ format. I use PICA::Data version 2.02 and Catmandu::PICA version 1.08; generic format works in this version but creates a file without delimiters.

Working with older versions is not a good idea in a production environment, since sooner or later you have to upgrade and then reintroduce the problem.

The PICA Plus format is defined here: https://wiki-cbs.oclc.org/wiki/images/Software_for_Data_Import.pdf (chapter 2.1):

All data is on lines of text, terminated by a single LF character (ASCII 10); this may be omitted on the last line

and

If the record is not empty, the first character on the next line should be a tag separator character (ASCII 30), followed by a tag name (with or without occurrence), a space and a subfield separator (ASCII 31). The rest of the record is defined by repeated tag and subfield separators and the text between them. The record is considered complete at end of file, or if a new record separator is found.

Of course, I can use the sed trick, but I consider such workarounds dirty. I don't understand why the official format is not supported.
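To make the quoted layout concrete, here is a minimal sketch in plain Perl (core only, no Catmandu or PICA::Data). It writes each field on its own line as tag separator, tag, space, then subfield separator plus code and value. Two things are assumptions of the sketch and not taken from the PDF excerpt: that a record starts with \x1d on a line of its own, and the in-memory record layout used by the hypothetical helper `serialize_record`.

```perl
use strict;
use warnings;

# Separators named in this thread (ASCII 29, 30, 31).
my ($RS, $FS, $SF) = ("\x1d", "\x1e", "\x1f");

# Hypothetical helper: a record is a list of fields, and each field is
# [ tag, code => value, ... ]. This layout is an assumption made for
# the sketch; it is not the PICA::Data record structure.
sub serialize_record {
    my ($fields) = @_;
    my $out = "$RS\n";    # assumed: record separator on its own line
    for my $field (@$fields) {
        my ($tag, @subfields) = @$field;
        my $line = "$FS$tag ";    # tag separator, tag name, space
        while (my ($code, $value) = splice(@subfields, 0, 2)) {
            $line .= "$SF$code$value";    # subfield separator, code, value
        }
        $out .= "$line\n";    # every line is terminated by a single LF
    }
    return $out;
}

my $record = [
    [ '002C', a => 'Text', b => 'txt' ],
    [ '010@', a => 'eng' ],
];
print serialize_record($record);
```

The control characters are invisible in most terminals, so piping the output through `od -c` or a hex viewer is an easy way to check where they ended up.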

nichtich commented 2 years ago

"generic format works in this version but creates a file without delimiters."

Generic format can support additional line breaks after each field, so which delimiter is missing?

"I don't understand why the official format is not supported"

Because the definition of "official" in PICA is subjective, as no public validator has existed. Over the years people have come up with many slightly differing practices (and continue to do so, e.g. https://github.com/gbv/PICA-Data/issues/83) and we have to deal with it. Catmandu-PICA has grown to support more and more of these practices as they came to the authors' attention. Given the documentation at https://wiki-cbs.oclc.org/wiki/images/Software_for_Data_Import.pdf I realize that the serialization format described is actually "official" enough to be supported as yet another serialization form, especially because it includes other elements not supported by PICA::Data so far, such as comments.

I see two options:

P.S.: See https://github.com/gbv/PICA-Data/issues/128 for a proposed implementation.

powerriegel commented 2 years ago

Sorry, but I can't follow you.

use Catmandu;

# $ifile and $ofile hold the input and output paths (set elsewhere).
my $importer = Catmandu->importer(
    'PICA',
    type        => 'XML',
    file        => $ifile,
    skip_errors => 1
);
my $exporter = Catmandu->exporter(
    'PICA',
    type => 'normalized',
    file => $ofile
);
$exporter->add_many($importer);
$exporter->commit();

This fails with: unknown PICA parser type: normalized at /usr/local/share/perl/5.30.0/Catmandu/Exporter/PICA.pm line 21.

With type "generic", the code above outputs the following; there are just no delimiters:

002C $aText$btxt
002D $aComputermedien$bc
002E $aOnline-Ressource$bcr
004V $010.1159/000491814
005A $01938-2650
006X $ikarger$0491814
006X $ikarger$030396186
006X $ikarger$0Acta Cytologica 2019;63:10–16
006X $iEPF$0e2532e39-0765-4b2f-8b3b-2fce9291afa7
010@ $aeng
010E $arda
011@ $a2019
...

Versions:

$ perl -MPICA::Data -le 'print $PICA::Data::VERSION'
2.02
$ perl -MCatmandu::PICA -le 'print $Catmandu::PICA::VERSION'
1.08

nichtich commented 2 years ago

Catmandu::Exporter::PICA supports since version 1.08. Does this configuration in catmandu.yaml result in the normalized CBS pica format?

exporter:
  norm:
    package: PICA
    options:
      type: generic
      subfield_indicator: "\x1f"
      field_separator: "\x1e\n"
      record_separator: "\n"

Careful reading of the PDF shows that this will not exactly be the normalized format because the record separator is between records instead of before each record. The document also states that newlines between fields are optional.

powerriegel commented 2 years ago

Where do I have to store that catmandu.yaml?

BTW: The binary format seems to have a bug, too:

002@ aOsx002C aTextbtxt002D aComputermedienbc002E aOnline-Ressourcebcr004V 010.1159/000491814005A 01938-2650006X ikarger0491814006X ikarger030396186006X ikarger

You can see that every field is preceded by \x1e except the first field, 002@. This is not a copy-and-paste mistake.

nichtich commented 2 years ago

See https://metacpan.org/pod/Catmandu#CONFIG for Catmandu configuration.

The binary format also places the record and field separators in between records and fields, respectively. This is in line with ISO 2709, where \x1e is used as the field terminator (FT).
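Since both placements of \x1e come up in this thread (in front of every field, or only between fields), a reader can be written to accept either. The following is a minimal sketch in plain Perl, not the PICA::Data parser. The helper `split_fields` and its return layout are hypothetical, and it assumes the input is a single record with the record separator already removed and no line breaks inside subfield values.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into fields of the form
# [ tag, code, value, code, value, ... ], accepting \x1e either in
# front of every field or only between fields.
sub split_fields {
    my ($record) = @_;
    $record =~ s/\n//g;    # line breaks between fields are optional
    my @fields;
    for my $chunk (split /\x1e/, $record) {
        next unless length $chunk;    # a leading \x1e yields an empty chunk
        my ($tag, @subfields) = split /\x1f/, $chunk;
        $tag =~ s/\s+$//;             # drop the space after the tag
        my @pairs;
        for my $sf (@subfields) {
            push @pairs, substr($sf, 0, 1), substr($sf, 1);    # code, value
        }
        push @fields, [ $tag, @pairs ];
    }
    return @fields;
}

my $between = "002\@ \x1faOsx\x1e002C \x1faText\x1fbtxt";
my $before  = "\x1e002\@ \x1faOsx\n\x1e002C \x1faText\x1fbtxt\n";
for my $input ($between, $before) {
    # Both inputs print: 002@ a Osx | 002C a Text b txt
    print join(' | ', map { join ' ', @$_ } split_fields($input)), "\n";
}
```

Skipping empty chunks is what makes the first field come out the same in both variants; a stricter reader would instead reject whichever placement it does not expect.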

powerriegel commented 2 years ago

generic mode with configuration

    use Cwd qw(getcwd);
    use Catmandu;

    # Load catmandu.yml from the config/ directory below the working directory.
    my $cwd = getcwd();
    Catmandu->load("$cwd/config/");
    my $importer = Catmandu->importer(
        'PICA',
        type        => 'XML',
        file        => $ifile,
        skip_errors => 1
    );
    my $exporter = Catmandu->exporter(
        'PICA',
        type => 'generic',
        file => $ofile
    );
    $exporter->add_many($importer);
    $exporter->commit();

and the above catmandu.yml in the loaded path still leads to a file without delimiters. I also tried replacing "norm" by "default".

nichtich commented 2 years ago

my $exporter = Catmandu->exporter('norm', file => $ofile) should do and Catmandu->load("$cwd/config/") can be omitted.

nichtich commented 2 years ago

PICA::Data 2.03 adds support for serializing the PICA type "import". A new release of Catmandu::PICA should be made once PICA::Data 2.04 has been released with support for parsing this format as well.

nichtich commented 1 year ago

Requires https://github.com/gbv/PICA-Data/issues/129

nichtich commented 1 year ago

Duplicate issue: #88, solved with upcoming release 1.15, using PICA::Data 1.20