LibreCat / Catmandu

Catmandu - a data processing toolkit
https://librecat.org

Character encoding output question #365

Closed jasloe closed 5 years ago

jasloe commented 5 years ago

I am working with a dataset that contains some poorly encoded strings, for example:

=001  TR1311
=245  10$aÀ la Albéniz$h[electronic resource]

I am passing these records through a lookup with values containing the correct character set and encoding:

key,value
TR1311,À la Albéniz

e.g.:

marc_map(001,identifier)
lookup(identifier,'lookup.csv')
marc_replace_all('245',a,$.identifier)

This runs fine; however, the output is not what I was expecting:

=001  TR1311
=245  10$a{copy} l{deg} la Alb{caron}niz Alb{copy}{flat}niz$h[electronic resource]

I'm not entirely clear what's going on here. All of the resources I am working with are in UTF-8. Moreover, I understand Catmandu uses UTF-8 by default. I've tried converting from MARCMaker to ISO and vice versa without any luck. Ideas? Rather stumped....

phochste commented 5 years ago

@jasloe Indeed, Catmandu uses UTF-8 by default. This page has some hints on how to preprocess files that have the wrong encoding: https://metacpan.org/pod/release/HOCHSTEN/Catmandu-MARC-1.251/lib/Catmandu/MARC/Tutorial.pod

Your fixes use the marc_replace_all command incorrectly. This Fix needs three arguments:

  1. The MARC path you want to fix
  2. A regular expression of the parts of the MARC (sub)fields you want to replace
  3. The replacement string

In your case, marc_replace_all should have been written as:

marc_replace_all('245a','^.*$',$.identifier)

Or, with the marc_set Fix, this could be written more simply as:

marc_set(245a,$.identifier)
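
Putting the pieces of this thread together, the whole fix file might look like the sketch below. It only combines the marc_map and lookup lines from the question with the marc_set suggestion above; the final remove_field line is an optional addition that was not part of the original fixes, and the sketch has not been checked against the dataset discussed here.

```
# Copy the control number (field 001) into a temporary 'identifier' field
marc_map(001,identifier)

# Replace the identifier with the matching value from lookup.csv (key,value columns)
lookup(identifier,'lookup.csv')

# Overwrite subfield a of field 245 with the looked-up title
marc_set(245a,$.identifier)

# Optional, not in the original fixes: drop the temporary field again
remove_field(identifier)
```

Assuming the fixes are saved as fixes.txt and the records are in MARCMaker format in records.mrk (both file names are hypothetical), it could be run with something like: catmandu convert MARC --type MARCMaker --fix fixes.txt to MARC --type MARCMaker < records.mrk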

nics commented 5 years ago

Can this be closed?