LibreCat / Catmandu

Catmandu - a data processing toolkit
https://librecat.org

Character encoding broken in MARC-in-JSON export #395

Closed CaptSolo closed 1 year ago

CaptSolo commented 1 year ago

When converting UTF-8 encoded MARC-XML records to MARC-in-JSON, the encoding of the resulting JSON is broken: for example, Latvian special characters are distorted and no longer appear as they did in the MARC-XML.

Command: catmandu convert MARC --type XML to MARC --type MiJ < 1_record_marc.xml > 1_record_marc.json

The MARC-XML file is UTF-8 encoded, and the JSON output should also be UTF-8 by default.

Special characters in the XML file (attached to this ticket):

      <datafield tag="670" ind1=" " ind2=" ">
        <subfield code="a">Wyss, J.D. Šveices Robinsonu ģimene, 1996</subfield>
      </datafield>

Same characters after conversion to JSON:

{"670":{"ind2":" ","ind1":" ","subfields":[{"a":"Wyss, J.D. Šveices Robinsonu ģimene, 1996"}]}}

Could you suggest how to fix this problem?


XML file (I had to change its extension to TXT so that GitHub would accept the upload): 1_record_marc.xml.txt
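
For readers trying to diagnose similar output, the following is a minimal Python sketch of one common way such distortion arises: UTF-8 bytes being misread as Latin-1 and then encoded again. The ticket does not state the actual root cause, so this failure mode is an assumption, and `looks_double_encoded` is a hypothetical helper name, not part of Catmandu.

```python
# Sample value taken from the 670$a subfield in this ticket.
original = "Wyss, J.D. Šveices Robinsonu ģimene, 1996"

# Assumed failure mode: the UTF-8 bytes are interpreted as Latin-1,
# so every multi-byte character turns into two unrelated characters.
garbled = original.encode("utf-8").decode("latin-1")


def looks_double_encoded(text: str) -> bool:
    """Return True if reversing a Latin-1 misread yields a different string.

    Hypothetical helper: a heuristic only, it can miss other kinds of damage.
    """
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text contains characters outside Latin-1, or the bytes are not
        # valid UTF-8, so it is not this particular kind of mojibake.
        return False
    return repaired != text


if __name__ == "__main__":
    print(looks_double_encoded(original))
    print(looks_double_encoded(garbled))
    # Reversing the assumed misread recovers the original text.
    print(garbled.encode("latin-1").decode("utf-8") == original)
```

If the distorted JSON value passes this check, the damage is probably a double encoding somewhere in the export path rather than a problem in the source XML.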

phochste commented 1 year ago

This is a ticket for Catmandu::MARC. I created a fix in that project and pushed a new version, 1.29, of Catmandu::MARC to CPAN.

Can you install version 1.29 of Catmandu::MARC and try again? Please use the ticket system of that project for further feedback.
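
After re-running the conversion, one way to confirm the result is to parse the JSON and compare the subfield value with the text from the XML. The sketch below uses Python and assumes the field layout shown in this ticket (`{"670": {"subfields": [{"a": ...}]}}`); `subfield_values` is a hypothetical helper, and the inline `field_json` string stands in for a field read from the real output file.

```python
import json

# Field object in the layout shown earlier in this ticket (stand-in for real output).
field_json = (
    '{"670":{"ind2":" ","ind1":" ","subfields":'
    '[{"a":"Wyss, J.D. Šveices Robinsonu ģimene, 1996"}]}}'
)

# Expected text, copied from the MARC-XML subfield.
expected = "Wyss, J.D. Šveices Robinsonu ģimene, 1996"


def subfield_values(field: dict, tag: str, code: str) -> list:
    """Collect all values of subfield `code` from a single field object."""
    data = field.get(tag, {})
    return [sf[code] for sf in data.get("subfields", []) if code in sf]


field = json.loads(field_json)
values = subfield_values(field, "670", "a")

if __name__ == "__main__":
    # True when the exported value matches the XML text character for character.
    print(values == [expected])
```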