gbv / Catmandu-PICA

Catmandu modules for working with PICA+ data
https://metacpan.org/release/Catmandu-PICA

Invalid characters in xml #52

Closed Phu2 closed 6 years ago

Phu2 commented 6 years ago

I just came across a problem with control characters below 0x20, which are present in my original plain PICA files. Catmandu converts them to PICA XML without complaint, but the resulting XML files are invalid and cannot be parsed by XSL processors.

Example file (one record in plain PICA, field 233P contains the Unicode character 0x2): record.txt

Command:

$ catmandu convert -v PICA --type binary to PICA --type XML < record.txt > record.xml

The resulting XML cannot be transformed (e.g. to Solr XML) with xmlstarlet, xsltproc or Saxon HE. For now I delete all characters that the XML specification does not allow by using

tr -d '\000-\010\013\014\016-\037'

I wonder whether Catmandu should handle this problem and only generate valid XML. What do you think?
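
As a minimal sketch of the workaround described above, the tr filter can be wrapped in a small shell function. The function name strip_xml_invalid is hypothetical and not part of Catmandu or PICA::Data. The sketch assumes the filter is applied to the generated XML rather than to the input, because binary PICA appears to use control characters in this range as record, field and subfield separators, so stripping them beforehand would likely damage the records.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not a Catmandu command).
# Deletes the control characters below 0x20 that XML 1.0 does not allow,
# while keeping tab (\011), line feed (\012) and carriage return (\015).
# LC_ALL=C makes tr operate on single bytes regardless of the user's locale.
strip_xml_invalid() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Assumed usage, applied to the converter's output:
#   catmandu convert -v PICA --type binary to PICA --type XML < record.txt \
#     | strip_xml_invalid > record.xml
```

Since all deleted bytes are below 0x80, the filter should not affect multi-byte UTF-8 sequences, which only use bytes of 0x80 and above.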

jorol commented 6 years ago

... Document authors are encouraged to avoid "compatibility characters" ...

I always recommend sanitizing your data before processing it with Catmandu, or using Catmandu::Fixes for this.

PICA::Writer::XML uses simple 'print' statements to generate the XML document, so that we can handle really large XML files. I will check whether we could switch to an XML library like XML::Writer, which would catch these errors.

jorol commented 6 years ago

Fixed in https://github.com/gbv/PICA-Data/releases/tag/0.34

Install new version:

$ cpanm PICA::Data